From the original paper we know that the base model has about 65M parameters, while the big model has about 213M parameters.
The question is: how do we compute these numbers of parameters?
How to compute the number of parameters?
We can cut the transformer into 3 main parts: the encoder, the decoder, and the final linear layer.
The encoder and decoder are, in turn, built from the following components:
- Multi head Attention
- Layer Norm
- Feed Forward
The number of parameters can be estimated as follows:
- Multi Head Attention (MHA): a single head needs d_model * d_k * 3 parameters, so h heads need d_model * d_k * 3 * h (h is the number of heads; the factor 3 comes from the 3 linear operations at the bottom of the figure above). On top of that, the linear projection layer after the Concat of the heads adds d_model * d_model. As d_k * h = d_model, this simplifies to 4 * d_model * d_model. If we add bias, we have 4 * (d_model * d_model + d_model)
- Layer Norm: 2*d_model
- Feed Forward: the hidden layer has size d_ff, so the first fully connected layer maps d_model to d_ff and the second maps d_ff back to d_model: d_model * d_ff + d_ff * d_model = 2 * (d_model * d_ff). If we add bias, we have: d_model * d_ff + d_ff + d_ff * d_model + d_model = 2 * d_model * d_ff + d_model + d_ff
- Encoder: N * (MHA + Feed Forward + 2 * Layer Norm)
- Decoder: N * (2 * MHA + Feed Forward + 3 * Layer Norm)
- Linear: d_model * vocabulary_token_number + vocabulary_token_number
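The formulas above can be sketched in a few lines of Python. This is only a rough estimate, not the paper's own accounting: the hyperparameters (d_model, d_ff, N) are taken from the original paper's base and big configurations, the vocabulary is assumed to be exactly 37,000 tokens, each encoder and decoder layer is assumed to contain one feed-forward block, and embeddings are not counted. The function names are my own.

```python
def mha_params(d_model, bias=True):
    # Q, K, V projections plus the output projection: 4 square matrices.
    per_matrix = d_model * d_model + (d_model if bias else 0)
    return 4 * per_matrix


def ffn_params(d_model, d_ff, bias=True):
    # d_model -> d_ff, then d_ff -> d_model.
    weights = 2 * d_model * d_ff
    return weights + (d_model + d_ff if bias else 0)


def layer_norm_params(d_model):
    # One gain and one bias per feature.
    return 2 * d_model


def transformer_params(d_model, d_ff, n_layers, vocab_size):
    mha = mha_params(d_model)
    ffn = ffn_params(d_model, d_ff)
    ln = layer_norm_params(d_model)
    # Assumption: every layer has one feed-forward block.
    encoder = n_layers * (mha + ffn + 2 * ln)
    decoder = n_layers * (2 * mha + ffn + 3 * ln)
    linear = d_model * vocab_size + vocab_size
    return {
        "encoder": encoder,
        "decoder": decoder,
        "linear": linear,
        "total": encoder + decoder + linear,
    }


# Assumed configurations: (d_model, d_ff, N, vocabulary size).
base = transformer_params(512, 2048, 6, 37000)
big = transformer_params(1024, 4096, 6, 37000)

print(f"base total: {base['total']:,}")  # 63,119,496
print(f"big total:  {big['total']:,}")   # 214,282,376
```

Under these assumptions the estimate is about 63M for base and about 214M for big, against the paper's 65M and 213M.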
Based on the above, we have the following result:
The estimate in the table above does not exactly match the original paper’s Table 3, but the numbers are pretty close (original base: 65M, big: 213M). I believe this is a good estimate. Leave me a comment if any part has errors.
Transformer model parameter calculator
A calculator based on these observations, prepared by the author, can be found here. I set it to read-only, and I am not sure whether you can make a copy and edit it. If not, do let me know.
Some visualizations of the transformer
Above is a quick comparison of the transformer’s base and big models. We can see that for the big model, the share of the linear layer is smaller, as the encoder and decoder layers have more parameters. This may be one of the reasons that the big model performs better.
Because of the large vocabulary size (37K tokens), the linear part of the base transformer takes up about 1/3 of the total number of parameters. Another thing we can see is that most of the transformer’s computation takes the form of matrix multiplication.
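To put rough numbers on these shares, here is a small sketch that turns the estimated counts into fractions of the total. It makes the same assumptions as the estimate itself: biases included, one feed-forward block per layer, a vocabulary of exactly 37,000 tokens, and no embeddings counted. The helper names are hypothetical.

```python
def count_params(d_model, d_ff, n_layers, vocab_size):
    # Per-component estimates, biases included.
    mha = 4 * (d_model * d_model + d_model)
    ffn = 2 * d_model * d_ff + d_model + d_ff
    ln = 2 * d_model
    return {
        "encoder": n_layers * (mha + ffn + 2 * ln),
        "decoder": n_layers * (2 * mha + ffn + 3 * ln),
        "linear": d_model * vocab_size + vocab_size,
    }


def shares(parts):
    # Fraction of the total taken by each part.
    total = sum(parts.values())
    return {name: count / total for name, count in parts.items()}


# Assumed configurations: (d_model, d_ff, N, vocabulary size).
base_shares = shares(count_params(512, 2048, 6, 37000))
big_shares = shares(count_params(1024, 4096, 6, 37000))

for name in base_shares:
    print(f"{name:8s} base {base_shares[name]:6.1%}  big {big_shares[name]:6.1%}")
```

With these assumptions the linear layer comes out at roughly 30% of the base model and roughly 18% of the big model, which is consistent with the comparison above.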
Here is the big model’s distribution for reference.