Estimating the number of parameters in a Transformer
Nov 8, 2022
From the original paper we know that the base model has about 65M parameters, while the big model has 213M parameters.
The question is: how do we arrive at those parameter counts?
How to compute the number of parameters?
We can cut the transformer into 3 main parts:
- Encoder
- Decoder
- Linear
The encoder and decoder are in turn built from the following components:
- Multi-Head Attention
- Layer Norm
- Feed Forward
The number of parameters can be estimated as follows [2]:
- Multi-Head Attention (MHA): a single head has d_model * d_k * 3 parameters, so h heads have d_model * d_k * 3 * h (h is the number of heads; the 3 comes from the three linear operations at the bottom of the above figure), plus d_model * d_model for the top linear projection layer applied after the Concat of the heads. As d_k * h = d_model, this simplifies to 4 * d_model * d_model. If we add bias, we…
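As a quick sanity check of the weight-only MHA estimate above, here is a minimal Python sketch. The function name `mha_weight_params` is made up for illustration, and the settings d_model = 512, h = 8 (base) and d_model = 1024, h = 16 (big) are taken from the original paper rather than from this post. Bias terms are deliberately left out.

```python
def mha_weight_params(d_model: int, h: int) -> int:
    """Weight-only parameter count of one multi-head attention block (no bias)."""
    # Assumption: d_model is split evenly across the heads, so d_k * h == d_model.
    assert d_model % h == 0, "d_model must be divisible by the number of heads"
    d_k = d_model // h

    # Three linear projections per head, for h heads.
    qkv = d_model * d_k * 3 * h
    # Output projection applied after concatenating the heads.
    out = d_model * d_model
    return qkv + out


base = mha_weight_params(d_model=512, h=8)
big = mha_weight_params(d_model=1024, h=16)

print(base)  # 1048576, i.e. 4 * 512 * 512
print(big)   # 4194304, i.e. 4 * 1024 * 1024
```

Both results match the simplified form 4 * d_model * d_model, so the number of heads drops out of the estimate once d_k * h = d_model holds.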