Transformer number of parameters estimation

Jimmy (xiaoke) Shen
Nov 8, 2022

From the original paper we know that the base model has about 65M parameters, while the big model has 213M parameters.

(Table of model parameter counts, from [1].)

The question is: how do we compute these parameter counts?
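As a sanity check before working through the components, the headline numbers can be roughly reproduced from the hyper-parameters published in the paper (d_model, d_ff, number of layers, number of heads). This is a hedged sketch: the shared BPE vocabulary size of roughly 37,000 tokens is an assumption taken from the paper's text, and the function name is mine.

```python
# Hedged sketch: reproduce the headline parameter counts of the base and big
# Transformer models from their published hyper-parameters.
# Assumption: a shared source/target vocabulary of ~37,000 BPE tokens, with the
# embedding matrix shared with the output projection (as stated in the paper).

def transformer_params(d_model: int, d_ff: int, n_layers: int, vocab: int) -> int:
    mha = 4 * d_model * d_model + 4 * d_model   # Q/K/V/output weights + biases
    ffn = 2 * d_model * d_ff + d_ff + d_model   # two linear layers + biases
    ln = 2 * d_model                            # gain + bias per LayerNorm
    enc_layer = mha + ffn + 2 * ln              # self-attention, FFN, 2 norms
    dec_layer = 2 * mha + ffn + 3 * ln          # + cross-attention, 3 norms
    emb = vocab * d_model                       # shared with the output projection
    return n_layers * (enc_layer + dec_layer) + emb

print(transformer_params(512, 2048, 6, 37_000))    # ~63M  (paper reports ~65M)
print(transformer_params(1024, 4096, 6, 37_000))   # ~214M (paper reports 213M)
```

The small gap to the published 65M/213M comes from the approximate vocabulary size and from minor bookkeeping choices (e.g. whether the final LayerNorm and biases are counted).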

How to compute the number of parameters?

(Figure: the Transformer architecture.)

We can split the Transformer into 3 main parts:

  • Encoder
  • Decoder
  • Linear (the output projection layer)

Both the encoder and the decoder are built from these components:

  • Multi-Head Attention
  • Layer Norm
  • Feed-Forward

The number of parameters of each component can be estimated as follows [2]:

  • Multi-Head Attention (MHA): the Q, K, and V projections contribute d_model * d_k * 3 * h parameters (h is the number of heads; these are the 3 linear operations at the bottom of the figure above), plus d_model * d_model for the top linear projection applied after the Concat of the heads. Since d_k * h = d_model, this simplifies to 4 * d_model * d_model. If we add bias, we…
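The MHA count above can be sketched in a few lines of Python, assuming d_k * h == d_model as in the paper (the function name is mine):

```python
# Sketch: parameter count of multi-head attention, assuming d_k * h == d_model.
def mha_params(d_model: int, h: int, bias: bool = False) -> int:
    d_k = d_model // h
    qkv = 3 * h * d_model * d_k         # Q, K, V projections: d_model x d_k, 3 per head
    out = d_model * d_model             # top projection after Concat of the h heads
    total = qkv + out                   # simplifies to 4 * d_model * d_model
    if bias:
        total += 3 * h * d_k + d_model  # one bias term per projection output dimension
    return total

# Transformer-base settings: d_model = 512, h = 8.
print(mha_params(512, 8))               # 1,048,576 == 4 * 512 * 512
```

With bias the count grows by 4 * d_model (one bias vector per projection), which is negligible next to the weight matrices.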