Transformer number of parameters estimation

Jimmy (xiaoke) Shen

3 min read · Nov 8, 2022

From the original paper, we know that the base model has about 65M parameters, while the big model has about 213M parameters.

From [1]

The question is: how are those parameter counts computed?

How to compute the number of parameters?

Transformer architecture

We can cut the transformer into 3 main parts:

Encoder
Decoder
Linear

The encoder and decoder are each built from the following components:

Multi-Head Attention
Layer Norm
Feed Forward

The number of parameters can be estimated as follows [2]:

Multi-Head Attention (MHA): each head contributes d_model * d_k * 3 weights, so h heads give d_model * d_k * 3 * h (h is the number of heads; the 3 comes from the three linear operations at the bottom of the figure above) + d_model * d_model (the top linear projection layer after the Concat of the multiple heads). As d_k * h = d_model, this simplifies to 4 * d_model * d_model. If add bias, we…
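The MHA count above can be checked with a few lines of Python. This is only a minimal sketch: the function name mha_params is hypothetical, biases are left out, and the base and big hyperparameters (d_model = 512 with 8 heads, d_model = 1024 with 16 heads) are assumed from the original paper rather than stated in this post.

```python
def mha_params(d_model: int, num_heads: int) -> int:
    """Weight count of one multi-head attention block, biases excluded."""
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_k = d_model // num_heads  # per-head dimension, so d_k * num_heads == d_model
    qkv = d_model * d_k * 3 * num_heads  # the three bottom linear layers, all heads
    out = d_model * d_model              # top linear projection after Concat
    return qkv + out


# Hyperparameters below are an assumption taken from the original paper.
for name, d_model, h in [("base", 512, 8), ("big", 1024, 16)]:
    total = mha_params(d_model, h)
    # The long form should match the simplified 4 * d_model * d_model.
    assert total == 4 * d_model * d_model
    print(f"{name}: {total:,} weights per MHA block")
```

Under these assumptions, one MHA block holds 1,048,576 weights in the base model and 4,194,304 in the big model.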

