Awesome Transformer Tutorials and Blogs


MUST SEE. It is so clearly explained that it can solve most of your problems.


Nice code from here by Tae-Hwan Jung. It references the MUST SEE blog above.


Relationship of dot product to matrix multiplication.
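A minimal sketch of this relationship (my own illustration, not from the linked post): each entry of a matrix product is itself a dot product, which is why attention can compute all query-key similarities with a single matrix multiplication.

```python
import numpy as np

# Entry (i, j) of A @ B is the dot product of row i of A with column j of B.
A = np.array([[1., 2.], [3., 4.]])
B = np.array([[5., 6.], [7., 8.]])

C = A @ B
# Rebuild C entry by entry from individual dot products.
C_manual = np.array([[np.dot(A[i], B[:, j]) for j in range(B.shape[1])]
                     for i in range(A.shape[0])])
assert np.allclose(C, C_manual)

# In attention, Q @ K.T takes the dot product of every query row with
# every key row in one shot, producing the full score matrix.
```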

Why scaled dot product?
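A quick numerical sketch of the usual argument (assumption: this mirrors the reasoning in the original paper): for query and key vectors with i.i.d. zero-mean, unit-variance entries, the dot product has variance d_k, so raw scores grow with dimension and push the softmax into saturation; dividing by sqrt(d_k) brings the scores back to unit scale.

```python
import numpy as np

# Simulate many independent query/key dot products at d_k = 512.
rng = np.random.default_rng(0)
d_k = 512
q = rng.standard_normal((10000, d_k))
k = rng.standard_normal((10000, d_k))
scores = (q * k).sum(axis=1)  # 10000 independent dot products

# Raw scores have standard deviation ~ sqrt(d_k) ~ 22.6;
# after scaling by sqrt(d_k), it is ~ 1.
print(scores.std())
print((scores / np.sqrt(d_k)).std())
```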

The number of parameters for each layer?

This is an open question. One interesting observation: the last layer of the decoder is a linear projection with d_model × vocabulary_size parameters, since we must predict a probability for each token via a softmax layer. The paper says that for some tasks d_model is 1024 and a 32K-token vocabulary is used, so this final softmax layer alone has roughly 1024 × 32K ≈ 33M parameters, which is quite large. I am not sure whether my understanding is correct; I will verify this later.
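The arithmetic above can be checked directly (assumption: a vocabulary of exactly 32,000 tokens; the real BPE/word-piece vocabulary size varies by task):

```python
# Back-of-the-envelope parameter count for the decoder's final
# projection (weight matrix only, bias ignored).
d_model = 1024
vocab_size = 32_000  # assumed; the paper's vocabularies vary by task

params = d_model * vocab_size
print(f"{params:,}")  # 32,768,000 -> roughly 33M parameters
```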



Jimmy Shen

Data Scientist/MLE/SWE @takemobi