Awesome Transformer Tutorials and Blogs


MUST SEE. It is so clearly explained that it can solve most of your problems.


Nice code from here by Tae-Hwan Jung. It references the MUST SEE blog above.


Relationship of dot product to matrix multiplication.
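A minimal sketch of this relationship (my own illustration, not from the linked post): each entry of a matrix product is itself a dot product, which is why attention can compute all query-key similarities with a single matrix multiplication.

```python
import numpy as np

# Entry (i, j) of A @ B is the dot product of row i of A with column j of B.
A = np.array([[1., 2.], [3., 4.]])
B = np.array([[5., 6.], [7., 8.]])

C = A @ B
# Rebuild C entry by entry from individual dot products.
C_manual = np.array([[np.dot(A[i], B[:, j]) for j in range(B.shape[1])]
                     for i in range(A.shape[0])])
assert np.allclose(C, C_manual)

# In attention, Q @ K.T takes the dot product of every query row with
# every key row in one shot, producing the full score matrix.
```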

Why scaled dot product?
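A quick numerical sketch of the usual argument (assumption: this mirrors the reasoning in the original paper): for query and key vectors with i.i.d. zero-mean, unit-variance entries, the dot product has variance d_k, so raw scores grow with dimension and push the softmax into saturation; dividing by sqrt(d_k) brings the scores back to unit scale.

```python
import numpy as np

# Simulate many independent query/key dot products at d_k = 512.
rng = np.random.default_rng(0)
d_k = 512
q = rng.standard_normal((10000, d_k))
k = rng.standard_normal((10000, d_k))
scores = (q * k).sum(axis=1)  # 10000 independent dot products

# Raw scores have standard deviation ~ sqrt(d_k) ~ 22.6;
# after scaling by sqrt(d_k), it is ~ 1.
print(scores.std())
print((scores / np.sqrt(d_k)).std())
```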

The number of parameters for each layer?

This is an open question. One interesting observation: the last layer of the decoder is a linear projection with d_model × vocabulary_size parameters, since we must predict a probability for each token via a softmax layer. The paper says that for some tasks d_model is 1024 and a 32K-token vocabulary is used, so this final softmax layer alone has roughly 1024 × 32K ≈ 33M parameters, which is quite large. I am not sure whether my understanding is correct; I will verify this later.
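The arithmetic above can be checked directly (assumption: a vocabulary of exactly 32,000 tokens; the real BPE/word-piece vocabulary size varies by task):

```python
# Back-of-the-envelope parameter count for the decoder's final
# projection (weight matrix only, bias ignored).
d_model = 1024
vocab_size = 32_000  # assumed; the paper's vocabularies vary by task

params = d_model * vocab_size
print(f"{params:,}")  # 32,768,000 -> roughly 33M parameters
```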



Jimmy Shen

Data Scientist/MLE/SWE @takemobi