Awesome Transformer Tutorials and Blogs
In this article, I do not summarize what the Transformer is. I only list the materials that helped me, because they are genuinely clear.
Articles and blogs about attention
Visualizing A Neural Machine Translation Model
Attention and Augmented Recurrent Neural Networks
Blog
MUST SEE. It is explained so clearly that it can answer most of your questions.
Code
Nice code from here, by Tae-Hwan Jung. It references the MUST SEE blog above.
Others
Relationship of dot product to matrix multiplication.
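A minimal NumPy sketch of that relationship (the variable names here are my own, purely illustrative): every entry of Q @ K.T is the dot product of one query row with one key row, so all pairwise attention scores come from a single matrix multiplication.

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))  # 3 queries, each of dimension d_k = 4
K = rng.standard_normal((5, 4))  # 5 keys, same dimension

scores = Q @ K.T  # shape (3, 5): all query-key dot products at once

# Entry (i, j) is exactly the plain dot product of query i with key j.
assert np.allclose(scores[1, 2], np.dot(Q[1], K[2]))
```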
Why scaled dot-product attention?
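A small sketch of the standard argument (my own code, assuming unit-variance query/key components as in the paper's footnote): the dot product of two random d_k-dimensional vectors has variance d_k, so without dividing by sqrt(d_k) the softmax saturates as d_k grows and gradients vanish.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
for d_k in (4, 64, 1024):
    q = rng.standard_normal(d_k)
    K = rng.standard_normal((8, d_k))  # 8 keys
    raw = K @ q                        # unscaled scores, variance ~ d_k
    scaled = raw / np.sqrt(d_k)        # variance ~ 1, independent of d_k
    print(d_k, softmax(raw).max().round(3), softmax(scaled).max().round(3))

# The unscaled softmax tends toward a one-hot distribution (near-zero
# gradients) as d_k grows; the scaled version stays smooth.
```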
The number of parameters in each layer?
This is an open question. One pretty interesting point: the last layer of the decoder is a linear projection from d_model to the vocabulary size, so its weight matrix has d_model × vocabulary-size entries, since we predict a probability for each token through a softmax layer. The paper says that for some tasks d_model is 1024 and a 32K-token vocabulary is used, so this final softmax layer alone has roughly 1024 × 32K ≈ 32M parameters. Not sure whether my understanding is correct; I will verify this later. See the sketch below for a quick check of the arithmetic.
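A quick PyTorch check of that arithmetic (just a sketch: it assumes the final projection is a plain nn.Linear with the quoted d_model = 1024 and a 32K vocabulary):

```python
import torch.nn as nn

d_model, vocab_size = 1024, 32_000  # values quoted from the paper

proj = nn.Linear(d_model, vocab_size)  # final pre-softmax projection
n_params = sum(p.numel() for p in proj.parameters())
print(n_params)  # 32_800_000 = 1024 * 32_000 weights + 32_000 biases, ~32M
```

Note that the paper (Section 3.4) shares this pre-softmax weight matrix with the token embeddings, so it is reused rather than added as extra parameters.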
In case you know Mandarin, some good materials can be found here:
Video by Mu Li from AWS
Video by Professor Hung-yi Lee
Videos by DASOU, including an explanation of Tae-Hwan Jung's code.
Video by 霹雳吧啦Wz
Nice video by 就是不吃草的羊