Awesome Transformer Tutorials and Blogs
In this article I am not summarizing what the Transformer is. I am only listing the materials that helped me, as they are really clear.
MUST SEE. It is explained so clearly that it can solve most of your problems.
Nice code from here by Tae-Hwan Jung. It references the MUST SEE blog.
The number of parameters for each layer?
This is an open question. One thing that is pretty interesting: the last layer of the decoder is a linear projection with an output size of the vocabulary size, since we predict the probability for each token with a softmax. From the paper, for some tasks d_model is 1024 and a 32K-token vocabulary is used, so this last projection alone has about 1024 × 32,000 ≈ 32.8M parameters, which is pretty large. Not sure whether my understanding is correct or not. Will verify this later.
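The parameter count above can be checked with a few lines of arithmetic. This is just a sketch: the values of d_model and the vocabulary size are taken from the text (the paper's "big" setting with a 32K vocabulary), and whether a bias term is included depends on the implementation.

```python
# Sketch: count the parameters of the final decoder projection layer.
d_model = 1024       # hidden size for the "big" model, per the text above
vocab_size = 32000   # assumed 32K-token vocabulary

# The output projection maps d_model -> vocab_size, so its weight matrix
# has d_model * vocab_size entries (plus vocab_size bias terms, if used).
weight_params = d_model * vocab_size
total_with_bias = weight_params + vocab_size

print(f"weight only:    {weight_params:,}")    # prints "weight only:    32,768,000"
print(f"with bias term: {total_with_bias:,}")  # prints "with bias term: 32,800,000"
```

So the weight matrix alone is roughly 32.8M parameters; the bias adds only another 32K, which is negligible by comparison.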
In case you know Mandarin, some good materials can be found here:
Video by Mu Li from AWS
Video by Professor Hongyi Li
Video by 霹雳吧啦Wz
Nice video by 就是不吃草的羊