Awesome Transformer Tutorials and Blogs

Jimmy (xiaoke) Shen
2 min read · Mar 14, 2022


In this article, I am not going to summarize what the Transformer is. I am only listing the materials that helped me, since they are already very clear.

Articles and blogs about attention

Visualizing A Neural Machine Translation Model

Attention and Augmented Recurrent Neural Networks

Blog

MUST SEE. It is explained so clearly that it can answer most of your questions.

Code

Nice code from here by Tae-Hwan Jung. It references the MUST SEE blog above.

Nice code from scratch

Course from Hugging Face

Others

Relationship of the dot product to matrix multiplication (a small sketch follows).
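
To make the connection concrete, here is a minimal NumPy sketch I wrote (not from the linked material): every entry of a matrix product is the dot product of one row of the left matrix with one column of the right matrix.

import numpy as np

# Every entry of C = A @ B is a dot product:
# C[i, j] is the dot product of row i of A and column j of B.
A = np.random.randn(2, 4)
B = np.random.randn(4, 3)

C = A @ B  # matrix multiplication

# Rebuild C entry by entry from dot products.
C_manual = np.array([[np.dot(A[i], B[:, j]) for j in range(B.shape[1])]
                     for i in range(A.shape[0])])

assert np.allclose(C, C_manual)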

Why scaled dot-product attention?
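
As a quick sanity check of the scaling argument from the paper, here is a small simulation I wrote (a sketch, not from the linked material): for query and key vectors with i.i.d. zero-mean, unit-variance components, the raw dot product has variance d_k, so its magnitude grows with sqrt(d_k) and pushes the softmax into saturated regions with tiny gradients; dividing by sqrt(d_k) brings the variance back to roughly 1.

import numpy as np

# Variance of q . k grows linearly with d_k; dividing by sqrt(d_k)
# keeps it near 1, which is the motivation for the scaling factor.
for d_k in [16, 64, 256, 1024]:
    q = np.random.randn(10000, d_k)
    k = np.random.randn(10000, d_k)
    scores = (q * k).sum(axis=1)  # 10000 dot products of dimension d_k
    print(d_k, scores.var(), (scores / np.sqrt(d_k)).var())
    # unscaled variance ~ d_k, scaled variance ~ 1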

The number of parameters for each layer?

This is an open question. One interesting point: the last layer of the decoder is a linear projection from d_model to the vocabulary size, followed by a softmax, since we need to predict a probability for each token. The paper reports d_model = 1024 and a vocabulary of about 32K tokens for some tasks, so this final layer alone has roughly 1024 * 32K ≈ 32M parameters, which is quite large. I am not sure whether my understanding is correct; I will verify this later.
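
Here is a minimal PyTorch sketch of that count, assuming the layer is a plain linear projection with no bias and the numbers quoted above (d_model = 1024, 32K vocabulary):

import torch.nn as nn

# Final decoder projection: a linear map from d_model to the vocabulary
# size, whose logits are fed into a softmax.
d_model, vocab_size = 1024, 32_000  # numbers quoted in the text above
proj = nn.Linear(d_model, vocab_size, bias=False)
num_params = sum(p.numel() for p in proj.parameters())
print(num_params)  # 32768000, i.e. roughly 32.8M

One caveat: the paper shares this pre-softmax projection matrix with the embedding layers, so these parameters are not counted twice in the model total.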

In case you know Mandarin, some good materials can be found here:

Video by Mu Li from AWS

Video by Professor Hung-yi Lee

Videos by DASOU, including an explanation of Tae-Hwan Jung’s code.

Video by 霹雳吧啦Wz

Nice video by 就是不吃草的羊

This post is all you need
