Transformer’s Scaled Dot-Product Attention

Scaled dot-product attention, from [1]

The point of the scaling factor is to keep the inputs to the softmax small: when the logits are large, the softmax puts almost all of its mass on the largest value and the gradients become tiny, which is why [1] divides the dot products by √d_k. The toy example below illustrates the effect:

  • original logits: [1, 2, 3, …, 9]
  • scaled by 10 (i.e., divided by 10): [0.1, 0.2, 0.3, …, 0.9]
  • no scale: [1, 2, 3, …, 9]

After the softmax, the unscaled row is sharply peaked at the largest logit, while the scaled row stays much closer to uniform.

Code used for the visualization:

import torch
from torch import nn
from matplotlib import pyplot as plt

# Softmax over each row (dim=1), so every row sums to 1.
f = nn.Softmax(dim=1)

# Row 0: unscaled logits [1, ..., 9]; row 1: the same logits divided by 10.
x = torch.tensor([[1, 2, 3, 4, 5, 6, 7, 8, 9],
                  [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]], dtype=torch.float)
y = f(x)

# The unscaled row is sharply peaked at the largest logit,
# while the scaled row stays close to uniform.
plt.scatter(range(x.shape[1]), y[0], c="r", label="no scale")
plt.scatter(range(x.shape[1]), y[1], c="g", label="scale by 10")
plt.legend()
plt.show()
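
For completeness, the operation this scaling lives inside is the scaled dot-product attention of [1]: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. Below is a minimal PyTorch sketch of that formula, not code from the original post; the function name, tensor shapes, and example inputs are illustrative assumptions.

import math
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k: (batch, seq_len, d_k); v: (batch, seq_len, d_v)
    d_k = q.size(-1)
    # Dot products between every query and every key, divided by sqrt(d_k)
    # so the softmax does not saturate when d_k is large.
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)  # attention weights, each row sums to 1
    return weights @ v

# Tiny example with made-up shapes.
q = torch.randn(2, 5, 64)
k = torch.randn(2, 5, 64)
v = torch.randn(2, 5, 64)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 5, 64])

Real implementations usually also take an optional mask (for padding or causal attention) and apply dropout to the weights, but the division by √d_k here is exactly the scaling step demonstrated above.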

Reference

[1] Ashish Vaswani et al., "Attention Is All You Need," Advances in Neural Information Processing Systems (NeurIPS), 2017.
