Transformer’s Scaled Dot-Product Attention

In the Transformer paper [1], the authors propose Scaled Dot-Product Attention:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V — Scaled Dot-Product Attention, from [1]
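The operation can be sketched in a few lines of PyTorch (a minimal version for illustration, ignoring masking and the multi-head projections):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k) tensors
    d_k = q.shape[-1]
    # raw dot-product scores, scaled by 1/sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    # softmax over the key dimension turns scores into attention weights
    weights = torch.softmax(scores, dim=-1)
    # weighted sum of the values
    return weights @ v

q = k = v = torch.randn(1, 4, 8)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 4, 8])
```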

It claims that "for large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by 1/√d_k."

Why does dividing by a constant keep the softmax output away from the extreme values? Let's run a quick experiment:

  • original values:
[1, 2, 3, …, 9]
  • scaled by 10 (i.e., divided by 10), we have
[0.1, 0.2, 0.3, …, 0.9]
  • no scaling, we have
[1, 2, 3, …, 9]
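Concretely, comparing the softmax of the two rows (the same numbers the plotting code below uses):

```python
import torch

x = torch.arange(1., 10.)               # [1, 2, ..., 9]
p_unscaled = torch.softmax(x, dim=0)
p_scaled = torch.softmax(x / 10, dim=0)  # softmax of [0.1, ..., 0.9]
print(p_unscaled.max().item())  # ~0.63: one position starts to dominate
print(p_scaled.max().item())    # ~0.16: a much flatter distribution
```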

Softmax outputs for the scaled and unscaled inputs (plotted by the code below)

Indeed, scaling works. The reason lies in the exponential function:

e⁹ / e⁸ = e ≈ 2.72, which is much larger than e^{0.9} / e^{0.8} = e^{0.1} ≈ 1.11

so without scaling, the largest input dominates the softmax output.
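The paper's motivation is about gradients: when the softmax saturates, its gradients become extremely small. This can be checked directly with autograd (an illustrative sketch; the logits [10, 20, …, 90] are unscaled values large enough to saturate the softmax):

```python
import torch

def max_prob_grad(logits):
    # gradient of the largest softmax probability w.r.t. the logits
    x = logits.clone().requires_grad_(True)
    torch.softmax(x, dim=0)[-1].backward()
    return x.grad.abs().max().item()

unscaled = torch.arange(10., 91., 10.)   # [10, 20, ..., 90]
print(max_prob_grad(unscaled))       # ~4.5e-5: softmax saturated, gradient nearly vanishes
print(max_prob_grad(unscaled / 10))  # ~0.23: scaled logits keep the gradient usable
```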

Code used for the visualization:

import torch
from torch import nn
from matplotlib import pyplot as plt

f = nn.Softmax(dim=1)
# row 0: unscaled values [1..9]; row 1: the same values divided by 10
x = torch.tensor([[1, 2, 3, 4, 5, 6, 7, 8, 9],
                  [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]], dtype=torch.float)
y = f(x)  # softmax along each row
plt.scatter(range(x.shape[1]), y[0], c="r", label="no scale")
plt.scatter(range(x.shape[1]), y[1], c="g", label="scale by 10")
plt.legend()
plt.show()

Reference

[1] Vaswani, A., et al. "Attention Is All You Need." NeurIPS 2017. https://arxiv.org/pdf/1706.03762.pdf
