Universal Transformers
This post discusses the Universal Transformer, which combines the original Transformer model with a technique called Adaptive Computation Time (ACT). The main innovation of the Universal Transformer is to tie the weights of the Transformer layers across depth and apply the same layer a variable number of times, so that each symbol's representation can be refined for a different number of steps.
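To make the "variable number of times per symbol" idea concrete, here is a minimal numpy sketch of ACT-style halting. It is a simplification (real ACT forms a weighted average of intermediate states using the halting probabilities; here a halted symbol simply keeps its last state), and `transition` and `halt_prob` are hypothetical stand-ins for the shared Transformer layer and the learned halting unit:

```python
import numpy as np

def adaptive_depth(x, transition, halt_prob, max_steps=8, threshold=0.99):
    """Toy sketch of per-symbol adaptive depth (ACT-style halting).

    x:          (seq_len, d_model) symbol representations
    transition: the shared layer applied repeatedly (hypothetical)
    halt_prob:  maps per-symbol states to halting probabilities in [0, 1]
    """
    seq_len, _ = x.shape
    cumulative = np.zeros(seq_len)            # accumulated halting probability
    steps_taken = np.zeros(seq_len, dtype=int)
    state = x.copy()
    for _ in range(max_steps):
        running = cumulative < threshold      # symbols that have not halted yet
        if not running.any():
            break
        new_state = transition(state)
        state[running] = new_state[running]   # halted symbols keep their state
        cumulative[running] += halt_prob(state[running])
        steps_taken[running] += 1
    return state, steps_taken
```

Symbols whose halting probability accumulates quickly stop early, while "harder" symbols keep being transformed until `max_steps` is reached.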
Paper Reference
Dehghani M, Gouws S, Vinyals O, Uszkoreit J, Kaiser Ł. Universal transformers. ICLR 2019.
Background and Transformer Review
If you are not already familiar with the Transformer model, you should read through “The Transformer: Attention is All You Need.” The Universal Transformer is a simple modification of the Transformer, so it is important to understand the Transformer model first.
If you’re already familiar with the Transformer model and would like a quick review, here goes:
The base Transformer consists of an encoder and a decoder:
Encoder:
- 6 encoder layers
- Each encoder layer has 2 sub-layers: (1) multi-head self-attention; (2) feed-forward
Decoder:
- 6 decoder layers
- Each decoder layer has 3 sub-layers: (1) masked multi-head self-attention; (2) encoder-decoder multi-head attention; (3) feed-forward
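The layer structure above can be sketched in a few lines of numpy. Note that in the base Transformer each of the 6 layers has its own parameters (the Universal Transformer is the one that shares them); this sketch reuses the same `self_attn` and `feed_forward` callables across layers purely for brevity, and those callables are hypothetical placeholders:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's vector to zero mean, unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, self_attn, feed_forward):
    # sub-layer 1: multi-head self-attention, with residual + layer norm
    x = layer_norm(x + self_attn(x))
    # sub-layer 2: position-wise feed-forward, with residual + layer norm
    return layer_norm(x + feed_forward(x))

def encoder(x, self_attn, feed_forward, num_layers=6):
    """Stack of encoder layers, as in the base Transformer."""
    for _ in range(num_layers):
        x = encoder_layer(x, self_attn, feed_forward)
    return x
```

The decoder stack is analogous, with the masked self-attention and encoder-decoder attention sub-layers inserted before the feed-forward sub-layer.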
Here’s a one-figure review of multi-head attention, one of the Transformer model’s key innovations:
Multi-head attention is used in three places: encoder self-attention (which takes as input the previous encoder layer's output), decoder self-attention (which takes as input the previous decoder layer's output), and encoder-decoder attention (which uses the final encoder output for the keys and values and the previous decoder output for the queries). In the figure above, the parts of the model where multi-head attention is used are boxed in red on the left. On the right, the dimensions of the tensors at each part of the multi-head…
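As a companion to the figure, here is a minimal numpy sketch of multi-head attention itself. The projection matrices `Wq`, `Wk`, `Wv`, `Wo` are passed in as assumed inputs (in a real model they are learned parameters), and no masking is applied:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(queries, keys, values, num_heads, Wq, Wk, Wv, Wo):
    """Sketch of multi-head attention.

    queries: (n_q, d_model); keys, values: (n_kv, d_model)
    Wq, Wk, Wv, Wo: (d_model, d_model) projection matrices (assumed given)
    """
    d_model = queries.shape[-1]
    d_head = d_model // num_heads

    def split_heads(x):  # (n, d_model) -> (num_heads, n, d_head)
        return x.reshape(x.shape[0], num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(queries @ Wq)
    k = split_heads(keys @ Wk)
    v = split_heads(values @ Wv)

    # scaled dot-product attention, computed independently per head
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                     # (num_heads, n_q, d_head)

    # concatenate the heads and apply the output projection
    concat = heads.transpose(1, 0, 2).reshape(queries.shape[0], d_model)
    return concat @ Wo
```

For self-attention, `queries`, `keys`, and `values` are all the same tensor; for encoder-decoder attention, the keys and values come from the encoder output while the queries come from the decoder.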