Universal Transformers
This post discusses the Universal Transformer, which combines the original Transformer model with a technique called Adaptive Computation Time (ACT). The main innovation of the Universal Transformer is to tie the weights of the Transformer layers across depth and apply the same layer a variable number of times, so that each symbol's representation can be refined for a different number of steps.
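To make the "variable number of times per symbol" idea concrete, here is a minimal numpy sketch of ACT-style halting. It is a simplification (real ACT forms a weighted average of intermediate states using the halting probabilities; here a halted symbol simply keeps its last state), and `transition` and `halt_prob` are hypothetical stand-ins for the shared Transformer layer and the learned halting unit:

```python
import numpy as np

def adaptive_depth(x, transition, halt_prob, max_steps=8, threshold=0.99):
    """Toy sketch of per-symbol adaptive depth (ACT-style halting).

    x:          (seq_len, d_model) symbol representations
    transition: the shared layer applied repeatedly (hypothetical)
    halt_prob:  maps per-symbol states to halting probabilities in [0, 1]
    """
    seq_len, _ = x.shape
    cumulative = np.zeros(seq_len)            # accumulated halting probability
    steps_taken = np.zeros(seq_len, dtype=int)
    state = x.copy()
    for _ in range(max_steps):
        running = cumulative < threshold      # symbols that have not halted yet
        if not running.any():
            break
        new_state = transition(state)
        state[running] = new_state[running]   # halted symbols keep their state
        cumulative[running] += halt_prob(state[running])
        steps_taken[running] += 1
    return state, steps_taken
```

Symbols whose halting probability accumulates quickly stop early, while "harder" symbols keep being transformed until `max_steps` is reached.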
Paper Reference
Dehghani M, Gouws S, Vinyals O, Uszkoreit J, Kaiser Ł. Universal transformers. ICLR 2019.
Background and Transformer Review
If you are not already familiar with the Transformer model, you should read through “The Transformer: Attention is All You Need.” The Universal Transformer is a simple modification of the Transformer, so it is important to understand the Transformer model first.
If you’re already familiar with the Transformer model and would like a quick review, here goes:
The base Transformer consists of an encoder and a decoder:
Encoder:
- 6 encoder layers
- Each encoder layer has 2 sub-layers: (1) multi-head self-attention; (2) feed-forward
Decoder:
- 6 decoder layers
- Each decoder layer has 3 sub-layers: (1) masked multi-head self-attention; (2) encoder-decoder multi-head attention; (3) feed-forward
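The layer structure above can be sketched in a few lines of numpy. Note that in the base Transformer each of the 6 layers has its own parameters (the Universal Transformer is the one that shares them); this sketch reuses the same `self_attn` and `feed_forward` callables across layers purely for brevity, and those callables are hypothetical placeholders:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's vector to zero mean, unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def encoder_layer(x, self_attn, feed_forward):
    # sub-layer 1: multi-head self-attention, with residual + layer norm
    x = layer_norm(x + self_attn(x))
    # sub-layer 2: position-wise feed-forward, with residual + layer norm
    return layer_norm(x + feed_forward(x))

def encoder(x, self_attn, feed_forward, num_layers=6):
    """Stack of encoder layers, as in the base Transformer."""
    for _ in range(num_layers):
        x = encoder_layer(x, self_attn, feed_forward)
    return x
```

The decoder stack is analogous, with the masked self-attention and encoder-decoder attention sub-layers inserted before the feed-forward sub-layer.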
Here’s a one-figure review of multi-head attention, one of the Transformer model’s key innovations:
Multi-head attention is used in three places: encoder self-attention (which takes as input the previous encoder layer's output), decoder self-attention (which takes as input the previous decoder layer's output), and encoder-decoder attention (which uses the final encoder output for the keys and values and the previous decoder output for the queries). In the figure above, the parts of the model where multi-head attention is used are boxed in red on the left. On the right, the dimensions of the tensors at each part of the multi-head…
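As a companion to the figure, here is a minimal numpy sketch of multi-head attention itself. The projection matrices `Wq`, `Wk`, `Wv`, `Wo` are passed in as assumed inputs (in a real model they are learned parameters), and no masking is applied:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(queries, keys, values, num_heads, Wq, Wk, Wv, Wo):
    """Sketch of multi-head attention.

    queries: (n_q, d_model); keys, values: (n_kv, d_model)
    Wq, Wk, Wv, Wo: (d_model, d_model) projection matrices (assumed given)
    """
    d_model = queries.shape[-1]
    d_head = d_model // num_heads

    def split_heads(x):  # (n, d_model) -> (num_heads, n, d_head)
        return x.reshape(x.shape[0], num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(queries @ Wq)
    k = split_heads(keys @ Wk)
    v = split_heads(values @ Wv)

    # scaled dot-product attention, computed independently per head
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                     # (num_heads, n_q, d_head)

    # concatenate the heads and apply the output projection
    concat = heads.transpose(1, 0, 2).reshape(queries.shape[0], d_model)
    return concat @ Wo
```

For self-attention, `queries`, `keys`, and `values` are all the same tensor; for encoder-decoder attention, the keys and values come from the encoder output while the queries come from the decoder.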