Universal Transformers

Rachel Draelos, MD, PhD
Towards Data Science
10 min read · Sep 7, 2019


This post discusses the Universal Transformer, which combines the original Transformer model with a technique called Adaptive Computation Time (ACT). The main innovation of the Universal Transformer is to apply the Transformer components a variable number of times to each symbol, rather than a fixed number of times for every symbol.
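To make the per-symbol idea concrete, here is a minimal sketch of ACT-style dynamic halting, in which each symbol accumulates a halting probability across steps and stops being refined once that total crosses a threshold. The `transformer_step` and `halting_unit` functions below are made-up placeholders, and the real ACT mechanism additionally weights intermediate states by their halting probabilities, which is omitted here.

```python
import numpy as np

def transformer_step(states):
    """Placeholder for one Universal Transformer block (self-attention + transition)."""
    return np.tanh(states)

def halting_unit(states):
    """Placeholder for the learned per-symbol halting probability (a sigmoid layer in practice)."""
    return 1.0 / (1.0 + np.exp(-states.mean(axis=-1)))  # shape: (seq_len,)

def universal_transformer_encode(states, max_steps=8, threshold=0.99):
    seq_len = states.shape[0]
    cumulative_halt = np.zeros(seq_len)           # accumulated halting probability per symbol
    still_running = np.ones(seq_len, dtype=bool)  # which symbols are still being updated

    for _ in range(max_steps):
        if not still_running.any():
            break
        new_states = transformer_step(states)
        p = halting_unit(new_states)
        # Only symbols that have not yet halted get refined on this step.
        states[still_running] = new_states[still_running]
        cumulative_halt[still_running] += p[still_running]
        still_running &= cumulative_halt < threshold
    return states

x = np.random.randn(5, 16)  # 5 symbols, each with a 16-dimensional state
print(universal_transformer_encode(x.copy()).shape)  # (5, 16)
```

The key point is that `still_running` is tracked per symbol, so easy symbols can halt after one or two steps while harder ones keep being revised.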

Paper Reference

Dehghani M, Gouws S, Vinyals O, Uszkoreit J, Kaiser Ł. Universal transformers. ICLR 2019.

Background and Transformer Review

If you are not already familiar with the Transformer model, you should read through “The Transformer: Attention is All You Need.” The Universal Transformer is a simple modification of the Transformer, so it is important to understand the Transformer model first.

If you’re already familiar with the Transformer model and would like a quick review, here goes:

Figure modified from Transformer paper

The base Transformer consists of an encoder and a decoder (a minimal code sketch of this layer structure follows the lists below):

Encoder:

  • 6 encoder layers
  • Each encoder layer has 2 sub-layers: (1) multi-head self-attention; (2) feed-forward

Decoder:

  • 6 decoder layers
  • Each decoder layer has 3 sub-layers: (1) masked multi-head self-attention; (2) encoder-decoder multi-head attention; (3) feed-forward
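
As a concrete way to see this layout, here is a minimal sketch using PyTorch's built-in Transformer layers, which bundle exactly these sub-layers. This is only an illustration of the structure, not the original paper's implementation; the dimensions follow the base model (d_model = 512 with 8 heads).

```python
import torch
import torch.nn as nn

# Base Transformer stack: 6 encoder layers and 6 decoder layers.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)  # self-attention + feed-forward
decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)  # masked self-attn + enc-dec attn + feed-forward
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

src = torch.randn(10, 1, 512)  # (source length, batch, d_model)
tgt = torch.randn(7, 1, 512)   # (target length, batch, d_model)
memory = encoder(src)          # final encoder output
out = decoder(tgt, memory)     # encoder output supplies keys/values for encoder-decoder attention
print(out.shape)               # torch.Size([7, 1, 512])
```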

Here’s a one-figure review of multi-head attention, one of the Transformer model’s key innovations:

Multi-head attention is used for encoder self-attention (which takes as input the previous encoder layer output), decoder self-attention (which takes as input the previous decoder layer output), and encoder-decoder attention (which uses the final encoder output for the keys and values and the previous decoder output as the queries). In the figure above, the parts of the model where multi-head attention is used are boxed in red on the left. On the right, the dimensions of the tensors at each part of the multi-head…
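To make those query/key/value sources explicit, here is a small sketch using a generic PyTorch multi-head attention module. In the real model, each of these three uses has its own attention layer with its own weights; a single module is reused below only to keep the example short.

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=512, num_heads=8)

enc_states = torch.randn(10, 1, 512)  # previous encoder layer output (src_len, batch, d_model)
dec_states = torch.randn(7, 1, 512)   # previous decoder layer output (tgt_len, batch, d_model)

# 1) Encoder self-attention: queries, keys, and values all come from the encoder.
enc_self, _ = attn(enc_states, enc_states, enc_states)

# 2) Decoder self-attention: same inputs, but a causal mask hides future positions.
causal_mask = torch.triu(torch.ones(7, 7), diagonal=1).bool()
dec_self, _ = attn(dec_states, dec_states, dec_states, attn_mask=causal_mask)

# 3) Encoder-decoder attention: queries from the decoder, keys and values from
#    the final encoder output.
enc_dec, _ = attn(dec_states, enc_states, enc_states)

print(enc_self.shape, dec_self.shape, enc_dec.shape)
```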


--

I am a physician with a PhD in Computer Science. My research focuses on machine learning methods development for medical data. I am the CEO of Cydoc.