English to Katakana with Sequence to Sequence in PyTorch
This is my second article reposted on Bloggie.
In the previous article, I wrote about translating English words into Katakana using Sequence-to-Sequence learning in TensorFlow (Keras). In this article, I describe how to implement the same model in PyTorch.
Note: This example is written in Python 3.7 and PyTorch 1.1
All data and code are available on GitHub.
Data Preparation
We will use the same Japanese-English name pairs dataset as in the previous article. More details about Japanese Katakana and the dataset can be found there.
We will also need to apply the same data transformation:
- Build an encoding dictionary (characters to IDs)
- Encode (transform) the names into sequences of IDs
- Append PADDING characters (0's) at the end to make the sequences equal length
These are already implemented as the build_characters_encoding and transform functions in katakana/encoding.py.
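For example, the data preparation might look roughly like this (english_words and katakana_words stand for the two columns of the dataset; the return values and the vector_size parameter are assumptions here, so check katakana/encoding.py for the exact signatures):

```python
from katakana import encoding

# Build the character-to-ID dictionaries for both languages
# (assumed to return the encoding dict, the decoding dict, and the dictionary size).
input_encoding, input_decoding, input_dict_size = \
    encoding.build_characters_encoding(english_words)
output_encoding, output_decoding, output_dict_size = \
    encoding.build_characters_encoding(katakana_words)

# Transform the names into equal-length sequences of IDs, padded with 0's at the end.
INPUT_LENGTH = OUTPUT_LENGTH = 20
encoded_input = encoding.transform(input_encoding, english_words, vector_size=INPUT_LENGTH)
encoded_output = encoding.transform(output_encoding, katakana_words, vector_size=OUTPUT_LENGTH)
```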
Sequence-to-Sequence in PyTorch
Encoder
We implement the encoder as a PyTorch Module. The encoder consists of an embedding layer (Embedding) and an lstm layer (LSTM). The module embeds the input with the embedding layer, passes the embedded input into the LSTM, and returns the final time step of the LSTM's output.
Note: we set batch_first=True to make PyTorch's LSTM take input with dimensions (batch_size, sequence_size, vector_size), similar to TensorFlow's LSTM.
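A minimal sketch of such an encoder (the layer sizes here are illustrative, not taken from the original code):

```python
from torch import nn

class Encoder(nn.Module):
    def __init__(self, input_dict_size, embedding_size=64, hidden_size=64):
        super().__init__()
        # Map character IDs (0 is PADDING) to dense vectors.
        self.embedding = nn.Embedding(input_dict_size, embedding_size)
        # batch_first=True: inputs/outputs are shaped (batch_size, sequence_size, vector_size).
        self.lstm = nn.LSTM(embedding_size, hidden_size, batch_first=True)

    def forward(self, input_sequence):
        embedded = self.embedding(input_sequence)
        output, _ = self.lstm(embedded)
        # Keep only the final time step as the encoder's summary of the input.
        return output[:, -1]
```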
Decoder
Here, we also implement the decoder as a PyTorch Module. The module consists of embedding, lstm, and linear (Linear, or Dense in Keras terms) layers. It takes two inputs, decoder_input_sequences and encoder_output.
Similar to the encoder, the decoder embeds the input sequence and passes the embedded sequence to the LSTM. However, this time, we initialize the LSTM's state with the encoder's output. The LSTM's output is then passed into the linear layer to produce the final output.
Note: Unlike the TensorFlow version, we don't apply a Softmax activation to the final output; this makes it easier to apply CrossEntropyLoss (see "Training the model"). Applying the Softmax also wouldn't change the result when we use the decoder to generate the output greedily (see "Testing the model").
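A matching sketch of the decoder under the same assumptions (using the encoder output as both the initial hidden and cell state is one reasonable choice; the original code may set up the state differently):

```python
from torch import nn

class Decoder(nn.Module):
    def __init__(self, output_dict_size, embedding_size=64, hidden_size=64):
        super().__init__()
        self.embedding = nn.Embedding(output_dict_size, embedding_size)
        self.lstm = nn.LSTM(embedding_size, hidden_size, batch_first=True)
        self.linear = nn.Linear(hidden_size, output_dict_size)

    def forward(self, decoder_input_sequences, encoder_output):
        embedded = self.embedding(decoder_input_sequences)
        # Initialize the LSTM's hidden and cell state with the encoder's output,
        # reshaped to (num_layers, batch_size, hidden_size).
        state = encoder_output.unsqueeze(0)
        output, _ = self.lstm(embedded, (state, state))
        # No Softmax here: CrossEntropyLoss expects raw scores (logits).
        return self.linear(output)
```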
Putting them together
Finally, we combine the Encoder and Decoder into a single PyTorch Module for Sequence-to-Sequence learning.
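A sketch of the combined module, reusing the Encoder and Decoder classes sketched above:

```python
from torch import nn

class Seq2Seq(nn.Module):
    def __init__(self, input_dict_size, output_dict_size):
        super().__init__()
        self.encoder = Encoder(input_dict_size)
        self.decoder = Decoder(output_dict_size)

    def forward(self, encoder_input_sequence, decoder_input_sequence):
        # Summarize the input sequence, then expand the summary into the output sequence.
        encoder_output = self.encoder(encoder_input_sequence)
        return self.decoder(decoder_input_sequence, encoder_output)
```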
We also need to prepare the training data by prepending the START character to the output sequences to make the decoder input.
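For example, assuming the START character is encoded as ID 1 (a hypothetical constant here; the repository defines its own), the decoder input can be built by shifting the encoded output right by one step:

```python
import numpy as np

START_CHAR_CODE = 1  # assumed ID reserved for the START character

# decoder_input[:, t] holds the expected output character at step t-1,
# with START at position 0 (teacher forcing).
decoder_input = np.zeros_like(encoded_output)
decoder_input[:, 0] = START_CHAR_CODE
decoder_input[:, 1:] = encoded_output[:, :-1]
```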
Training the model
PyTorch doesn't provide an out-of-the-box training function similar to TensorFlow's or Keras's Model.fit().
We need to implement a training function (sketched after the list below) that:
- Shuffle the training dataset and split the data into batches
- For each batch:
  - Initialize the optimizer by calling optimizer.zero_grad() to clear all gradients from the previous iteration
  - Run the model to generate output, and calculate the loss by comparing the expected and generated output using a criterion (in this case, CrossEntropyLoss)
  - Run loss.backward() to compute the gradients from the loss, and optimizer.step() to let the optimizer update the model
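A minimal sketch of such a training loop, assuming the encoded NumPy arrays have already been converted to LongTensors (the function and variable names are illustrative):

```python
import torch
from torch import nn, optim

def train_epoch(model, criterion, optimizer,
                encoder_input, decoder_input, decoder_output, batch_size=64):
    # Shuffle the training data and iterate over it batch by batch.
    permutation = torch.randperm(len(encoder_input))
    total_loss = 0.0
    for start in range(0, len(encoder_input), batch_size):
        indices = permutation[start:start + batch_size]
        optimizer.zero_grad()  # clear gradients left over from the previous iteration
        # output has shape (batch_size, sequence_size, output_dict_size)
        output = model(encoder_input[indices], decoder_input[indices])
        # CrossEntropyLoss compares (N, num_classes) scores with (N,) class IDs.
        loss = criterion(output.reshape(-1, output.shape[-1]),
                         decoder_output[indices].reshape(-1))
        loss.backward()   # compute gradients from the loss
        optimizer.step()  # let the optimizer update the model
        total_loss += loss.item()
    return total_loss

model = Seq2Seq(input_dict_size, output_dict_size)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())
```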
I have found that we can get a reasonably good model with the default Adam optimizer after 20-30 epochs of training (around an hour on CPUs, or a few minutes on GPUs).
Testing the model
Applying the trained PyTorch Sequence-to-Sequence model to write Katakana is very similar to the TensorFlow version. A more detailed explanation can be found in the previous article.
Starting with the encoder input and a decoder input that contains only the START character, we repeatedly have the model generate the next output character, append it to the decoder input, and feed the updated decoder input back into the model.
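A sketch of this greedy generation loop, reusing the encoding helpers, START_CHAR_CODE, and output_decoding assumed in the earlier sketches (names are illustrative):

```python
import torch

def to_katakana(model, text, input_encoding, output_decoding, sequence_length=20):
    # Encode the English word as a single-row batch of character IDs.
    encoder_input = torch.tensor(
        encoding.transform(input_encoding, [text], vector_size=sequence_length),
        dtype=torch.long)

    # The decoder input starts with only the START character.
    decoder_input = torch.zeros(1, sequence_length, dtype=torch.long)
    decoder_input[0, 0] = START_CHAR_CODE

    with torch.no_grad():
        for step in range(1, sequence_length):
            output = model(encoder_input, decoder_input)
            # The prediction at position step-1 is the character for position step.
            next_char = output[0, step - 1].argmax().item()
            if next_char == 0:  # PADDING marks the end of the sequence
                break
            decoder_input[0, step] = next_char

    # Convert the generated IDs back to characters.
    return ''.join(output_decoding[c] for c in decoder_input[0, 1:].tolist() if c != 0)
```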
After testing the model with some English words, I was able to reproduce results similar to the TensorFlow version:
- James : ジェームズ
- John : ジョン
- Robert : ロベルト
- Computer : コンプター (correctly, コンピューター)
- Taxi : タクシ (correctly, タクシー)

Wanasit T