Tacotron-pytorch

A PyTorch Implementation of Tacotron: An End-to-End Text-to-Speech Deep Learning Model

An implementation of Google's Tacotron TTS system in PyTorch.

Updates

2018/09/15: Fix RNN feeding bug.

Requirements

Install Python and PyTorch yourself:

  • python==3.6.5
  • pytorch==0.4.1

You can install the packages listed below with requirements.txt; a quick version check follows the list.

# I recommend you use virtualenv.
$ pip install -r requirements.txt
  • librosa
  • numpy
  • pandas
  • scipy
  • matplotlib
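
To verify the environment before training, a quick check like this can help (a minimal sketch, not part of the repo):

# Environment sanity check (illustrative, not part of this repo).
import sys
import torch

print(sys.version)                # expect 3.6.x
print(torch.__version__)          # expect 0.4.1
print(torch.cuda.is_available())  # should be True if you plan to pass --cuda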

Usage

  • Data
    Download LJSpeech provided by keithito. It contains 13,100 short audio clips from a single speaker, totaling approximately 24 hours; a minimal loading sketch follows.
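
    To sanity-check the download, the metadata file can be loaded with pandas (a sketch; the path assumes the config step below):

import csv
import pandas as pd

# LJSpeech's metadata.csv is pipe-separated: id | raw text | normalized text
meta = pd.read_csv('Data/LJSpeech-1.1/metadata.csv', sep='|', header=None,
                   names=['id', 'text', 'normalized_text'],
                   quoting=csv.QUOTE_NONE)  # transcripts contain quote chars
print(len(meta))  # 13100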

  • Set config.

# Set 'meta_path' and 'wav_dir' in `hyperparams.py` to the paths of your downloaded LJSpeech meta file and wav directory.
meta_path = 'Data/LJSpeech-1.1/metadata.csv'
wav_dir = 'Data/LJSpeech-1.1/wavs'
  • Train
# If you have a pretrained model, add --ckpt <ckpt_path>
$ python main.py --train --cuda
  • Evaluate
# You can change the evaluation texts in `hyperparams.py`
# ckpt files are saved in 'tmp/ckpt/' by default
$ python main.py --eval --cuda --ckpt <ckpt_timestep.pth.tar>
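
A saved checkpoint is an ordinary PyTorch file and can be inspected directly. The filename and the field names below are assumptions; they depend on what main.py actually saves:

import torch

# Hypothetical checkpoint name; real ones are written under tmp/ckpt/.
ckpt = torch.load('tmp/ckpt/ckpt_200000.pth.tar', map_location='cpu')
print(list(ckpt.keys()))  # e.g. model / optimizer state dicts, global step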

Samples

The sample texts are based on the Harvard Sentences. See the samples in samples/, which were generated after 200k training steps.

Alignment

The model starts to learn the attention alignment at around 30k steps.

[alignment plot]
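
A plot like the one above can be drawn from the decoder's attention weights with matplotlib. This is a generic sketch, not the repo's plotting code:

import matplotlib
matplotlib.use('Agg')  # render to file; no display needed
import matplotlib.pyplot as plt
import numpy as np

def plot_alignment(alignment, path):
    """alignment: array of shape (decoder_steps, encoder_steps)."""
    fig, ax = plt.subplots()
    im = ax.imshow(alignment.T, aspect='auto', origin='lower',
                   interpolation='none')
    fig.colorbar(im, ax=ax)
    ax.set_xlabel('Decoder timestep')
    ax.set_ylabel('Encoder timestep')
    fig.savefig(path)
    plt.close(fig)

plot_alignment(np.random.rand(200, 80), 'alignment.png')  # dummy data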

Differences from the original Tacotron

  1. Data bucketing (the original Tacotron uses a loss mask instead)
  2. The residual connection in the decoder CBHG is removed
  3. Batch size is set to 8
  4. Gradient clipping
  5. Noam-style learning rate decay (a sketch of items 4 and 5 follows)
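
For reference, items 4 and 5 can be implemented as below. The constants (d_model=256, warmup_steps=4000) and the toy model are illustrative, not necessarily this repo's values:

import torch

def noam_lr(step, d_model=256, warmup_steps=4000):
    """Noam schedule: linear warm-up, then inverse-sqrt decay."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

model = torch.nn.Linear(10, 10)  # stand-in for the Tacotron model
optimizer = torch.optim.Adam(model.parameters(), lr=noam_lr(1))

for step in range(1, 101):
    loss = model(torch.randn(8, 10)).pow(2).mean()  # dummy loss, batch of 8
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # (4)
    for group in optimizer.param_groups:
        group['lr'] = noam_lr(step)                                   # (5)
    optimizer.step()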
