# Tacotron-pytorch

A PyTorch implementation of Tacotron, Google's end-to-end text-to-speech (TTS) deep-learning model.
## Updates
- 2018/09/15: Fixed an RNN feeding bug.
## Requirements
Install Python and PyTorch yourself:
- python==3.6.5
- pytorch==0.4.1
The remaining dependencies can be installed from `requirements.txt`:

```shell
# Using a virtualenv is recommended.
$ pip install -r requirements.txt
```

This installs:
- librosa
- numpy
- pandas
- scipy
- matplotlib
## Usage
- **Data**

  Download LJSpeech provided by keithito. It contains 13,100 short audio clips of a single speaker, approximately 20 hours of audio in total.
- **Set the config**

  Set `meta_path` and `wav_dir` in `hyperparams.py` to the paths of your downloaded LJSpeech meta file and wav directory:

  ```python
  meta_path = 'Data/LJSpeech-1.1/metadata.csv'
  wav_dir = 'Data/LJSpeech-1.1/wavs'
  ```
- **Train**

  ```shell
  # If you have a pretrained model, add --ckpt <ckpt_path>
  $ python main.py --train --cuda
  ```
- **Evaluate**

  ```shell
  # You can change the evaluation texts in `hyperparams.py`.
  # Checkpoint files are saved under 'tmp/ckpt/' by default.
  $ python main.py --eval --cuda --ckpt <ckpt_timestep.pth.tar>
  ```
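For reference, LJSpeech's `metadata.csv` is pipe-delimited, one clip per line: clip id, raw transcript, normalized transcript. Below is a minimal sketch of reading `(id, normalized transcript)` pairs from it; the `load_metadata` helper and the inline sample line are illustrative, not part of this repo:

```python
import csv
import io

def load_metadata(fileobj):
    """Parse LJSpeech-style metadata: pipe-delimited rows of
    (clip id, raw transcript, normalized transcript)."""
    reader = csv.reader(fileobj, delimiter='|', quoting=csv.QUOTE_NONE)
    # Fall back to the raw transcript if the normalized column is absent.
    return [(row[0], row[2] if len(row) > 2 else row[1]) for row in reader]

# Tiny inline sample in LJSpeech's format (hypothetical content):
sample = "LJ001-0001|Printing, in the only sense|Printing, in the only sense\n"
pairs = load_metadata(io.StringIO(sample))
# pairs -> [("LJ001-0001", "Printing, in the only sense")]
```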
## Samples
The sample texts are based on the Harvard Sentences. See the samples in `samples/`, which were generated after 200k training steps.
## Alignment
The model starts to produce a clear attention alignment at around 30k training steps.
## Differences from the original Tacotron
- Data bucketing (the original Tacotron uses a loss mask instead)
- Residual connection removed in `decoder_CBHG`
- Batch size set to 8
- Gradient clipping
- Noam-style learning rate decay
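The Noam schedule warms the learning rate up linearly for a fixed number of steps and then decays it with the inverse square root of the step count. A minimal sketch in plain Python; the `d_model` and `warmup_steps` defaults here are illustrative assumptions, not necessarily the values used in this repo:

```python
import math

def noam_lr(step, d_model=256, warmup_steps=4000, base_lr=1.0):
    """Noam learning-rate schedule: linear warmup for `warmup_steps`,
    then decay proportional to 1/sqrt(step)."""
    step = max(step, 1)  # avoid division by zero at step 0
    return base_lr * d_model ** -0.5 * min(step ** -0.5,
                                           step * warmup_steps ** -1.5)
```

The rate peaks exactly at `step == warmup_steps`; before that the `step * warmup_steps**-1.5` term dominates (linear warmup), after that the `step**-0.5` term takes over (decay).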