You can use TensorBoard to monitor the progress of your model's training:
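For example, you can point TensorBoard at your training run's output directory (the path below is illustrative; use wherever your trainer writes its logs):

```shell
# Launch TensorBoard against an example log directory,
# then open http://localhost:6006 in a browser
tensorboard --logdir ./output/my-tts-run
```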
Speeding up training
Turn on mixed precision if your GPU supports it.
Initialize from a pretrained model, rather than a “cold” start.
Gradual training: begin with a high reduction factor (e.g. r = 7), so the model makes less granular predictions, yielding a "lower resolution" spectrogram but faster training. Then reduce r (e.g. to r = 6) and continue training; repeat until r = 2.
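To see why a higher reduction factor speeds things up: the decoder emits r mel frames per step, so a higher r means fewer decoder steps per utterance. A minimal sketch (the utterance length is an illustrative placeholder):

```python
from math import ceil

# With reduction factor r, the decoder emits r mel frames per step,
# so an utterance needs ceil(n_frames / r) decoder steps.
n_frames = 700  # mel frames in an example utterance (illustrative)

for r in (7, 6, 4, 2):  # a gradual-training schedule, high r to low r
    steps = ceil(n_frames / r)
    print(f"r={r}: {steps} decoder steps")
# r=7 needs 100 steps; r=2 needs 350 steps for the same utterance
```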
Step 2b: Optionally train your own vocoder
You can train a vocoder from scratch if you’d like.
Alternatively, just use a pretrained vocoder from the Coqui team: they have “universal” MelGAN and WaveGrad vocoders available.
Once training is complete, you can get your model to say anything you'd like.
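With Coqui's `tts` command-line tool, synthesis from your trained model might look like the following (the paths and the vocoder name are illustrative; run `tts --list_models` to see what's currently available):

```shell
# Synthesize speech from a trained model, pairing it with one of
# Coqui's pretrained universal vocoders (names may change over time)
tts --text "Hearing is believing!" \
    --model_path ./output/my-tts-run/best_model.pth \
    --config_path ./output/my-tts-run/config.json \
    --vocoder_name "vocoder_models/universal/libri-tts/fullband-melgan" \
    --out_path speech.wav
```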
Aims (revisited)
Leave the talk able to train a near state-of-the-art TTS system, with a voice of your choice, from scratch.
Understand the problem domain and common architectures for solutions.
That the paragraph below won’t be gibberish by the end of the session!
a recurrent sequence-to-sequence feature prediction network with attention which predicts a sequence of mel spectrogram frames from an input character sequence, combined with a vocoder which generates time-domain waveform samples conditioned on the predicted mel spectrogram frames. — Tacotron 2 paper
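The quote describes a two-stage pipeline, which we can sketch schematically (this is not a real model; the frames-per-character ratio and hop length are illustrative placeholders, though 80 mel bins matches Tacotron 2):

```python
# Schematic of the two-stage pipeline from the Tacotron 2 quote:
# stage 1 maps a character sequence to mel spectrogram frames,
# stage 2 (the vocoder) maps those frames to waveform samples.

N_MELS = 80       # mel bins per frame (Tacotron 2 uses 80)
HOP_LENGTH = 256  # waveform samples per mel frame (illustrative)

def feature_prediction_network(text: str) -> list:
    """Stand-in for the seq2seq network: a few mel frames per character."""
    n_frames = len(text) * 5  # placeholder ratio of frames to characters
    return [[0.0] * N_MELS for _ in range(n_frames)]

def vocoder(mel_frames: list) -> list:
    """Stand-in for the vocoder: HOP_LENGTH waveform samples per frame."""
    return [0.0] * (len(mel_frames) * HOP_LENGTH)

mel = feature_prediction_network("Hello world")
audio = vocoder(mel)
print(len(mel), len(audio))  # 55 mel frames -> 14080 waveform samples
```

The key point the quote makes is the interface between the two stages: the vocoder is conditioned only on the predicted mel frames, so the two networks can be trained (or swapped) independently.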
Hearing is Believing: Generating Realistic Speech with Deep Learning
Thanks for listening! Any questions? (You can also drop me a line: me@alexpeattie.com).