Neural network-based text-to-speech (TTS) has made rapid progress in recent years. Previous neural TTS models (e.g., Tacotron 2) first generate mel-spectrograms autoregressively from text and then synthesize speech from the generated mel-spectrograms using a separately trained vocoder. These models usually suffer from slow inference speed, robustness issues (word skipping and repeating), and limited controllability.
Recently, non-autoregressive TTS models have been designed to address these issues. Among those non-autoregressive TTS methods, FastSpeech (Ren et al., 2019) is one of the most successful models.
However, in this article, we introduce FastSpeech 2, which addresses the issues in FastSpeech and better handles the one-to-many mapping problem in non-autoregressive TTS. We also provide an end-to-end tutorial for training models from scratch or fine-tuning them on other datasets/languages with the TensorFlow framework.
What’s New in FastSpeech 2?
FastSpeech, a non-autoregressive model, generates mel-spectrograms extremely fast with improved robustness and controllability, and achieves voice quality comparable to previous autoregressive models. However, it still has some disadvantages:
- The two-stage teacher-student pipeline with knowledge distillation is complicated and time-consuming.
- The target mel-spectrograms distilled from the teacher model suffer from information loss due to data simplification.
- The phoneme duration extracted from the attention map of the teacher model is not accurate enough.
The paper FastSpeech 2: Fast and High-Quality End-to-End Text to Speech proposes the FastSpeech 2 model to solve these problems of FastSpeech as well as to better handle the one-to-many mapping problem. The solutions are as follows:
- To simplify the two-stage teacher-student training pipeline and avoid information loss due to data simplification, FastSpeech 2 is trained directly on ground-truth targets instead of the simplified outputs of a teacher model.
- To reduce the information gap between the input (text sequence) and the target output (mel-spectrograms), it introduces variation information of speech, including pitch and energy, as conditional inputs.
Figure 1: The overall architecture for FastSpeech 2 and 2s. LR in subfigure (b) denotes the length regulator operation proposed in FastSpeech. LN in subfigure (c) denotes layer normalization. The variance predictor represents the duration/pitch/energy predictor.
The overall model architecture of FastSpeech 2 is shown in Figure 1(a).
It follows the Feed-Forward Transformer (FFT) architecture in FastSpeech and introduces a variance adaptor between the phoneme encoder and the mel-spectrogram decoder, which adds different variance information such as duration, pitch, and energy into the hidden sequence to ease the one-to-many mapping problem.
FastSpeech alleviates the one-to-many mapping problem by knowledge distillation, leading to information loss. FastSpeech 2 improves the duration accuracy and introduces more variance information to reduce the information gap between input and output to ease the one-to-many mapping problem.
As shown in Figure 1(b), the variance adaptor consists of a duration predictor, a pitch predictor, and an energy predictor.
During training, the ground-truth values of duration, pitch, and energy are extracted from the recordings and added to the hidden sequence to predict the target speech. At the same time, separate variance predictors are trained for duration, pitch, and energy, and their predictions are used during inference to synthesize the target speech.
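The interplay between ground-truth values at training time and predicted values at inference time can be sketched in plain Python. The function names here are hypothetical, and the real implementation operates on tensors rather than lists; the sketch also shows the length regulator (LR in Figure 1(b)), which expands the phoneme-level sequence to frame level using the durations:

```python
# Sketch of how the variance adaptor uses ground-truth durations during
# training (teacher forcing) and predicted durations during inference.
# Helper names are hypothetical, for illustration only.

def length_regulate(hidden, durations):
    """Expand a phoneme-level sequence to frame level by repeating
    each hidden state `duration` times (the LR operation)."""
    expanded = []
    for h, d in zip(hidden, durations):
        expanded.extend([h] * d)
    return expanded

def variance_adapt(hidden, training, gt_durations=None, predict_duration=None):
    # Training: use durations extracted from the recordings.
    # Inference: use the duration predictor's output instead.
    if training:
        durations = gt_durations
    else:
        durations = [predict_duration(h) for h in hidden]
    return length_regulate(hidden, durations)

# Three phoneme-level hidden states expanded to 2 + 3 + 1 = 6 frames.
frames = length_regulate(["HH", "AH", "L"], [2, 3, 1])
print(frames)  # ['HH', 'HH', 'AH', 'AH', 'AH', 'L']
```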
Instead of extracting the phoneme duration from a pre-trained autoregressive TTS model as in FastSpeech, FastSpeech 2 extracts the phoneme duration with MFA (the Montreal Forced Aligner, an open-source text-to-audio alignment toolkit).
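As a rough illustration of how such an alignment becomes a training target, the phoneme intervals a forced aligner produces (start/end times in seconds) can be converted into per-phoneme durations measured in mel-spectrogram frames. The interval values below are made up, and the sample rate and hop size are common TTS settings, not necessarily the paper's:

```python
# Convert aligner output (phoneme, start_s, end_s) into per-phoneme
# durations in mel-spectrogram frames. Interval values are illustrative.

def intervals_to_frames(intervals, sample_rate=22050, hop_size=256):
    """Map (phoneme, start_s, end_s) intervals to frame counts."""
    frames_per_second = sample_rate / hop_size
    durations = []
    for phoneme, start, end in intervals:
        n_frames = int(round((end - start) * frames_per_second))
        durations.append((phoneme, n_frames))
    return durations

alignment = [("HH", 0.00, 0.05), ("AH", 0.05, 0.17), ("L", 0.17, 0.30)]
print(intervals_to_frames(alignment))
```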
In Figure 1(c), the duration/pitch/energy predictor consists of a 2-layer 1D-convolutional network with ReLU activation, each layer followed by layer normalization and dropout, and an extra linear layer that projects the hidden states into the output sequence.
For the duration predictor, the output is the length of each phoneme in the logarithmic domain. The pitch predictor’s output sequence is the frame-level fundamental frequency sequence. For the energy predictor, the output is a sequence of the energy of each mel-spectrogram frame.
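The predictor in Figure 1(c) can be sketched with tf.keras as follows. The hidden size, kernel size, and dropout rate below are assumptions for illustration, not necessarily the paper's exact hyperparameters:

```python
import tensorflow as tf

# Sketch of the duration/pitch/energy predictor: two 1D-convolution
# layers with ReLU, each followed by layer normalization and dropout,
# plus a final linear projection to one scalar per position
# (log-duration, pitch, or energy). Hyperparameters are assumptions.

def build_variance_predictor(hidden_size=256, kernel_size=3, dropout=0.5):
    inputs = tf.keras.Input(shape=(None, hidden_size))  # (batch, time, hidden)
    x = inputs
    for _ in range(2):
        x = tf.keras.layers.Conv1D(hidden_size, kernel_size,
                                   padding="same", activation="relu")(x)
        x = tf.keras.layers.LayerNormalization()(x)
        x = tf.keras.layers.Dropout(dropout)(x)
    outputs = tf.keras.layers.Dense(1)(x)  # linear output layer
    return tf.keras.Model(inputs, outputs)

predictor = build_variance_predictor()
out = predictor(tf.zeros([2, 10, 256]))  # 2 sequences of 10 phonemes
print(out.shape)  # (2, 10, 1)
```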
Audio quality: FastSpeech 2's MOS is higher than that of Tacotron 2 and Transformer TTS. In particular, FastSpeech 2 outperforms FastSpeech. This demonstrates the effectiveness of providing variance information such as pitch, energy, and more accurate duration, and of directly taking ground-truth speech as the training target without a teacher-student distillation pipeline.
Table 1: The MOS evaluation.
Training and inference speedup: Table 2 compares the training and inference time of FastSpeech 2 and FastSpeech. FastSpeech 2 reduces the total training time by 3.12x compared to FastSpeech because it removes the teacher-student distillation pipeline, and it speeds up waveform synthesis by 47x compared to Transformer TTS.
Table 2: The comparison of training time and inference latency in waveform synthesis. RTF denotes the real-time factor, the time (in seconds) required for the system to synthesize a one-second waveform. The training and inference latency test is conducted on a server with 36 Intel Xeon CPUs, 256GB memory, 1 NVIDIA V100 GPU, and a batch size of 48 for training and 1 for inference.
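The real-time factor in Table 2 is simply synthesis time divided by the duration of the synthesized audio, so RTF < 1 means faster than real time. A minimal sketch, with illustrative numbers rather than the paper's measurements:

```python
# RTF = time taken to synthesize / duration of synthesized audio.
# RTF < 1 means the system runs faster than real time.

def real_time_factor(synthesis_seconds, audio_seconds):
    return synthesis_seconds / audio_seconds

# Synthesizing 5 s of audio in 0.1 s gives RTF = 0.02,
# i.e. 50x faster than real time.
print(real_time_factor(0.1, 5.0))
```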
TensorFlowTTS provides real-time state-of-the-art speech synthesis architectures such as Tacotron 2, MelGAN, Multi-band MelGAN, FastSpeech, and FastSpeech 2, based on TensorFlow 2. With TensorFlow 2, we can:
- Speed up training/inference
- Optimize further with fake-quantization-aware training and pruning
- Run TTS models faster than real time
- Deploy models on mobile devices or embedded systems
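For on-device deployment, TensorFlow 2 models can be exported to TFLite. A hedged sketch of the conversion step is below; the tiny Keras model stands in for a trained TTS model, since a real FastSpeech 2 export needs concrete input signatures:

```python
import tensorflow as tf

# Minimal sketch of exporting a Keras model to TFLite. The model here
# is a stand-in, not a TTS model.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(2, input_shape=(4,)),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_bytes = converter.convert()  # serialized FlatBuffer bytes

with open("model.tflite", "wb") as f:
    f.write(tflite_bytes)
```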
TensorFlowTTS currently supports only six languages: Chinese, English, Korean, German, French, and Japanese.
There are more than 5,000 languages around the world, but very few have datasets large enough to train high-quality TTS models. Vietnamese is one of these low-resource languages, with few public audio datasets. The motivation of this work is to implement FastSpeech 2 using TensorFlowTTS to provide a robust, fast, scalable, and reliable Vietnamese text-to-speech system.
The dataset we use in this article is INFORE, donated by the InfoRe Technology company. It contains roughly 25 hours of speech recorded by a native Vietnamese volunteer. The training set includes 11,955 recorded files, and the validation set includes 2,980 (14,935 in total).
This dataset consists of two parts:
- Audio files in .wav format.
- A text file containing the transcriptions of all audio files.
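Before training, each audio file must be paired with its transcription. A sketch of that step is below, assuming a simple `file_id|transcription` line format in the metadata file; the actual INFORE layout may differ, so treat this as illustrative:

```python
from pathlib import Path

# Pair each .wav file with its transcription, assuming one
# "file_id|transcription" entry per line in the metadata file.
def load_pairs(metadata_path, wav_dir):
    pairs = []
    for line in Path(metadata_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        file_id, text = line.split("|", 1)
        pairs.append((Path(wav_dir) / f"{file_id}.wav", text))
    return pairs
```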
Please visit this GitHub repository to refer to an end-to-end tutorial for training FastSpeech 2 on the Vietnamese dataset or other languages.
We have published the first version of the model on Hugging Face. You can find it by typing "fastspeech2 vi" in the search bar and picking the first result.
Or simply click the links below:
The ability to convert the model into different formats, such as TFLite and TF.js, makes inference and deployment more flexible than ever. Implementing FastSpeech 2 with the TensorFlow framework opens huge opportunities for developers and users to approach many SOTA methods more easily.
If you are still curious about Vietnamese text-to-speech, check out our article on Vietnamese Automatic Speech Recognition Using the NVIDIA QuartzNet Model.