
Vietnamese Automatic Speech Recognition Using NVIDIA – QuartzNet Model


Trinh Nguyen

Dec 08, 2022

In this article, we demonstrate the efficacy of transfer learning for automatic speech recognition (ASR). We start with a pre-trained English ASR model and show that transfer learning can be performed effectively and easily in a different language (here, Vietnamese), even when the fine-tuning dataset is small.

Index Terms – Vietnamese, automatic speech recognition, transfer learning, text to speech Vietnamese

Introduction

In the last few years, end-to-end (E2E) neural networks (NN) have achieved new state-of-the-art (SOTA) results on many automatic speech recognition (ASR) tasks. Such models replace the traditional multi-component ASR system with a single, end-to-end trained NN that directly predicts character sequences, which greatly simplifies training, fine-tuning, and inference.

However, E2E training of ASR models requires large datasets and heavy compute resources. There are more than 5,000 languages around the world, but very few of them have datasets large enough to train high-quality ASR models. Vietnamese is one of the many languages for which only a small dataset is available.

The motivation of this work is to build Vietnamese ASR by fine-tuning a pre-trained English model (QuartzNet 15×5).

Related Works

Transfer learning (TL) for ASR was originally used for Gaussian Mixture Model – Hidden Markov Model (GMM-HMM) systems. It relied on the idea that phoneme representations can be shared across different languages.

Anderson et al. applied this idea to acoustic modeling using the International Phonetic Alphabet (IPA). Cross-language acoustic model adaptation was explored in depth in the GlobalPhone project. It was based on two methods: (1) partial model adaptation for languages with limited data, and (2) bootstrapping, where the model for a new target language is initialized with a model for another language and then completely re-trained on the target dataset.

Hybrid Deep Neural Network (DNN) – HMM models also made use of TL. The features learned by DNN models tend to be language-independent in the lower layers, so these low-level layers can be shared across languages.

This hypothesis was experimentally confirmed by TL between ASR models for Germanic, Romance, and Slavic languages. Kunze et al. applied TL to DNN-based end-to-end ASR models and adapted an English ASR model for German. In their experiments, they used a Wav2Letter model and froze the lower convolutional layers while retraining the upper layers.

Similarly, Bukhar et al. adapted a multi-language ASR model for two new low-resource languages (Uyghur and Vietnamese) by retraining the network’s last layer. Tong et al. trained a multilingual CTC-based model with an IPA-based phone set and then adapted it for a language with limited data.

They compared three approaches for cross-lingual adaptation: (1) retraining only an output layer; (2) retraining all parameters; (3) randomly initializing weights of the last layer and then updating the whole network. They found that updating all the parameters performs better than only retraining the output layer.

Model Architecture

[Figure: QuartzNet model architecture]

Our experiments use a QuartzNet model trained with Connectionist Temporal Classification (CTC) loss. QuartzNet employs 1D time-channel separable convolutions, a 1D version of depthwise separable convolutions. Each depthwise separable convolution module is made up of two parts: a depthwise convolutional layer and a pointwise convolutional layer.

Depthwise convolutions apply a single filter per input channel (input depth). Pointwise convolutions are 1 × 1, helping create a linear combination of the outputs of the depthwise layer. BatchNorm and ReLU are applied to the outputs of both layers.

Each 1D time-channel separable convolution block can thus be separated into (1) a depthwise 1D convolutional layer with kernel length K that operates on each channel separately across K time frames, and (2) a pointwise convolutional layer that operates on each time frame independently across all channels.
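
To make the decomposition concrete, here is a minimal PyTorch sketch of one such module (our own illustration, not the NeMo implementation): a depthwise 1D convolution with kernel size K applied per channel, followed by a 1×1 pointwise convolution, with BatchNorm and ReLU after each layer as described above. The class name and channel sizes are illustrative.

import torch
import torch.nn as nn

class TimeChannelSeparableConv1d(nn.Module):
    """Depthwise (kernel K over time, per channel) + pointwise (1x1) 1D convolution."""
    def __init__(self, in_channels, out_channels, kernel_size):
        super().__init__()
        # Depthwise: one filter per input channel (groups = in_channels).
        self.depthwise = nn.Conv1d(in_channels, in_channels, kernel_size,
                                   padding=kernel_size // 2, groups=in_channels)
        # Pointwise: 1x1 convolution that mixes channels at each time frame.
        self.pointwise = nn.Conv1d(in_channels, out_channels, kernel_size=1)
        self.bn1 = nn.BatchNorm1d(in_channels)
        self.bn2 = nn.BatchNorm1d(out_channels)
        self.relu = nn.ReLU()

    def forward(self, x):  # x: (batch, channels, time)
        x = self.relu(self.bn1(self.depthwise(x)))
        x = self.relu(self.bn2(self.pointwise(x)))
        return x

# Example: 64 input feature channels, 256 output channels, kernel size 33.
block = TimeChannelSeparableConv1d(64, 256, kernel_size=33)
out = block(torch.randn(8, 64, 400))  # -> shape (8, 256, 400)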

QuartzNet models have the following structure: start with a 1D time-channel separable convolutional layer C1 followed by a sequence of blocks (see Figure above). Each block Bi is repeated Si times and has residual connections between each repetition. Each block Bi consists of the same base modules repeated Ri times.

The base module contains four layers:

1) a K-sized depthwise convolutional layer with C channels

2) a pointwise convolution

3) a normalization layer

4) ReLU

The last part of the model consists of one additional time-channel separable convolution (C2) and two 1D convolutional layers (C3, C4). The C1 layer has a stride of 2, and the C2 layer has a dilation of 2.

There are five unique blocks: B1–B5, and each block is repeated S = 3 times (B1 – B1 – B1 – … – B5 – B5 – B5).

A regular 1D convolutional layer with kernel size K, c_in input channels, and c_out output channels has K × c_in × c_out weights. A time-channel separable convolution uses K × c_in + c_in × c_out weights, split into K × c_in weights for the depthwise layer and c_in × c_out weights for the pointwise layer.
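
For example, with K = 33 and c_in = c_out = 256 (illustrative values), a regular convolution needs 33 × 256 × 256 ≈ 2.16M weights, whereas the separable version needs only 33 × 256 + 256 × 256 ≈ 74K weights, roughly a 29× reduction.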

Dataset

The dataset we use in this article is the VIVOS dataset, a speech corpus recorded by more than 50 native Vietnamese volunteers.

For training, 46 speakers (22 male and 24 female) recorded 15 hours of speech comprising 11,660 utterances. For testing, another set of 19 speakers (12 male and 7 female) recorded 50 minutes of speech with 760 utterances in total.

The dataset has two parts:

  • Audio files in .wav format
  • A text file containing the transcriptions of all audio files

Walking Through the Code

We recommend using Google Colab for this training section, but if you have all the dependencies and a GPU, you can run it locally.

First, open a new Python 3 notebook and follow the instructions below.

Install Dependencies

!pip install wget
!apt-get install sox libsndfile1 ffmpeg libsox-fmt-mp3
!pip install unidecode
BRANCH = 'main'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]

Download Dataset

import glob
import os
import subprocess
import tarfile
import wget

data_dir = '.'


# Download the dataset

print("******")
if not os.path.exists(data_dir + '/vivos.tar.gz'):
    vivos_url = 'https://ailab.hcmus.edu.vn/assets/vivos.tar.gz'
    vivos_path = wget.download(vivos_url, data_dir)
    print(f"Dataset downloaded at: {vivos_path}")
else:
    print("Tarfile already exists.")
    vivos_path = data_dir + '/vivos.tar.gz'
if not os.path.exists(data_dir + '/vivos/'):
    # Untar and convert .sph to .wav (using sox)
    tar = tarfile.open(vivos_path)
    tar.extractall(path=data_dir)
    print("Converting .sph to .wav...")
    sph_list = glob.glob(data_dir + '/vivos/**/*.sph', recursive=True)
    for sph_path in sph_list:
        wav_path = sph_path[:-4] + '.wav'
        cmd = ["sox", sph_path, wav_path]
        subprocess.run(cmd)

print("Finished conversion.\n******")
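
Before fine-tuning, NeMo expects the data as JSON-lines manifest files with audio_filepath, duration, and text fields. Below is a minimal sketch of how such manifests could be built, assuming the usual VIVOS layout (waves/<speaker>/<utterance>.wav plus a prompts.txt mapping utterance IDs to transcripts) and the soundfile package that NeMo installs; the build_manifest helper and the manifest filenames are our own.

import json
import soundfile as sf

def build_manifest(prompts_path, waves_dir, manifest_path):
    """Write a NeMo-style manifest: one JSON object per line."""
    with open(prompts_path, encoding='utf-8') as fin, \
         open(manifest_path, 'w', encoding='utf-8') as fout:
        for line in fin:
            if not line.strip():
                continue
            utt_id, text = line.strip().split(' ', 1)
            speaker = utt_id.split('_')[0]
            audio_path = os.path.join(waves_dir, speaker, utt_id + '.wav')
            entry = {"audio_filepath": audio_path,
                     "duration": sf.info(audio_path).duration,
                     "text": text.lower()}  # lowercase to keep the character set small
            fout.write(json.dumps(entry, ensure_ascii=False) + '\n')

train_manifest = data_dir + '/train_manifest.json'
test_manifest = data_dir + '/test_manifest.json'
build_manifest(data_dir + '/vivos/train/prompts.txt', data_dir + '/vivos/train/waves', train_manifest)
build_manifest(data_dir + '/vivos/test/prompts.txt', data_dir + '/vivos/test/waves', test_manifest)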

Character Encoding CTC Model

Now that we have a processed dataset, we can begin training an ASR model on it. The following section details how we prepare a CTC model that uses a character encoding scheme.

This section uses a pre-trained QuartzNet 15×5 base model, trained on roughly 7,000 hours of English speech. We will replace the decoder layer, thereby changing the model’s vocabulary.

import nemo.collections.asr as nemo_asr

char_model = nemo_asr.models.ASRModel.from_pretrained("stt_en_quartznet15x5")
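
Next, the decoder vocabulary has to be swapped from English to Vietnamese characters. A simple approach, sketched below, is to collect every character that appears in the training transcripts (using the train_manifest built earlier) and pass that list to the model's change_vocabulary method; the exact character set depends on how the transcripts were normalized.

import json

# Collect the set of characters used in the Vietnamese training transcripts.
charset = set()
with open(train_manifest, encoding='utf-8') as f:
    for line in f:
        charset.update(json.loads(line)['text'])

new_vocabulary = sorted(charset)  # includes the space character

# Replace the English decoder with a randomly initialized one over the new vocabulary.
char_model.change_vocabulary(new_vocabulary=new_vocabulary)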

Train Low-resource Languages

If the amount of training data or available compute is limited, it can be useful to freeze the encoder module of the network and train just the final decoder layer. This is also useful when GPU memory is insufficient to train a large network, or when the model might overfit due to its size. For Vietnamese, however, we recommend not freezing the encoder: Vietnamese and English differ substantially in phonetics, so the encoder also needs to be fine-tuned.

#@title Freeze Encoder { display-mode: "form" }
from nemo.utils import logging

freeze_encoder = False #@param ["False", "True"] {type:"raw"}
freeze_encoder = bool(freeze_encoder)

if freeze_encoder:
  # enable_bn_se is a small helper defined in the NeMo fine-tuning tutorial; it keeps
  # BatchNorm (and SqueezeExcite) layers trainable inside the otherwise frozen encoder.
  char_model.encoder.freeze()
  char_model.encoder.apply(enable_bn_se)
  logging.info("Model encoder has been frozen, and batch normalization has been unfrozen")
else:
  char_model.encoder.unfreeze()
  logging.info("Model encoder has been un-frozen")

Setting up Augmentation

Remember that the model was trained on several thousand hours of data, so the regularization built into it might not suit the current dataset. We can easily change it as we see fit.

Note: For low-resource languages, it might be better to increase augmentation via SpecAugment to reduce overfitting. However, this might, in turn, make it too hard for the model to train in a short number of epochs.

## Uncomment the lines below if you want to augment your data
# with open_dict(char_model.cfg.spec_augment):
#     char_model.cfg.spec_augment.freq_masks = 2
#     char_model.cfg.spec_augment.freq_width = 25
#     char_model.cfg.spec_augment.time_masks = 2
#     char_model.cfg.spec_augment.time_width = 0.05

char_model.spec_augmentation = char_model.from_config_dict(char_model.cfg.spec_augment)

Setup Metrics

Originally, the model was trained on an English corpus. When calculating Word Error Rate (WER), we can easily use the “space” token as a separator for word boundaries. On the other hand, certain languages such as Japanese and Mandarin do not use “space” tokens, instead opting for different ways to annotate word boundaries.

In cases where the “space” token is not used to denote a word boundary, we can use the Character Error Rate metric instead, which computes the edit distance at a token level rather than a word level.

We might also want to inspect model predictions during training and inference. As such, we can enable logging of the predictions.

#@title Metric
use_cer = True #@param ["False", "True"] {type:"raw"}
log_prediction = True #@param ["False", "True"] {type:"raw"}
char_model._wer.use_cer = use_cer
char_model._wer.log_prediction = log_prediction
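
Setup Training and Validation Data

Before we create the trainer, the model also needs to be pointed at the Vietnamese manifests. The following is a minimal sketch that assumes the train_manifest and test_manifest paths built during data preparation and reuses the train_ds/validation_ds entries already present in the pre-trained model's config; the batch size is an illustrative value.

import copy
from omegaconf import open_dict

# Copy the model's existing dataset config and point it at the VIVOS manifests.
cfg = copy.deepcopy(char_model.cfg)
with open_dict(cfg):
    cfg.train_ds.manifest_filepath = train_manifest
    cfg.train_ds.batch_size = 16
    cfg.validation_ds.manifest_filepath = test_manifest
    cfg.validation_ds.batch_size = 16

char_model.setup_training_data(cfg.train_ds)
char_model.setup_validation_data(cfg.validation_ds)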

Setup Trainer and Experiment Manager

And that’s it! We can train the model using the PyTorch Lightning Trainer and the NeMo Experiment Manager, as always.

For demonstration purposes, the number of epochs is kept intentionally low. Reasonable results can be obtained in around 100 epochs (we recommend about 1,000 epochs for better results if you have a GPU or a Google Colab Pro/Pro+ account).

import torch
import pytorch_lightning as ptl 

if torch.cuda.is_available():
  accelerator = 'gpu'
else:
  accelerator = 'cpu' 

EPOCHS = 50  # 100 epochs would provide better results, but would take an hour to train 

trainer = ptl.Trainer(devices=1,
                      accelerator=accelerator,
                      max_epochs=EPOCHS,
                      accumulate_grad_batches=1,
                      enable_checkpointing=False,
                      logger=False,
                      log_every_n_steps=5,
                      check_val_every_n_epoch=10)

# Setup model with the trainer
char_model.set_trainer(trainer) 

# Finally, update the model's internal configchar_model.cfg = 
char_model._cfg

 

# Environment variable generally used for multi-node multi-gpu training.
# In notebook environments, this flag is unnecessary and can cause logs of
# multiple training runs to overwrite each other.
os.environ.pop('NEMO_EXPM_VERSION', None)

from omegaconf import OmegaConf
from nemo.utils import exp_manager

LANGUAGE = "vi"  # short label used only to name the experiment directory

config = exp_manager.ExpManagerConfig(
    exp_dir=f'experiments/lang-{LANGUAGE}/',
    name=f"ASR-Char-Model-Language-{LANGUAGE}",
    checkpoint_callback_params=exp_manager.CallbackParams(
        monitor="val_wer",
        mode="min",
        always_save_nemo=True,
        save_best_model=True,
    ),
) 

config = OmegaConf.structured(config)

logdir = exp_manager.exp_manager(trainer, config)

Let’s Train

%%time
trainer.fit(char_model)
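
After training, we can sanity-check the fine-tuned model on a few recordings of our own. The snippet below is a minimal sketch using NeMo's transcribe helper (the paths2audio_files keyword follows the NeMo 1.x API current at the time of writing); the audio path is a placeholder.

# Transcribe Vietnamese .wav files with the fine-tuned model.
transcriptions = char_model.transcribe(
    paths2audio_files=['path/to/your_recording.wav'],
    batch_size=4,
)
print(transcriptions)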

You can refer to the full notebook for more details and information.

Result and Conclusion

[Figure: training results]

We tried several voice recordings in .wav format and obtained good results. However, the model still has some weaknesses:

  • It does not perform very well on voices that are not represented in the dataset.
  • It is not yet ready for real-time ASR, because the fine-tuning dataset is very small (about 15 hours of audio) compared with the pre-training data (roughly 7,000 hours of English audio).

A transfer learning approach based on reusing a pre-trained QuartzNet encoder turns out to be very effective for various ASR tasks. In all our experiments, we observed that fine-tuning a good baseline yields strong results even with a small dataset and a small model.

We introduced this method to implement Vietnamese automatic speech recognition (ASR) using the QuartzNet 15×5 model, a deep neural network built from 1D time-channel separable convolutional layers. Its small size (about 18.9M parameters) opens new possibilities for speech recognition on mobile and embedded devices.

References

Cross-Language Transfer Learning, Continuous Learning, and Domain Adaptation for End-to-End Automatic Speech Recognition

QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions
