Master-ASR: Achieving Multilingual Scalability and Low-Resource Adaptation in ASR with Modular Learning

*EIC Lab @ Georgia Institute of Technology
#MIT-IBM Watson AI Lab

Abstract


Despite the impressive performance recently achieved by automatic speech recognition (ASR), we observe two primary challenges that hinder its broader application: (1) the difficulty of scaling a model to support more languages with limited training, inference, and storage overhead; (2) the difficulty of adapting to low-resource languages effectively while avoiding over-fitting and catastrophic forgetting. Inspired by recent findings, we hypothesize that both challenges can be addressed with modules widely shared across languages. To this end, we propose an ASR framework, dubbed Master-ASR, that, for the first time, simultaneously achieves strong multilingual scalability and low-resource adaptation ability thanks to its modularize-then-assemble strategy. Specifically, Master-ASR learns a small set of generalizable sub-modules and adaptively assembles them for different languages to reduce the multilingual overhead and enable effective knowledge transfer for low-resource adaptation. Extensive experiments and visualizations demonstrate that Master-ASR can effectively discover language similarity and improve multilingual and low-resource ASR performance over state-of-the-art (SOTA) methods, e.g., a 0.13∼2.41 lower character error rate (CER) with 30% smaller inference overhead than SOTA solutions on multilingual ASR, and a comparable CER with nearly 50 times fewer trainable parameters than SOTA solutions on low-resource tuning.

Challenges in ASR towards real-world applications


The multilingual scalability


An ideal ASR system should support multiple languages while avoiding excessive training, inference, and model storage overhead as the number of supported languages increases (Yadav & Sitaram, 2022). To avoid training completely separate models for different languages (Babu et al., 2021; Conneau et al., 2020), most existing works either introduce adapter-like modules that adapt a pretrained model to each language with few additional parameters (Le et al., 2021; Hou et al., 2021; Fu et al., 2022), or use a much larger model with a dedicated training recipe to increase the model capacity and cater to more complex multilingual ASR tasks (Li et al., 2021; 2022; Pratap et al., 2020). However, the former requires the model to be tuned for each language separately, resulting in high training costs (Le et al., 2021; Hou et al., 2021; Fu et al., 2022), while the latter incurs a significant increase in inference cost due to the larger model size (Li et al., 2021; 2022; Pratap et al., 2020).

The low-resource adaptation ability


Given the limited training data available for low-resource languages (e.g., less than one hour per language as in (Fu et al., 2022)), effectively adapting an ASR model to target low-resource languages has been a long-standing challenge in ASR. Existing attempts to address this challenge leverage the knowledge learned by pretrained models. In addition to directly tuning a pretrained model on low-resource languages (Hsu et al., 2021; Baevski et al., 2020; Conneau et al., 2020), techniques such as utilizing data from other modalities (Zheng et al., 2021; Du et al., 2022; Liang et al., 2020), meta-learning (Hsu et al., 2020), and parameter-efficient tuning (Fu et al., 2022; Hou et al., 2021) have been explored to further improve low-resource adaptation. However, how to better utilize the learned knowledge while avoiding over-fitting (Hou et al., 2021; Cai et al., 2014) and catastrophic forgetting (Winata et al., 2020; Kessler et al., 2021) during adaptation remains an open research question.

Master-ASR overview



An illustration of (a) a vanilla pretrained model and (b) a Master-ASR model built on top of it by replacing each vanilla QKV or Projection layer with our proposed Artisan Layer.


Block diagram of the proposed Artisan Layer and our two-stage training pipeline: (a) training the Artisan Layer for scalable multilingual ASR, where we learn (1) a mapping matrix T and (2) a set of Specialist Scores {Mk} (k ∈ [K]), where K = 4 in this example, and tune (3) the corresponding pretrained weights of the QKV or Projection layer; (b) tuning the Artisan Layer for low-resource ASR, where we support a new language by inserting and tuning only a new row in the mapping matrix while freezing all other parameters of the Artisan Layer.
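To make the assembly mechanism in the figure above concrete, the snippet below gives a minimal PyTorch sketch of an Artisan-style linear layer, written only from the description in this overview: it keeps a shared pretrained weight, K Specialist Scores {Mk}, and a mapping matrix T with one row per supported language, and it adds a new language by appending and tuning a single new row of T. The class and method names (ArtisanLinear, add_language), the softmax mixing over specialists, and the element-wise sigmoid gating of the shared weight are illustrative assumptions, not the paper's exact formulation.

# A minimal PyTorch sketch of an Artisan-style linear layer (hypothetical names and
# shapes). How the assembled Specialist Score modulates the shared pretrained
# weight -- here an element-wise sigmoid gate -- is an assumption for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ArtisanLinear(nn.Module):
    """Drop-in replacement for a pretrained QKV/Projection linear layer."""

    def __init__(self, pretrained: nn.Linear, num_languages: int, k: int = 4):
        super().__init__()
        out_f, in_f = pretrained.weight.shape
        # (3) shared pretrained weight/bias, tuned in stage (a)
        self.weight = nn.Parameter(pretrained.weight.detach().clone())
        self.bias = nn.Parameter(pretrained.bias.detach().clone())
        # (2) Specialist Scores {M_k}, shared across all languages
        self.scores = nn.Parameter(torch.zeros(k, out_f, in_f))
        # (1) mapping matrix T, one row per supported language
        self.T = nn.Parameter(torch.zeros(num_languages, k))

    def forward(self, x: torch.Tensor, lang_id: int) -> torch.Tensor:
        # Assemble a language-specific score from the shared specialists.
        mix = F.softmax(self.T[lang_id], dim=-1)              # contribution of each specialist
        score = torch.einsum("k,koi->oi", mix, self.scores)   # (out_f, in_f)
        gate = torch.sigmoid(score)                           # assumed gating, see note above
        return F.linear(x, self.weight * gate, self.bias)

    def add_language(self) -> int:
        """Stage (b): append one new row to T for a new language and freeze the rest.
        (In practice, gradients of the pre-existing rows of T would also be masked
        so that only the newly inserted row is updated.)"""
        new_row = torch.zeros(1, self.T.shape[1], device=self.T.device)
        self.T = nn.Parameter(torch.cat([self.T.data, new_row], dim=0))
        for name, param in self.named_parameters():
            param.requires_grad = (name == "T")
        return self.T.shape[0] - 1  # index of the newly supported language


# Example usage (hypothetical dimensions): wrap one pretrained linear layer,
# then add a new low-resource language.
layer = ArtisanLinear(nn.Linear(768, 768), num_languages=51, k=4)
y = layer(torch.randn(2, 10, 768), lang_id=3)
new_id = layer.add_language()  # only the mapping matrix stays trainable

In this sketch, supporting one more language adds only K scalars per Artisan Layer, which reflects why the multilingual storage and training overhead stays small, and freezing everything but the mapping matrix during stage (b) is what guards against over-fitting and catastrophic forgetting in low-resource adaptation.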

Evaluation


Benchmarking our Master-ASR against SOTA multilingual ASR solutions. Performance on each language is measured in CER, and the reported inference, training, and storage overhead are normalized to the Separate Weight Tuning baseline. The column “All-avg” reports the average CER over the 51 languages in our multilingual dataset. Note that all methods in this table, except the Separate Weight Tuning baseline, adopt a single shared multilingual model to process all languages.


Benchmarking our Master-ASR against SOTA solutions on low-resource tuning. Each language is trained with only 10 minutes of data. “Param.” denotes the number of trainable parameters.


Citation


@inproceedings{yu2023master,
title={Master-ASR: Achieving Multilingual Scalability and Low-Resource Adaptation in ASR with Modular Learning},
author={Yu, Zhongzhi and Zhang, Yang and Qian, Kaizhi and Wan, Cheng and Fu, Yonggan and Zhang, Yongan and Lin, Yingyan Celine},
booktitle={International Conference on Machine Learning},
pages={40475--40487},
year={2023},
organization={PMLR}
}

Acknowledgements


This work was supported in part by CoCoSys, one of the seven centers in JUMP 2.0, a Semiconductor Research Corporation (SRC) program sponsored by DARPA, and by an IBM Faculty Award received by Dr. Yingyan (Celine) Lin.

The website template was borrowed from Instant Neural Graphics Primitives.