TY - GEN
T1 - An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition
AU - Chang, Xuankai
AU - Maekaku, Takashi
AU - Guo, Pengcheng
AU - Shi, Jing
AU - Lu, Yen-Ju
AU - Subramanian, Aswin Shanmugam
AU - Wang, Tianzi
AU - Yang, Shu-wen
AU - Tsao, Yu
AU - Lee, Hung-yi
AU - Watanabe, Shinji
N1 - Funding Information:
This work used the Extreme Science and Engineering Discovery Environment (XSEDE) [50], which is supported by National Science Foundation grant number ACI-1548562. Specifically, it used the Bridges system [51], which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC).
Publisher Copyright:
© 2021 IEEE.
PY - 2021
Y1 - 2021
N2 - Self-supervised pretraining on speech data has made substantial progress. High-fidelity representations of the speech signal are learned from large amounts of untranscribed data and show promising performance. Recently, several works have focused on evaluating the quality of self-supervised pretrained representations on various tasks without domain restriction, e.g., SUPERB. However, such evaluations do not provide a comprehensive comparison across many ASR benchmark corpora. In this paper, we focus on the general application of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models. We select several pretrained speech representations and present experimental results on various open-source and publicly available corpora for E2E-ASR. Without any modification of the back-end model architectures or training strategy, some of the experiments with pretrained representations, e.g., WSJ and WSJ0-2mix with HuBERT, reach or outperform the current state-of-the-art (SOTA) recognition performance. Moreover, we further explore scenarios in which the pretrained representations are effective, such as cross-language and overlapped speech. The scripts, configurations, and trained models have been released in ESPnet so that the community can reproduce and improve upon our experiments.
AB - Self-supervised pretraining on speech data has made substantial progress. High-fidelity representations of the speech signal are learned from large amounts of untranscribed data and show promising performance. Recently, several works have focused on evaluating the quality of self-supervised pretrained representations on various tasks without domain restriction, e.g., SUPERB. However, such evaluations do not provide a comprehensive comparison across many ASR benchmark corpora. In this paper, we focus on the general application of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models. We select several pretrained speech representations and present experimental results on various open-source and publicly available corpora for E2E-ASR. Without any modification of the back-end model architectures or training strategy, some of the experiments with pretrained representations, e.g., WSJ and WSJ0-2mix with HuBERT, reach or outperform the current state-of-the-art (SOTA) recognition performance. Moreover, we further explore scenarios in which the pretrained representations are effective, such as cross-language and overlapped speech. The scripts, configurations, and trained models have been released in ESPnet so that the community can reproduce and improve upon our experiments.
KW - ESPnet
KW - End-to-End Speech Recognition
KW - Representation Learning
UR - http://www.scopus.com/inward/record.url?scp=85117658726&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85117658726&partnerID=8YFLogxK
U2 - 10.1109/ASRU51503.2021.9688137
DO - 10.1109/ASRU51503.2021.9688137
M3 - Conference contribution
AN - SCOPUS:85117658726
T3 - 2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021 - Proceedings
SP - 228
EP - 235
BT - 2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2021 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021
Y2 - 13 December 2021 through 17 December 2021
ER -