TY - GEN
T1 - GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio
T2 - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
AU - Chen, Guoguo
AU - Chai, Shuzhou
AU - Wang, Guanbo
AU - Du, Jiayu
AU - Zhang, Wei-Qiang
AU - Weng, Chao
AU - Su, Dan
AU - Povey, Daniel
AU - Trmal, Jan
AU - Zhang, Junbo
AU - Jin, Mingjie
AU - Khudanpur, Sanjeev
AU - Watanabe, Shinji
AU - Zhao, Shuaijiang
AU - Zou, Wei
AU - Li, Xiangang
AU - Yao, Xuchen
AU - Wang, Yongqing
AU - You, Zhao
AU - Yan, Zhiyong
N1 - Publisher Copyright:
Copyright © 2021 ISCA.
PY - 2021
Y1 - 2021
N2 - This paper introduces GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high-quality labeled audio suitable for supervised training, and 33,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 33,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles, and a variety of topics, such as arts, science and sports. A new forced alignment and segmentation pipeline is proposed to create sentence segments suitable for speech recognition training, and to filter out segments with low-quality transcriptions. For system training, GigaSpeech provides five subsets of different sizes: 10h, 250h, 1000h, 2500h, and 10000h. For our 10,000-hour XL training subset, we cap the word error rate at 4% during the filtering/validation stage, and for all our other smaller training subsets, we cap it at 0%. The DEV and TEST evaluation sets, on the other hand, are re-processed by professional human transcribers to ensure high transcription quality. Baseline systems are provided for popular speech recognition toolkits, namely Athena, ESPnet, Kaldi and Pika.
AB - This paper introduces GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high-quality labeled audio suitable for supervised training, and 33,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 33,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles, and a variety of topics, such as arts, science and sports. A new forced alignment and segmentation pipeline is proposed to create sentence segments suitable for speech recognition training, and to filter out segments with low-quality transcriptions. For system training, GigaSpeech provides five subsets of different sizes: 10h, 250h, 1000h, 2500h, and 10000h. For our 10,000-hour XL training subset, we cap the word error rate at 4% during the filtering/validation stage, and for all our other smaller training subsets, we cap it at 0%. The DEV and TEST evaluation sets, on the other hand, are re-processed by professional human transcribers to ensure high transcription quality. Baseline systems are provided for popular speech recognition toolkits, namely Athena, ESPnet, Kaldi and Pika.
KW - Corpus
KW - Forced alignment
KW - Segmentation
KW - Speech recognition
UR - http://www.scopus.com/inward/record.url?scp=85119267084&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85119267084&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2021-1965
DO - 10.21437/Interspeech.2021-1965
M3 - Conference contribution
AN - SCOPUS:85119267084
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 4376
EP - 4380
BT - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
PB - International Speech Communication Association
Y2 - 30 August 2021 through 3 September 2021
ER -