TY - JOUR
T1 - Streaming Automatic Speech Recognition with Re-blocking Processing Based on Integrated Voice Activity Detection
AU - Sudo, Yui
AU - Shakeel, Muhummad
AU - Nakadai, Kazuhiro
AU - Shi, Jiatong
AU - Watanabe, Shinji
N1 - Publisher Copyright:
Copyright © 2022 ISCA.
PY - 2022
Y1 - 2022
N2 - This paper proposes streaming automatic speech recognition (ASR) with re-blocking processing based on integrated voice activity detection (VAD). End-to-end (E2E) ASR models are promising for practical ASR. One of the key issues in realizing such a system is the detection of voice segments to cope with streaming input. There are three challenges for speech segmentation in streaming applications: 1) the extra VAD module in addition to the ASR model increases the system complexity and the number of parameters, 2) inappropriate segmentation of speech for block-based streaming methods deteriorates the performance, 3) non-voice segments that are not discarded results in the increase of unnecessary computational costs. This paper proposes a model that integrates a VAD branch into a block processing-based streaming ASR system and a re-blocking technique to avoid inappropriate isolation of the utterances. Experiments show that the proposed method reduces the detection error rate (ER) by 25.8% on the AMI dataset with a less than 1% of increase in the number of parameters. Furthermore, the proposed method show 7.5% relative improvement in character error rate (CER) on the CSJ dataset with 27.3% reduction in real-time factor (RTF).
AB - This paper proposes streaming automatic speech recognition (ASR) with re-blocking processing based on integrated voice activity detection (VAD). End-to-end (E2E) ASR models are promising for practical ASR. One of the key issues in realizing such a system is the detection of voice segments to cope with streaming input. There are three challenges for speech segmentation in streaming applications: 1) the extra VAD module in addition to the ASR model increases the system complexity and the number of parameters, 2) inappropriate segmentation of speech for block-based streaming methods deteriorates the performance, 3) non-voice segments that are not discarded results in the increase of unnecessary computational costs. This paper proposes a model that integrates a VAD branch into a block processing-based streaming ASR system and a re-blocking technique to avoid inappropriate isolation of the utterances. Experiments show that the proposed method reduces the detection error rate (ER) by 25.8% on the AMI dataset with a less than 1% of increase in the number of parameters. Furthermore, the proposed method show 7.5% relative improvement in character error rate (CER) on the CSJ dataset with 27.3% reduction in real-time factor (RTF).
UR - http://www.scopus.com/inward/record.url?scp=85140078041&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85140078041&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2022-11216
DO - 10.21437/Interspeech.2022-11216
M3 - Conference article
AN - SCOPUS:85140078041
SN - 2308-457X
VL - 2022-September
SP - 4641
EP - 4645
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022
Y2 - 18 September 2022 through 22 September 2022
ER -