TY - GEN
T1 - End-to-End Automatic Speech Recognition Integrated with CTC-Based Voice Activity Detection
AU - Yoshimura, Takenori
AU - Hayashi, Tomoki
AU - Takeda, Kazuya
AU - Watanabe, Shinji
N1 - Funding Information:
This research is supported by the Center of Innovation (COI) program from the Japan Science and Technology Agency (JST). The authors thank Vimal Manohar for kindly providing us with a DNN-based voice activity detector.
Publisher Copyright:
© 2020 IEEE.
PY - 2020/5
Y1 - 2020/5
N2 - This paper integrates a voice activity detection (VAD) function with end-to-end automatic speech recognition toward an online speech interface and the transcription of very long audio recordings. We focus on connectionist temporal classification (CTC) and its extension, the hybrid CTC/attention architecture. In contrast to an attention-based architecture, input-synchronous label prediction can be performed based on a greedy search over the CTC (pre-)softmax output. This prediction includes long runs of consecutive blank labels, which can be regarded as non-speech regions. We use these labels as a cue for detecting speech segments with simple thresholding. The threshold value is directly related to the length of a non-speech region, which is more intuitive and easier to control than the hyperparameters of conventional VAD methods. Experimental results on unsegmented data show that the proposed method outperformed baselines using conventional energy-based and neural-network-based VAD and achieved a real-time factor (RTF) of less than 0.2. The proposed method is publicly available.
AB - This paper integrates a voice activity detection (VAD) function with end-to-end automatic speech recognition toward an online speech interface and the transcription of very long audio recordings. We focus on connectionist temporal classification (CTC) and its extension, the hybrid CTC/attention architecture. In contrast to an attention-based architecture, input-synchronous label prediction can be performed based on a greedy search over the CTC (pre-)softmax output. This prediction includes long runs of consecutive blank labels, which can be regarded as non-speech regions. We use these labels as a cue for detecting speech segments with simple thresholding. The threshold value is directly related to the length of a non-speech region, which is more intuitive and easier to control than the hyperparameters of conventional VAD methods. Experimental results on unsegmented data show that the proposed method outperformed baselines using conventional energy-based and neural-network-based VAD and achieved a real-time factor (RTF) of less than 0.2. The proposed method is publicly available.
KW - CTC greedy search
KW - speech recognition
KW - end-to-end
KW - streaming
KW - voice activity detection
UR - http://www.scopus.com/inward/record.url?scp=85089208881&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85089208881&partnerID=8YFLogxK
U2 - 10.1109/ICASSP40776.2020.9054358
DO - 10.1109/ICASSP40776.2020.9054358
M3 - Conference contribution
AN - SCOPUS:85089208881
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 6999
EP - 7003
BT - 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020
Y2 - 4 May 2020 through 8 May 2020
ER -