TY - GEN
T1 - Auxiliary loss function for target speech extraction and recognition with weak supervision based on speaker characteristics
AU - Zmolikova, Katerina
AU - Delcroix, Marc
AU - Raj, Desh
AU - Watanabe, Shinji
AU - Černocký, Jan
N1 - Funding Information:
The work reported here was started at JSALT 2020 at JHU, supported by Microsoft, Amazon and Google. We would like to thank Pavel Denisov, Christoph Boeddeker, Thilo von Neumann, Tobias Cord-Landwehr, Zili Huang and Maokui He for their help during the workshop. K. Zmolikova was partly supported by the Czech Ministry of Education, Youth and Sports under project no. LTAIN19087 "Multi-linguality in speech technologies". Part of the high-performance computation was run on the IT4I supercomputer and was supported by the Ministry of Education, Youth and Sports of the Czech Republic through e-INFRA CZ (ID:90140).
Publisher Copyright:
Copyright © 2021 ISCA.
PY - 2021
Y1 - 2021
N2 - Automatic speech recognition systems deteriorate in the presence of overlapped speech. A popular approach to alleviating this is target speech extraction. The extraction system is usually trained with a loss function measuring the discrepancy between the estimated and the reference target speech. This often introduces distortions into the target signal that are detrimental to recognition accuracy. Additionally, it requires the strong supervision provided by parallel data consisting of speech mixtures and single-speaker signals. We propose an auxiliary loss function for retraining the target speech extraction system. It is composed of two parts: first, a speaker identity loss, forcing the estimated speech to have the correct speaker characteristics, and second, a mixture consistency loss, making the extracted sources sum back to the original mixture. The only supervision required for the proposed loss is speaker characteristics obtained from several segments spoken by the target speaker. Such weak supervision makes the loss suitable for adapting the system directly on real recordings. We show that the proposed loss yields signals more suitable for speech recognition and, further, that we can gain additional improvements by adaptation to target data. Overall, we can reduce the word error rate on the LibriCSS dataset from 27.4% to 24.0%.
AB - Automatic speech recognition systems deteriorate in the presence of overlapped speech. A popular approach to alleviating this is target speech extraction. The extraction system is usually trained with a loss function measuring the discrepancy between the estimated and the reference target speech. This often introduces distortions into the target signal that are detrimental to recognition accuracy. Additionally, it requires the strong supervision provided by parallel data consisting of speech mixtures and single-speaker signals. We propose an auxiliary loss function for retraining the target speech extraction system. It is composed of two parts: first, a speaker identity loss, forcing the estimated speech to have the correct speaker characteristics, and second, a mixture consistency loss, making the extracted sources sum back to the original mixture. The only supervision required for the proposed loss is speaker characteristics obtained from several segments spoken by the target speaker. Such weak supervision makes the loss suitable for adapting the system directly on real recordings. We show that the proposed loss yields signals more suitable for speech recognition and, further, that we can gain additional improvements by adaptation to target data. Overall, we can reduce the word error rate on the LibriCSS dataset from 27.4% to 24.0%.
KW - Long recordings
KW - SpeakerBeam
KW - Target speech extraction
KW - Weakly supervised loss
UR - http://www.scopus.com/inward/record.url?scp=85119261659&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85119261659&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2021-986
DO - 10.21437/Interspeech.2021-986
M3 - Conference contribution
AN - SCOPUS:85119261659
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 4156
EP - 4160
BT - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
PB - International Speech Communication Association
T2 - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Y2 - 30 August 2021 through 3 September 2021
ER -