TY - GEN
T1 - Missing feature speech recognition in a meeting situation with maximum SNR beamforming
AU - Kolossa, Dorothea
AU - Araki, Shoko
AU - Delcroix, Marc
AU - Nakatani, Tomohiro
AU - Orglmeister, Reinhold
AU - Makino, Shoji
PY - 2008
Y1 - 2008
AB - Especially for tasks like automatic meeting transcription, it would be useful to recognize speech automatically even while multiple speakers are talking simultaneously. For this purpose, speech separation can be performed, for example by using maximum SNR beamforming. However, even when good interferer suppression is attained, the interfering speech will still be recognizable during those intervals where the target speaker is silent. To avoid the resulting insertion errors, a new soft masking scheme is proposed, which works in the time domain by strongly damping those time periods where the observed direction of arrival does not correspond to that of the target speaker. Even though the masking scheme is aggressive, missing feature recognition allows the recognition accuracy to be improved significantly, with relative error reductions on the order of 60% compared to maximum SNR beamforming alone, and the approach also succeeds with three simultaneously active speakers. Results are reported based on the SOLON speech recognizer, NTT's large vocabulary system [1], which is applied here to the recognition of artificially mixed data using real-room impulse responses and the entire clean test set of the Aurora 2 database.
UR - http://www.scopus.com/inward/record.url?scp=51749112218&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=51749112218&partnerID=8YFLogxK
DO - 10.1109/ISCAS.2008.4542143
M3 - Conference contribution
AN - SCOPUS:51749112218
SN - 9781424416844
T3 - Proceedings - IEEE International Symposium on Circuits and Systems
SP - 3218
EP - 3221
BT - 2008 IEEE International Symposium on Circuits and Systems, ISCAS 2008
T2 - 2008 IEEE International Symposium on Circuits and Systems, ISCAS 2008
Y2 - 18 May 2008 through 21 May 2008
ER -