TY - GEN
T1 - Dictation of multiparty conversation using statistical turn taking model and speaker model
AU - Murai, Noriyuki
AU - Kobayashi, Tetsunori
PY - 2000/1/1
Y1 - 2000/1/1
N2 - A new speech decoder dealing with multiparty conversation is proposed. Multiparty conversation denotes a situation in which many speakers talk to each other. Almost all conventional speech recognition systems assume that the input data consist of a single speaker's voice. However, some applications, such as dialogue dictation and voice interfaces for multiple users, have to deal with mixed speakers' voices. In such a situation, the system has to recognize not only the word sequence of the input speech but also the speaker of each part of it. Therefore, we propose a decoder that utilizes not only an acoustic model and a language model, which are the resources of a conventional single-user speech decoder, but also a statistical turn-taking model and speaker models to recognize speech. This framework realizes simultaneous maximum likelihood estimation of the spoken word sequence and the speaker sequence. Experimental results using TV sports news show that the proposed method reduces the word error rate by 7.7% and the speaker error rate by 97.8% compared to the conventional method.
AB - A new speech decoder dealing with multiparty conversation is proposed. Multiparty conversation denotes a situation in which many speakers talk to each other. Almost all conventional speech recognition systems assume that the input data consist of a single speaker's voice. However, some applications, such as dialogue dictation and voice interfaces for multiple users, have to deal with mixed speakers' voices. In such a situation, the system has to recognize not only the word sequence of the input speech but also the speaker of each part of it. Therefore, we propose a decoder that utilizes not only an acoustic model and a language model, which are the resources of a conventional single-user speech decoder, but also a statistical turn-taking model and speaker models to recognize speech. This framework realizes simultaneous maximum likelihood estimation of the spoken word sequence and the speaker sequence. Experimental results using TV sports news show that the proposed method reduces the word error rate by 7.7% and the speaker error rate by 97.8% compared to the conventional method.
UR - http://www.scopus.com/inward/record.url?scp=0033677162&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0033677162&partnerID=8YFLogxK
U2 - 10.1109/ICASSP.2000.861980
DO - 10.1109/ICASSP.2000.861980
M3 - Conference contribution
AN - SCOPUS:0033677162
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 1575
EP - 1578
BT - Speech Processing II
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 25th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2000
Y2 - 5 June 2000 through 9 June 2000
ER -