We propose neural networks for predicting the response timing of spoken dialog systems. Response timing varies with the dialog context. Conventionally, this context-dependent response timing is estimated directly from acoustic event sequences and word sequences extracted from past utterances. Because these sequences vary so widely, large amounts of training data are required to build reliable models. However, no large dialog databases annotated with response timing are available. The proposed method estimates the dialog act of each utterance as an auxiliary task and uses the intermediate states of that estimation, in addition to acoustic and linguistic features, for response timing estimation. Since dialog acts vary significantly less than word sequences and are closely related to response timing, we expect to be able to build a highly reliable model even with a small amount of training data. We evaluate our approach on the HARPERVALLEYBANK corpus. The experimental results show that the proposed approach is more effective than the conventional approach, which does not use dialog act information for each utterance.
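The auxiliary-task idea can be illustrated with a minimal sketch. The abstract does not specify layer types, feature extraction, or the loss, so everything below is an assumption made for illustration: the class name `TimingWithDialogAct`, the single tanh encoder layer, the squared-error timing loss, and the `aux_weight` parameter are all hypothetical and are not the authors' architecture. The sketch only shows the data flow the abstract describes: a dialog act classifier whose intermediate state is concatenated with the input features before the timing estimate is made.

```python
import math
import random


def linear(x, w, b):
    """Affine map; w is a list of rows, one row per output unit."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def init(n_out, n_in, rng):
    """Small random weights and zero biases for one layer."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out


class TimingWithDialogAct:
    """Hypothetical model: dialog act estimation as an auxiliary task."""

    def __init__(self, n_feat, n_hidden, n_acts, seed=0):
        rng = random.Random(seed)
        self.enc = init(n_hidden, n_feat, rng)        # encoder for dialog act estimation
        self.act_head = init(n_acts, n_hidden, rng)   # auxiliary dialog act classifier
        # The timing head sees the input features AND the intermediate state.
        self.time_head = init(1, n_feat + n_hidden, rng)

    def forward(self, feats):
        # feats: list of floats standing in for acoustic and linguistic features
        hidden = [math.tanh(v) for v in linear(feats, *self.enc)]
        act_logits = linear(hidden, *self.act_head)
        timing = linear(feats + hidden, *self.time_head)[0]
        return timing, act_logits, hidden


def joint_loss(timing, target_timing, act_logits, target_act, aux_weight=0.5):
    """Squared error on timing plus weighted cross-entropy on the dialog act."""
    mse = (timing - target_timing) ** 2
    m = max(act_logits)  # subtract the max for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(v - m) for v in act_logits))
    cross_entropy = log_z - act_logits[target_act]
    return mse + aux_weight * cross_entropy


model = TimingWithDialogAct(n_feat=4, n_hidden=3, n_acts=5)
timing, act_logits, hidden = model.forward([0.1, 0.2, 0.3, 0.4])
loss = joint_loss(timing, target_timing=0.5, act_logits=act_logits, target_act=2)
```

In this sketch, setting `aux_weight` to zero reduces the loss to the timing error alone; note that the timing head still receives the encoder state, so removing the dialog act input entirely would also require dropping `hidden` from the timing head's input.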
Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Published - 2022
23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022 - Incheon, Republic of Korea
Duration: 18 Sep 2022 → 22 Sep 2022