TY - GEN
T1 - Timing generating networks
T2 - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
AU - Fujie, Shinya
AU - Katayama, Hayato
AU - Sakuma, Jin
AU - Kobayashi, Tetsunori
N1 - Funding Information:
This research was supported by the NII CRIS collaborative research program operated by NII CRIS and LINE Corporation.
Publisher Copyright:
© 2021 ISCA
PY - 2021
Y1 - 2021
N2 - A novel neural-network-based framework for precise timing generation, named the Timing Generating Network (TGN), is proposed and applied to turn-taking timing decision problems. Turn-taking problems have conventionally been formalized as detection of the user's end of turn, but this approach cannot estimate the precise timing at which a spoken dialogue system should take a turn and start its utterance. Several conventional approaches do estimate precise timings, but because the estimation is executed only at or after the end of the preceding user utterance, they depend heavily on the accuracy of intermediate decision modules such as voice activity detection. The TGN has two advantages: its parameters are tunable via error backpropagation because the network as a whole is described in a differentiable form, and it is free from inter-module error propagation because it has no deterministic intermediate modules. Experimental results show that the proposed system is superior to a conventional turn-taking system that makes hard decisions on the user's voice activity detection and response time estimation.
KW - Spoken dialogue system
KW - Timing control
KW - Turn taking
UR - http://www.scopus.com/inward/record.url?scp=85119195432&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85119195432&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2021-874
DO - 10.21437/Interspeech.2021-874
M3 - Conference contribution
AN - SCOPUS:85119195432
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 3771
EP - 3775
BT - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
PB - International Speech Communication Association
Y2 - 30 August 2021 through 3 September 2021
ER -