TY - JOUR
T1 - Towards automatic evaluation of multi-turn dialogues
T2 - 8th International Workshop on Evaluating Information Access, EVIA 2017
AU - Sakai, Tetsuya
N1 - Publisher Copyright: © 2017 Copyright held by the author.
PY - 2017
Y1 - 2017
N2 - This paper proposes a design of a shared task whose ultimate goal is automatic evaluation of multi-turn, dyadic, textual helpdesk dialogues. The proposed task takes the form of an offline evaluation, where participating systems are given a dialogue as input, and output at least one of the following: (1) an estimated distribution of the annotators' quality ratings for that dialogue; and (2) an estimated distribution of the annotators' nugget type labels for each utterance block (i.e., a maximal sequence of consecutive posts by the same utterer) in that dialogue. This shared task should help researchers build automatic helpdesk dialogue systems that respond appropriately to inquiries by considering the diverse views of customers. The proposed task has been accepted as part of the NTCIR-14 Short Text Conversation (STC-3) task. While estimated and gold distributions are traditionally compared by means of root mean squared error, Jensen-Shannon divergence and the like, we propose a pilot measure that considers the order of the probability bins for the dialogue quality subtask, which we call Symmetric Normalised Order-aware Divergence (SNOD).
KW - Dialogues
KW - Divergence
KW - Evaluation
KW - Nuggets
KW - Probability distributions
KW - Test collections
UR - http://www.scopus.com/inward/record.url?scp=85038882038&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85038882038&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85038882038
SN - 1613-0073
VL - 2008
SP - 24
EP - 30
JO - CEUR Workshop Proceedings
JF - CEUR Workshop Proceedings
Y2 - 5 December 2017
ER -