TY - GEN
T1 - Comparing two binned probability distributions for information access evaluation
AU - Sakai, Tetsuya
N1 - Publisher Copyright: © 2018 ACM.
PY - 2018/6/27
Y1 - 2018/6/27
AB - Some modern information access tasks such as natural language dialogue tasks are difficult to evaluate, for often there is no such thing as the ground truth: different users may have different opinions about the system's output. A few task designs for dialogue evaluation have been implemented and/or proposed recently, where both the ground truth data and the system's output are represented as a distribution of users' votes over bins on a non-nominal scale. The present study first points out that popular bin-by-bin measures such as Jensen-Shannon divergence and Sum of Squared Errors are clearly not adequate for such tasks, and that cross-bin measures should be used. Through experiments using artificial distributions as well as real ones from a dialogue evaluation task, we demonstrate that two cross-bin measures, namely, the Normalised Match Distance (NMD; a special case of the Earth Mover's Distance) and the Root Symmetric Normalised Order-aware Divergence (RSNOD), are indeed substantially different from the bin-by-bin measures. Furthermore, RSNOD lies between the popular bin-by-bin measures and NMD in terms of how it behaves. We recommend using both of these measures in the aforementioned type of evaluation tasks.
KW - Dialogue evaluation
KW - Earth mover's distance
KW - Evaluation measures
KW - Jensen-Shannon divergence
KW - Kullback-Leibler divergence
KW - Order-aware divergence
KW - Wasserstein distance
UR - http://www.scopus.com/inward/record.url?scp=85051518071&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85051518071&partnerID=8YFLogxK
U2 - 10.1145/3209978.3210073
DO - 10.1145/3209978.3210073
M3 - Conference contribution
AN - SCOPUS:85051518071
T3 - 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2018
SP - 1073
EP - 1076
BT - 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2018
PB - Association for Computing Machinery, Inc
T2 - 41st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2018
Y2 - 8 July 2018 through 12 July 2018
ER -