TY - JOUR
T1 - Evaluating evaluation measures with worst-case confidence interval widths
AU - Sakai, Tetsuya
PY - 2017/1/1
Y1 - 2017/1/1
N2 - IR evaluation measures are often compared in terms of rank correlation between two system rankings, agreement with the users' preferences, the swap method, and discriminative power. While we view the agreement with real users as the most important, this paper proposes to use the Worst-case Confidence interval Width (WCW) curves to supplement it in test-collection environments. WCW is the worst-case width of a confidence interval (CI) for the difference between any two systems, given a topic set size. We argue that WCW curves are more useful than the swap method and discriminative power, since they provide a statistically well-founded overview of the comparison of measures over various topic set sizes, and visualise what levels of differences across measures might be of practical importance. First, we prove that Sakai's ANOVA-based topic set size design tool can be used for discussing WCW instead of his CI-based tool that cannot handle large topic set sizes. We then provide some case studies of evaluating evaluation measures using WCW curves based on the ANOVA-based tool, using data from TREC and NTCIR.
KW - ANOVA
KW - Confidence intervals
KW - Effect sizes
KW - Evaluation measures
KW - P-values
KW - Sample sizes
KW - Statistical significance
UR - http://www.scopus.com/inward/record.url?scp=85038855715&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85038855715&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85038855715
SN - 1613-0073
VL - 2008
SP - 16
EP - 19
JO - CEUR Workshop Proceedings
JF - CEUR Workshop Proceedings
T2 - 8th International Workshop on Evaluating Information Access, EVIA 2017
Y2 - 5 December 2017
ER -