TY - GEN
T1 - On the Two-Sample Randomisation Test for IR Evaluation
AU - Sakai, Tetsuya
N1 - Publisher Copyright:
© 2021 ACM.
PY - 2021/7/11
Y1 - 2021/7/11
N2 - While previous work in comparing statistical significance tests for IR system evaluation have focused on paired data tests (e.g., for evaluating two systems using a common test collection), two-sample tests must be used when the reproducibility of IR experiments across different test collections must be examined. Using real runs and a test collection from the NTCIR-15 WWW-3 Task, the present study compares the properties of three two-sample significance tests for comparing two systems: Student's t-test (i.e., the classical parametric test), the Wilcoxon rank sum test (i.e., the classical nonparametric test), and the randomisation test (i.e., a population-free method that utilises modern computational power). In terms of the false positive rate (i.e., the chance of detecting a statistical significance even though the two samples of evaluation measure scores come from the same system), the three tests behave similarly, although the Wilcoxon rank sum test appears to be slightly more robust than the other two for very small topic set sizes (e.g., 10 topics each) with a large significance level (e.g., α=0.10). On the other hand, the t-test and the Wilcoxon rank sum test are very similar to each other from the following two viewpoints: "How often do they both detect a nonexistent difference?"and "How often do they both overlook a true difference?"Compared to the two classical significance tests, the randomisation test behaves markedly differently in terms of the above two viewpoints. Hence, we suggest that researchers should at least be aware of the above properties of the three two-sample tests when choosing from them.
AB - While previous work in comparing statistical significance tests for IR system evaluation have focused on paired data tests (e.g., for evaluating two systems using a common test collection), two-sample tests must be used when the reproducibility of IR experiments across different test collections must be examined. Using real runs and a test collection from the NTCIR-15 WWW-3 Task, the present study compares the properties of three two-sample significance tests for comparing two systems: Student's t-test (i.e., the classical parametric test), the Wilcoxon rank sum test (i.e., the classical nonparametric test), and the randomisation test (i.e., a population-free method that utilises modern computational power). In terms of the false positive rate (i.e., the chance of detecting a statistical significance even though the two samples of evaluation measure scores come from the same system), the three tests behave similarly, although the Wilcoxon rank sum test appears to be slightly more robust than the other two for very small topic set sizes (e.g., 10 topics each) with a large significance level (e.g., α=0.10). On the other hand, the t-test and the Wilcoxon rank sum test are very similar to each other from the following two viewpoints: "How often do they both detect a nonexistent difference?"and "How often do they both overlook a true difference?"Compared to the two classical significance tests, the randomisation test behaves markedly differently in terms of the above two viewpoints. Hence, we suggest that researchers should at least be aware of the above properties of the three two-sample tests when choosing from them.
KW - evaluation
KW - randomisation
KW - statistical significance
KW - two-sample tests
UR - http://www.scopus.com/inward/record.url?scp=85111707597&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85111707597&partnerID=8YFLogxK
U2 - 10.1145/3404835.3463002
DO - 10.1145/3404835.3463002
M3 - Conference contribution
AN - SCOPUS:85111707597
T3 - SIGIR 2021 - Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval
SP - 1980
EP - 1984
BT - SIGIR 2021 - Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval
PB - Association for Computing Machinery, Inc
T2 - 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2021
Y2 - 11 July 2021 through 15 July 2021
ER -