TY - GEN
T1 - Two sample T-tests for IR evaluation
T2 - 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2016
AU - Sakai, Tetsuya
N1 - Publisher Copyright: © 2016 ACM.
PY - 2016/7/7
Y1 - 2016/7/7
AB - There are two well-known versions of the t-test for comparing means from unpaired data: Student's t-test and Welch's t-test. While Welch's t-test does not assume homoscedasticity (i.e., equal variances), it involves approximations. A classical textbook recommendation would be to use Student's t-test if either the two sample sizes are similar or the two sample variances are similar, and to use Welch's t-test only when both of the above conditions are violated. However, a more recent recommendation seems to be to use Welch's t-test unconditionally. Using past data from both TREC and NTCIR, the present study demonstrates that the latter advice should not be followed blindly in the context of IR system evaluation. More specifically, our results suggest that if the sample sizes differ substantially and if the larger sample has a substantially larger variance, Welch's t-test may not be reliable.
KW - Statistical significance
KW - Test collections
KW - Topics
KW - Variances
UR - http://www.scopus.com/inward/record.url?scp=84980398049&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84980398049&partnerID=8YFLogxK
U2 - 10.1145/2911451.2914684
DO - 10.1145/2911451.2914684
M3 - Conference contribution
AN - SCOPUS:84980398049
T3 - SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval
SP - 1045
EP - 1048
BT - SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval
PB - Association for Computing Machinery, Inc
Y2 - 17 July 2016 through 21 July 2016
ER -