TY - GEN
T1 - Which diversity evaluation measures are “good”?
AU - Sakai, Tetsuya
AU - Zeng, Zhaohao
N1 - Publisher Copyright:
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2019/7/18
Y1 - 2019/7/18
N2 - This study evaluates 30 IR evaluation measures or their instances, of which nine are for adhoc IR and 21 are for diversified IR, primarily from the viewpoint of whether their preferences of one SERP (search engine result page) over another actually align with users' preferences. The gold preferences were contructed by hiring 15 assessors, who independently examined 1,127 SERP pairs and made preference assessments. Two sets of preference assessments were obtained: one based on a relevance question “Which SERP is more relevant?” and the other based on a diversity question “Which SERP is likely to satisfy a higher number of users?” To our knowledge, our study is the first to have collected diversity preference assessments in this way and evaluated diversity measures successfully. Our main results are that (a) Popular adhoc IR measures such as nDCG actually align quite well with the gold relevance preferences; and that (b) While the D#-measures align well with the gold diversity preferences, intent-aware measures perform relatively poorly. Moreover, as by-products of our analysis of existing evaluation measures, we define new adhoc measures called iRBU (intentwise Rank-Biased Utility) and EBR (Expected Blended Ratio); we demonstrate that an instance of iRBU performs as well as nDCG when compared to the gold relevance preferences. On the other hand, the original RBU, a recently-proposed diversity measure, underperforms the best D#-measures when compared to the gold diversity preferences.
AB - This study evaluates 30 IR evaluation measures or their instances, of which nine are for adhoc IR and 21 are for diversified IR, primarily from the viewpoint of whether their preferences of one SERP (search engine result page) over another actually align with users' preferences. The gold preferences were contructed by hiring 15 assessors, who independently examined 1,127 SERP pairs and made preference assessments. Two sets of preference assessments were obtained: one based on a relevance question “Which SERP is more relevant?” and the other based on a diversity question “Which SERP is likely to satisfy a higher number of users?” To our knowledge, our study is the first to have collected diversity preference assessments in this way and evaluated diversity measures successfully. Our main results are that (a) Popular adhoc IR measures such as nDCG actually align quite well with the gold relevance preferences; and that (b) While the D#-measures align well with the gold diversity preferences, intent-aware measures perform relatively poorly. Moreover, as by-products of our analysis of existing evaluation measures, we define new adhoc measures called iRBU (intentwise Rank-Biased Utility) and EBR (Expected Blended Ratio); we demonstrate that an instance of iRBU performs as well as nDCG when compared to the gold relevance preferences. On the other hand, the original RBU, a recently-proposed diversity measure, underperforms the best D#-measures when compared to the gold diversity preferences.
KW - Evaluation measures
KW - Search result diversification
KW - User preferences
UR - http://www.scopus.com/inward/record.url?scp=85073795753&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85073795753&partnerID=8YFLogxK
U2 - 10.1145/3331184.3331215
DO - 10.1145/3331184.3331215
M3 - Conference contribution
AN - SCOPUS:85073795753
T3 - SIGIR 2019 - Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval
SP - 595
EP - 604
BT - SIGIR 2019 - Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval
PB - Association for Computing Machinery, Inc
T2 - 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2019
Y2 - 21 July 2019 through 25 July 2019
ER -