TY - GEN
T1 - Good Evaluation Measures based on Document Preferences
AU - Sakai, Tetsuya
AU - Zeng, Zhaohao
N1 - Publisher Copyright: © 2020 ACM.
PY - 2020/7/25
Y1 - 2020/7/25
N2 - For offline evaluation of IR systems, some researchers have proposed to utilise pairwise document preference assessments instead of relevance assessments of individual documents, as it may be easier for assessors to make relative decisions rather than absolute ones. Simple preference-based evaluation measures such as ppref and wpref have been proposed, but the past decade did not see any wide use of such measures. One reason for this may be that, while these new measures have been reported to behave more or less similarly to traditional measures based on absolute assessments, whether they actually align with the users' perception of search engine result pages (SERPs) has been unknown. The present study addresses exactly this question, after formally defining two classes of preference-based measures called Pref measures and Δ-measures. We show that the best of these measures perform at least as well as an average assessor in terms of agreement with users' SERP preferences, and that implicit document preferences (i.e., those suggested by a SERP that retrieves one document but not the other) play a much more important role than explicit preferences (i.e., those suggested by a SERP that retrieves one document above the other). We have released our data set containing 119,646 document preferences, so that the feasibility of document preference-based evaluation can be further pursued by the IR community.
AB - For offline evaluation of IR systems, some researchers have proposed to utilise pairwise document preference assessments instead of relevance assessments of individual documents, as it may be easier for assessors to make relative decisions rather than absolute ones. Simple preference-based evaluation measures such as ppref and wpref have been proposed, but the past decade did not see any wide use of such measures. One reason for this may be that, while these new measures have been reported to behave more or less similarly to traditional measures based on absolute assessments, whether they actually align with the users' perception of search engine result pages (SERPs) has been unknown. The present study addresses exactly this question, after formally defining two classes of preference-based measures called Pref measures and Δ-measures. We show that the best of these measures perform at least as well as an average assessor in terms of agreement with users' SERP preferences, and that implicit document preferences (i.e., those suggested by a SERP that retrieves one document but not the other) play a much more important role than explicit preferences (i.e., those suggested by a SERP that retrieves one document above the other). We have released our data set containing 119,646 document preferences, so that the feasibility of document preference-based evaluation can be further pursued by the IR community.
KW - adhoc retrieval
KW - document preferences
KW - evaluation measures
KW - preference assessments
KW - SERP preferences
UR - http://www.scopus.com/inward/record.url?scp=85090167400&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85090167400&partnerID=8YFLogxK
U2 - 10.1145/3397271.3401115
DO - 10.1145/3397271.3401115
M3 - Conference contribution
AN - SCOPUS:85090167400
T3 - SIGIR 2020 - Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval
SP - 359
EP - 368
BT - SIGIR 2020 - Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval
PB - Association for Computing Machinery, Inc
T2 - 43rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2020
Y2 - 25 July 2020 through 30 July 2020
ER -