TY - GEN
T1 - Randomised vs. Prioritised Pools for Relevance Assessments
T2 - 15th Asia Information Retrieval Societies Conference, AIRS 2019
AU - Sakai, Tetsuya
AU - Xiao, Peng
N1 - Funding Information:
This research was partially supported by Chiang Mai University, Thailand.
Publisher Copyright:
© Springer Nature Switzerland AG 2020.
PY - 2020
Y1 - 2020
N2 - The present study concerns depth-k pooling for building IR test collections. At TREC, pooled documents are traditionally presented in random order to the assessors to avoid judgement bias. In contrast, an approach that has been used widely at NTCIR is to prioritise the pooled documents based on “pseudorelevance,” in the hope of letting assessors quickly form an idea as to what constitutes a relevant document and thereby judge more efficiently and reliably. While the recent TREC 2017 Common Core Track went beyond depth-k pooling and adopted a method for selecting documents to judge dynamically, even this task let the assessors process the usual depth-10 pools first: the idea was to give the assessors a “burn-in” period, which actually appears to echo the view of the NTCIR approach. Our research questions are: (1) Which depth-k ordering strategy enables more efficient assessments? Randomisation, or prioritisation by pseudorelevance? (2) Similarly, which of the two strategies enables higher inter-assessor agreements? Our experiments based on two English web search test collections with multiple sets of graded relevance assessments suggest that randomisation outperforms prioritisation in both respects on average, although the results are statistically inconclusive. We then discuss a plan for a much larger experiment with sufficient statistical power to obtain the final verdict.
KW - Evaluation
KW - Graded relevance
KW - Pooling
KW - Relevance assessments
KW - Web search
UR - http://www.scopus.com/inward/record.url?scp=85082381862&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85082381862&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-42835-8_9
DO - 10.1007/978-3-030-42835-8_9
M3 - Conference contribution
AN - SCOPUS:85082381862
SN - 9783030428341
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 94
EP - 105
BT - Information Retrieval Technology - 15th Asia Information Retrieval Societies Conference, AIRS 2019, Proceedings
A2 - Wang, Fu Lee
A2 - Xie, Haoran
A2 - Lam, Wai
A2 - Sun, Aixin
A2 - Ku, Lun-Wei
A2 - Hao, Tianyong
A2 - Chen, Wei
A2 - Wong, Tak-Lam
A2 - Tao, Xiaohui
PB - Springer
Y2 - 7 November 2019 through 9 November 2019
ER -