TY - GEN
T1 - Evaluating Relevance Judgments with Pairwise Discriminative Power
AU - Chu, Zhumin
AU - Mao, Jiaxin
AU - Zhang, Fan
AU - Liu, Yiqun
AU - Sakai, Tetsuya
AU - Zhang, Min
AU - Ma, Shaoping
N1 - Funding Information:
This work is supported by the National Key Research and Development Program of China (2018YFC0831700), Natural Science Foundation of China (Grant No. 61732008, 61532011, 61902209, U2001212), Beijing Academy of Artificial Intelligence (BAAI), Tsinghua University Guoqiang Research Institute, Beijing Outstanding Young Scientist Program (No. BJJWZYJH012019100020098), and Intelligent Social Governance Platform, Major Innovation & Planning Interdisciplinary Platform for the “Double-First Class” Initiative, Renmin University of China.
Publisher Copyright:
© 2021 ACM.
PY - 2021/10/26
Y1 - 2021/10/26
N2 - Relevance judgments play an essential role in the evaluation of information retrieval systems. As many different relevance judgment settings have been proposed in recent years, an evaluation metric that compares relevance judgments collected under different annotation settings has become a necessity. Traditional metrics, such as κ, Krippendorff's α, and φ, have mainly focused on inter-assessor consistency to evaluate the quality of relevance judgments. They encounter a "reliable but useless" problem when employed to compare different annotation settings (e.g., binary judgment vs. 4-grade judgment). Meanwhile, other popular metrics such as discriminative power (DP) are not designed to compare relevance judgments across different annotation settings and therefore suffer from limitations such as requiring result ranking lists from different systems. How to design an evaluation metric that compares relevance judgments under different grade settings therefore needs further investigation. In this work, we propose a novel metric named pairwise discriminative power (PDP) to evaluate the quality of relevance judgment collections. By leveraging a small number of document-level preference tests, PDP estimates the ability of relevance judgments to separate ranking lists of varying quality. Through comprehensive experiments on both synthetic and real-world datasets, we show that PDP maintains a high degree of consistency with annotation quality across various grade settings. Compared with existing metrics (e.g., Krippendorff's α, φ, DP, etc.), it provides reliable evaluation results with affordable additional annotation effort.
AB - Relevance judgments play an essential role in the evaluation of information retrieval systems. As many different relevance judgment settings have been proposed in recent years, an evaluation metric that compares relevance judgments collected under different annotation settings has become a necessity. Traditional metrics, such as κ, Krippendorff's α, and φ, have mainly focused on inter-assessor consistency to evaluate the quality of relevance judgments. They encounter a "reliable but useless" problem when employed to compare different annotation settings (e.g., binary judgment vs. 4-grade judgment). Meanwhile, other popular metrics such as discriminative power (DP) are not designed to compare relevance judgments across different annotation settings and therefore suffer from limitations such as requiring result ranking lists from different systems. How to design an evaluation metric that compares relevance judgments under different grade settings therefore needs further investigation. In this work, we propose a novel metric named pairwise discriminative power (PDP) to evaluate the quality of relevance judgment collections. By leveraging a small number of document-level preference tests, PDP estimates the ability of relevance judgments to separate ranking lists of varying quality. Through comprehensive experiments on both synthetic and real-world datasets, we show that PDP maintains a high degree of consistency with annotation quality across various grade settings. Compared with existing metrics (e.g., Krippendorff's α, φ, DP, etc.), it provides reliable evaluation results with affordable additional annotation effort.
KW - evaluation metric
KW - preference test
KW - relevance judgment
UR - http://www.scopus.com/inward/record.url?scp=85119203129&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85119203129&partnerID=8YFLogxK
U2 - 10.1145/3459637.3482428
DO - 10.1145/3459637.3482428
M3 - Conference contribution
AN - SCOPUS:85119203129
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 261
EP - 270
BT - CIKM 2021 - Proceedings of the 30th ACM International Conference on Information and Knowledge Management
PB - Association for Computing Machinery
T2 - 30th ACM International Conference on Information and Knowledge Management, CIKM 2021
Y2 - 1 November 2021 through 5 November 2021
ER -