Comparing metrics across TREC and NTCIR: The robustness to system bias

Tetsuya Sakai*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference contribution

24 Citations (Scopus)


Test collections are growing larger, and relevance data constructed through pooling are suspected of becoming more and more incomplete and biased. Several studies have used evaluation metrics specifically designed to handle this problem, but most of them have only examined the metrics under incomplete but unbiased conditions, using random samples of the original relevance data. This paper examines nine metrics in a more realistic setting, by reducing the number of pooled systems. Even though previous work has shown that metrics based on a condensed list, obtained by removing all unjudged documents from the original ranked list, are effective for handling very incomplete but unbiased relevance data, we show that these results do not hold in the presence of system bias. In our experiments using TREC and NTCIR data, we first show that condensed-list metrics overestimate new systems while traditional metrics underestimate them, and that the overestimation tends to be larger than the underestimation. We then show that, when relevance data is heavily biased towards a single team or a few teams, the condensed-list versions of Average Precision (AP), Q-measure (Q) and normalised Discounted Cumulative Gain (nDCG), which we call AP', Q' and nDCG', are not necessarily superior to the original metrics in terms of discriminative power, i.e., the overall ability to detect pairwise statistical significance. Nevertheless, even under system bias, AP' and Q' are generally more discriminative than bpref and the condensed-list version of Rank-Biased Precision (RBP), which we call RBP'.

Original languageEnglish
Title of host publicationProceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08
Number of pages10
Publication statusPublished - 2008
Externally publishedYes
Event17th ACM Conference on Information and Knowledge Management, CIKM'08 - Napa Valley, CA, United States
Duration: 2008 Oct 262008 Oct 30

Publication series

NameInternational Conference on Information and Knowledge Management, Proceedings


Conference17th ACM Conference on Information and Knowledge Management, CIKM'08
Country/TerritoryUnited States
CityNapa Valley, CA


  • Evaluation metrics
  • Graded relevance
  • Test collection

ASJC Scopus subject areas

  • Decision Sciences(all)
  • Business, Management and Accounting(all)


Dive into the research topics of 'Comparing metrics across TREC and NTCIR: The robustness to system bias'. Together they form a unique fingerprint.

Cite this