TY - JOUR
T1 - A Closer Look at Evaluation Measures for Ordinal Quantification
AU - Sakai, Tetsuya
N1 - Funding Information:
We thank the reviewers of the LQ 2021 workshop for their feedback on the initial version of this paper, and the organisers of the workshop for giving us the opportunity to publish our work.
Publisher Copyright:
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)
PY - 2021
Y1 - 2021
N2 - In his ACL 2021 paper [1], Sakai compared several evaluation measures in the context of Ordinal Quantification (OQ) tasks in terms of system ranking similarity, system ranking consistency (i.e., robustness to the choice of test data), and discriminative power (i.e., ability to find many statistically significant differences). Based on his experimental results, he recommended the use of his RNOD (Root Normalised Order-aware Divergence) measure along with NMD (Normalised Match Distance, i.e., normalised Earth Mover's Distance). The present study follows up on his discriminative power experiments by taking a much closer look at the statistical significance test results obtained from each evaluation measure. Our new analyses show that (1) RNOD is the overall winner among the OQ measures in terms of pooled discriminative power (i.e., discriminative power across multiple data sets); (2) NMD behaves noticeably differently from RNOD and from measures that cannot handle ordinal classes; and (3) NMD tends to favour a popularity-based baseline (which accesses the gold distributions) over a uniform-distribution baseline, thus contradicting the other measures in terms of statistical significance. As both RNOD and NMD have their merits, we recommend that the organisers of OQ tasks use both of them to evaluate systems from multiple angles.
KW - Distributions
KW - Evaluation
KW - Evaluation measures
KW - Ordinal classes
KW - Ordinal quantification
KW - Prevalence estimation
UR - http://www.scopus.com/inward/record.url?scp=85122860963&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85122860963&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85122860963
SN - 1613-0073
VL - 3052
JO - CEUR Workshop Proceedings
JF - CEUR Workshop Proceedings
T2 - 2021 International Conference on Information and Knowledge Management Workshops, CIKMW 2021
Y2 - 1 November 2021 through 5 November 2021
ER -