On the reliability of factoid question answering evaluation

Tetsuya Sakai*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

2 Citations (Scopus)


This paper compares some existing evaluation metrics for factoid question answering (QA) from the viewpoint of stability and sensitivity, using the NTCIR-4 QAC2 Japanese factoid QA tasks and the Buckley/Voorhees stability method and Voorhees/Buckley swap method. Our main findings are: (1) For QA evaluation with ranked lists containing up to five answers, the fraction of questions with a correct answer within the top 5 (NQcorrect5) and the fraction with a correct answer at rank 1 (NQcorrect1) are not as stable and sensitive as reciprocal rank. (2) Q-measure, which can handle multiple correct answers and answer correctness levels, is at least as stable and sensitive as reciprocal rank, provided that a mild gain value assignment is used. Emphasizing answer correctness levels tends to hurt stability and sensitivity, while handling multiple correct answers improves them. As our experimental methods are language-independent, we believe that these findings apply to QA in languages other than Japanese as well.
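As an illustration of the simpler metrics named in the abstract, the following is a minimal sketch, not the paper's implementation. It assumes each question is summarized by the rank of its first correct answer (or None when no correct answer is returned). The function names and the sample ranks are hypothetical. Q-measure is left out because the abstract does not give its definition.

```python
from typing import Optional, Sequence


def reciprocal_rank(first_correct_rank: Optional[int]) -> float:
    """Return 1/rank of the first correct answer, or 0.0 if there is none."""
    if first_correct_rank is None:
        return 0.0
    return 1.0 / first_correct_rank


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Average reciprocal rank over all questions."""
    return sum(reciprocal_rank(r) for r in ranks) / len(ranks)


def nq_correct(ranks: Sequence[Optional[int]], k: int) -> float:
    """Fraction of questions with a correct answer within the top k."""
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


# Hypothetical data: rank of the first correct answer for five questions.
# None means the system returned no correct answer for that question.
sample_ranks = [1, 3, None, 2, 5]

nq1 = nq_correct(sample_ranks, 1)   # counts only rank-1 hits
nq5 = nq_correct(sample_ranks, 5)   # counts any hit within the top 5
mrr = mean_reciprocal_rank(sample_ranks)
```

On this sample, NQcorrect1 and NQcorrect5 treat a hit at rank 2 and a hit at rank 5 identically, whereas reciprocal rank separates them. That finer granularity is one plausible reading of why the abstract reports reciprocal rank as the more sensitive metric.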

Original language: English
Article number: 1227853
Journal: ACM Transactions on Asian Language Information Processing
Issue number: 1
Publication status: Published - 2007 Apr 1
Externally published: Yes


  • Evaluation metrics
  • Question answering

ASJC Scopus subject areas

  • General Computer Science
