TY - GEN
T1 - Using graded-relevance metrics for evaluating community QA answer selection
AU - Sakai, Tetsuya
AU - Seki, Yohei
AU - Ishikawa, Daisuke
AU - Kuriyama, Kazuko
AU - Kando, Noriko
AU - Lin, Chin-Yew
PY - 2011
Y1 - 2011
N2 - Community Question Answering (CQA) sites such as Yahoo! Answers have emerged as rich knowledge resources for information seekers. However, answers posted to CQA sites can be irrelevant, incomplete, redundant, incorrect, biased, ill-formed or even abusive. Hence, automatic selection of "good" answers for a given posted question is a practical research problem that will help us manage the quality of accumulated knowledge. One way to evaluate answer selection systems for CQA would be to use the Best Answers (BAs) that are readily available from the CQA sites. However, BAs may be biased, and even if they are not, there may be other good answers besides BAs. To remedy these two problems, we propose system evaluation methods that involve multiple answer assessors and graded-relevance information retrieval metrics. Our main findings from experiments using the NTCIR-8 CQA task data are that, using our evaluation methods, (a) we can detect many substantial differences between systems that would have been overlooked by BA-based evaluation; and (b) we can better identify hard questions (i.e. those that are handled poorly by many systems and therefore require focussed investigation) compared to BA-based evaluation. We therefore argue that our approach is useful for building effective CQA answer selection systems despite the cost of manual answer assessments.
AB - Community Question Answering (CQA) sites such as Yahoo! Answers have emerged as rich knowledge resources for information seekers. However, answers posted to CQA sites can be irrelevant, incomplete, redundant, incorrect, biased, ill-formed or even abusive. Hence, automatic selection of "good" answers for a given posted question is a practical research problem that will help us manage the quality of accumulated knowledge. One way to evaluate answer selection systems for CQA would be to use the Best Answers (BAs) that are readily available from the CQA sites. However, BAs may be biased, and even if they are not, there may be other good answers besides BAs. To remedy these two problems, we propose system evaluation methods that involve multiple answer assessors and graded-relevance information retrieval metrics. Our main findings from experiments using the NTCIR-8 CQA task data are that, using our evaluation methods, (a) we can detect many substantial differences between systems that would have been overlooked by BA-based evaluation; and (b) we can better identify hard questions (i.e. those that are handled poorly by many systems and therefore require focussed investigation) compared to BA-based evaluation. We therefore argue that our approach is useful for building effective CQA answer selection systems despite the cost of manual answer assessments.
KW - Best answers
KW - Community question answering
KW - Evaluation
KW - Graded relevance
KW - NTCIR
KW - Test collections
UR - http://www.scopus.com/inward/record.url?scp=79952373644&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79952373644&partnerID=8YFLogxK
U2 - 10.1145/1935826.1935864
DO - 10.1145/1935826.1935864
M3 - Conference contribution
AN - SCOPUS:79952373644
SN - 9781450304931
T3 - Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011
SP - 187
EP - 196
BT - Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011
T2 - 4th ACM International Conference on Web Search and Data Mining, WSDM 2011
Y2 - 9 February 2011 through 12 February 2011
ER -