TY - GEN
T1 - Incorporating Query Reformulating Behavior into Web Search Evaluation
AU - Chen, Jia
AU - Liu, Yiqun
AU - Mao, Jiaxin
AU - Zhang, Fan
AU - Sakai, Tetsuya
AU - Ma, Weizhi
AU - Zhang, Min
AU - Ma, Shaoping
N1 - Funding Information:
This work is supported by the National Key Research and Development Program of China (2018YFC0831700), Natural Science Foundation of China (Grant No. 61732008, 61532011, 61902209, U2001212), Beijing Academy of Artificial Intelligence (BAAI), Tsinghua University Guoqiang Research Institute, Beijing Outstanding Young Scientist Program (NO. BJJWZYJH012019100020098) and Intelligent Social Governance Platform, Major Innovation & Planning Interdisciplinary Platform for the “Double-First Class” Initiative, Renmin University of China.
Publisher Copyright:
© 2021 ACM.
PY - 2021/10/26
Y1 - 2021/10/26
AB - While batch evaluation plays a central part in Information Retrieval (IR) research, most evaluation metrics are based on user models which mainly focus on browsing and clicking behaviors. As users' perceived satisfaction may also be impacted by their search intent, constructing different user models across various search intents may help design better evaluation metrics. However, user intents are usually unobservable in practice. As query reformulating behaviors may reflect their search intents to a certain extent and highly correlate with users' perceived satisfaction for a specific query, these observable factors may be beneficial for the design of evaluation metrics. How to incorporate the search intent behind query reformulation into user behavior and satisfaction models remains under-investigated. To investigate the relationships among query reformulations, search intent, and user satisfaction, we explore a publicly available web search dataset and find that query reformulations can be a good proxy for inferring user intent, and therefore, reformulating actions may be beneficial for designing better web search effectiveness metrics. A group of Reformulation-Aware Metrics (RAMs) is then proposed to improve existing click model-based metrics. Experimental results on two public session datasets have shown that RAMs have significantly higher correlations with user satisfaction than existing evaluation metrics. In the robustness test, we have found that RAMs can achieve good performance when only a small proportion of satisfaction training labels are available. We further show that RAMs can be directly applied to a new dataset for offline evaluation once trained. This work shows the possibility of designing better evaluation metrics by incorporating fine-grained search context factors.
KW - evaluation metrics
KW - query reformulation
KW - web search
UR - http://www.scopus.com/inward/record.url?scp=85119176023&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85119176023&partnerID=8YFLogxK
U2 - 10.1145/3459637.3482438
DO - 10.1145/3459637.3482438
M3 - Conference contribution
AN - SCOPUS:85119176023
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 171
EP - 180
BT - CIKM 2021 - Proceedings of the 30th ACM International Conference on Information and Knowledge Management
PB - Association for Computing Machinery
T2 - 30th ACM International Conference on Information and Knowledge Management, CIKM 2021
Y2 - 1 November 2021 through 5 November 2021
ER -