TY - JOUR
T1 - Combating the infodemic
T2 - A Chinese infodemic dataset for misinformation identification
AU - Luo, Jia
AU - Xue, Rui
AU - Hu, Jinglu
AU - El Baz, Didier
N1 - Funding Information:
Funding: This work is supported by the Beijing Municipal Education Commission (Grant No. SM202110005011 and Grant No. SM202010005004), the International Research Cooperation Seed Fund of Beijing University of Technology (Grant No. B38) and the Japan Society for the Promotion of Science (Grant No. P19800).
Publisher Copyright:
© 2021 by the authors. Licensee MDPI, Basel, Switzerland.
PY - 2021/8
Y1 - 2021/8
N2 - Misinformation posted on social media during COVID-19 is a main example of infodemic data. This phenomenon was prominent in China at the beginning of the COVID-19 outbreak. While a large amount of data can be collected from various social media platforms, publicly available infodemic detection data remain rare and are not easy to construct manually. Therefore, instead of developing techniques for infodemic detection, this paper constructs a Chinese infodemic dataset, “infodemic 2019”, by collecting widely spread Chinese infodemic records during the COVID-19 outbreak. Each record is labeled as true, false or questionable. After four adjustments, the original imbalanced dataset is converted into a balanced dataset by exploring the properties of the collected records. The final labels achieve high intercoder reliability with healthcare workers’ annotations, and the high-frequency words show a strong relationship between the proposed dataset and pandemic diseases. Finally, numerical experiments are carried out with RNN, CNN and fastText. All of them achieve reasonable performance and provide baselines for future work.
KW - COVID-19
KW - Deep learning
KW - Infodemic data
KW - Misinformation identification
UR - http://www.scopus.com/inward/record.url?scp=85113950428&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85113950428&partnerID=8YFLogxK
U2 - 10.3390/healthcare9091094
DO - 10.3390/healthcare9091094
M3 - Article
AN - SCOPUS:85113950428
SN - 2227-9032
VL - 9
JO - Healthcare (Switzerland)
JF - Healthcare (Switzerland)
IS - 9
M1 - 1094
ER -