TY - GEN
T1 - Contribution of Improved Character Embedding and Latent Posting Styles to Authorship Attribution of Short Texts
AU - Huang, Wenjing
AU - Su, Rui
AU - Iwaihara, Mizuho
N1 - Publisher Copyright: © 2020, Springer Nature Switzerland AG.
PY - 2020
Y1 - 2020
N2 - Text content generated on social networking platforms tends to be short. Authorship attribution on short texts aims to determine the author of a given collection of short posts, and is more challenging than attribution on long texts. Because such texts are sparse and use informal terms, we propose a method of learning text representations from a mixture of words and character n-grams, used as input to deep neural network architectures. In this way we make full use of user mentions and topic mentions in posts. We also consider implicit characteristics of the text and incorporate ten latent posting styles into the models. Our experimental evaluations on tweets show a significant improvement over baselines. We achieve a best accuracy of 83.6%, a 7.5% improvement over the state of the art. Further experiments with an increasing number of authors also demonstrate the superiority of our models.
AB - Text content generated on social networking platforms tends to be short. Authorship attribution on short texts aims to determine the author of a given collection of short posts, and is more challenging than attribution on long texts. Because such texts are sparse and use informal terms, we propose a method of learning text representations from a mixture of words and character n-grams, used as input to deep neural network architectures. In this way we make full use of user mentions and topic mentions in posts. We also consider implicit characteristics of the text and incorporate ten latent posting styles into the models. Our experimental evaluations on tweets show a significant improvement over baselines. We achieve a best accuracy of 83.6%, a 7.5% improvement over the state of the art. Further experiments with an increasing number of authors also demonstrate the superiority of our models.
KW - Authorship attribution
KW - CNN
KW - Character n-grams
KW - LSTM
KW - Latent posting styles
KW - Short texts
KW - Social network platforms
UR - http://www.scopus.com/inward/record.url?scp=85093850874&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85093850874&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-60290-1_20
DO - 10.1007/978-3-030-60290-1_20
M3 - Conference contribution
AN - SCOPUS:85093850874
SN - 9783030602895
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 261
EP - 269
BT - Web and Big Data - 4th International Joint Conference, APWeb-WAIM 2020, Proceedings
A2 - Wang, Xin
A2 - Zhang, Rui
A2 - Lee, Young-Koo
A2 - Sun, Le
A2 - Moon, Yang-Sae
PB - Springer Science and Business Media Deutschland GmbH
T2 - 4th Asia-Pacific Web and Web-Age Information Management, Joint Conference on Web and Big Data, APWeb-WAIM 2020
Y2 - 18 September 2020 through 20 September 2020
ER -