TY - GEN
T1 - Capsule Network Over Pre-Trained Language Model and User Writing Styles for Authorship Attribution on Short Texts
AU - Huang, Zeping
AU - Iwaihara, Mizuho
N1 - Publisher Copyright:
© 2022 ACM.
PY - 2022/8/26
Y1 - 2022/8/26
N2 - Authorship Attribution (AA) is a sub-field of Authorship Analysis and text classification that attributes a text to the correct author among a closed set of potential authors. Since short texts usually contain less information about the author, authorship attribution on short texts is often more challenging than on long texts. Recently, the widespread use of pre-trained language models has greatly improved the accuracy of text classification tasks. In this paper, we propose a model that combines the pre-trained language model BERTweet with capsule networks to solve authorship attribution on tweets. BERTweet is the first large-scale domain-specific pre-trained language model for English tweets, and it can generate high-quality sentence representations of tweets. We combine BERTweet with capsule networks, which are particularly powerful at capturing deep features of sentence representations; together, BERTweet and capsule networks yield remarkable improvements on AA tasks. We also incorporate user writing styles into our model. We design new capsule network architectures that combine multiple capsule layers to generate representations from tweets and user writing styles, improving prediction accuracy and robustness. Our experimental results show that our BERTweet_Capsule_UWS combination achieves state-of-the-art results on a well-known tweet AA dataset.
AB - Authorship Attribution (AA) is a sub-field of Authorship Analysis and text classification that attributes a text to the correct author among a closed set of potential authors. Since short texts usually contain less information about the author, authorship attribution on short texts is often more challenging than on long texts. Recently, the widespread use of pre-trained language models has greatly improved the accuracy of text classification tasks. In this paper, we propose a model that combines the pre-trained language model BERTweet with capsule networks to solve authorship attribution on tweets. BERTweet is the first large-scale domain-specific pre-trained language model for English tweets, and it can generate high-quality sentence representations of tweets. We combine BERTweet with capsule networks, which are particularly powerful at capturing deep features of sentence representations; together, BERTweet and capsule networks yield remarkable improvements on AA tasks. We also incorporate user writing styles into our model. We design new capsule network architectures that combine multiple capsule layers to generate representations from tweets and user writing styles, improving prediction accuracy and robustness. Our experimental results show that our BERTweet_Capsule_UWS combination achieves state-of-the-art results on a well-known tweet AA dataset.
KW - Authorship attribution
KW - Capsule network
KW - Pre-trained language model
KW - Text classification
UR - http://www.scopus.com/inward/record.url?scp=85140096979&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85140096979&partnerID=8YFLogxK
U2 - 10.1145/3562007.3562027
DO - 10.1145/3562007.3562027
M3 - Conference contribution
AN - SCOPUS:85140096979
T3 - ACM International Conference Proceeding Series
SP - 104
EP - 110
BT - CCRIS 2022 - Conference Proceedings
PB - Association for Computing Machinery
T2 - 3rd International Conference on Control, Robotics and Intelligent System, CCRIS 2022
Y2 - 26 August 2022 through 28 August 2022
ER -