TY - GEN
T1 - Building a Japanese typo dataset from Wikipedia’s revision history
AU - Tanaka, Yu
AU - Murawaki, Yugo
AU - Kawahara, Daisuke
AU - Kurohashi, Sadao
N1 - Publisher Copyright: © 2020 Association for Computational Linguistics.
PY - 2020
Y1 - 2020
AB - User-generated texts contain many typos, which must be corrected for NLP systems to work. Although a large number of typo–correction pairs are needed to develop a data-driven typo correction system, no such dataset is available for Japanese. In this paper, we extract over half a million Japanese typo–correction pairs from Wikipedia’s revision history. Compared with other languages, Japanese poses unique challenges: (1) Japanese texts are unsegmented, so we cannot simply apply a spelling checker, and (2) the way people input kanji logographs results in typos whose surface forms differ drastically from the correct ones. We address these challenges by combining character-based extraction rules, morphological analyzers to guess readings, and various filtering methods. We evaluate the dataset using crowdsourcing and run a baseline seq2seq model for typo correction.
UR - http://www.scopus.com/inward/record.url?scp=85117930931&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85117930931&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85117930931
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 230
EP - 236
BT - ACL 2020 - 58th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Student Research Workshop
PB - Association for Computational Linguistics (ACL)
T2 - 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020 - Student Research Workshop, SRW 2020
Y2 - 5 July 2020 through 10 July 2020
ER -