Building a Japanese typo dataset from Wikipedia’s revision history

Yu Tanaka, Yugo Murawaki, Daisuke Kawahara, Sadao Kurohashi

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

4 Citations (Scopus)

Abstract

User-generated texts contain many typos that must be corrected for NLP systems to work. Although a large number of typo–correction pairs are needed to develop a data-driven typo correction system, no such dataset is available for Japanese. In this paper, we extract over half a million Japanese typo–correction pairs from Wikipedia’s revision history. Unlike other languages, Japanese poses unique challenges: (1) Japanese texts are unsegmented, so we cannot simply apply a spelling checker, and (2) the way people input kanji logographs results in typos whose surface forms differ drastically from the correct ones. We address these challenges by combining character-based extraction rules, morphological analyzers to guess readings, and various filtering methods. We evaluate the dataset using crowdsourcing and run a baseline seq2seq model for typo correction.
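
As an illustration of what character-based extraction from two revisions of a sentence might look like, here is a minimal sketch. It is not the authors' rule set: the function name `extract_typo_pairs`, the `max_edit_len` threshold, and the use of Python's `difflib` are assumptions made for this example, and the sketch omits the reading-based analysis and filtering steps the abstract mentions.

```python
import difflib


def extract_typo_pairs(old: str, new: str, max_edit_len: int = 3):
    """Return (typo, correction) substring pairs from two revisions.

    Hypothetical helper: keeps only short character-level replacements,
    on the assumption that long rewrites are content edits, not typo fixes.
    """
    # Compare character by character, since Japanese text has no word spaces.
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        is_short = (i2 - i1) <= max_edit_len and (j2 - j1) <= max_edit_len
        if tag == "replace" and is_short:
            pairs.append((old[i1:i2], new[j1:j2]))
    return pairs


# A revision that fixes a single mistyped kana character.
print(extract_typo_pairs("私は学生てす", "私は学生です"))
```

Working at the character level sidesteps word segmentation, which is the first challenge the abstract names. Kanji conversion errors, where the typo and the correction share a reading but not a surface form, would still need the morphological analysis described in the abstract.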

Original language: English
Title of host publication: ACL 2020 - 58th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Student Research Workshop
Publisher: Association for Computational Linguistics (ACL)
Pages: 230-236
Number of pages: 7
ISBN (Electronic): 9781952148033
Publication status: Published - 2020
Event: 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020 - Student Research Workshop, SRW 2020 - Virtual, Online, United States
Duration: 2020 Jul 5 to 2020 Jul 10

Publication series

Name: Proceedings of the Annual Meeting of the Association for Computational Linguistics
ISSN (Print): 0736-587X

Conference

Conference: 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020 - Student Research Workshop, SRW 2020
Country/Territory: United States
City: Virtual, Online
Period: 20/7/5 to 20/7/10

ASJC Scopus subject areas

  • Computer Science Applications
  • Linguistics and Language
  • Language and Linguistics
