Korean text categorization using the character N-gram

Makoto Suzuki*, Naohide Yamagishi, Masayuki Goto

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

We previously proposed the accumulation method, a language-independent text classification method based on the character N-gram, and used it to classify English and Japanese text documents. Because the accumulation method forms its index terms from character N-grams, it does not depend on the structure of the language. As long as the text documents are encoded in Unicode, the same algorithm can classify them. In the present paper, we improve the method and apply it to Korean text documents, namely newspaper articles from the Korean Hankyoreh 2008 data set. The highest macro-averaged F-measure obtained by the improved method on this data set is 90.2%, which indicates that the method works well for Korean. In addition, we demonstrate an improvement in classification accuracy for English. Finally, we discuss the qualitative meaning of the accumulation method.
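To make the two technical notions in the abstract concrete, the following is a minimal sketch of (a) extracting character N-grams from Unicode text as candidate index terms and (b) computing a macro-averaged F-measure. It is not the accumulation method itself, which the abstract does not specify. The function names `char_ngrams`, `index_terms`, and `macro_f1` are hypothetical, the NFC normalization step is an assumption added here so that each Hangul syllable is a single code point, and the macro average shown (unweighted mean of per-category F1) is one common definition that may differ from the one used in the paper.

```python
import unicodedata
from collections import Counter


def char_ngrams(text: str, n: int) -> list[str]:
    """Return all overlapping character n-grams of text (one character = one code point)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def index_terms(text: str, n: int) -> Counter:
    """Count character n-grams of the NFC-normalized text.

    Assumption: NFC normalization is applied so that a Hangul syllable
    stored as decomposed jamo is recombined into a single code point.
    """
    normalized = unicodedata.normalize("NFC", text)
    return Counter(char_ngrams(normalized, n))


def macro_f1(gold: list[str], predicted: list[str]) -> float:
    """Macro-averaged F-measure: unweighted mean of the per-category F1 scores."""
    categories = sorted(set(gold) | set(predicted))
    scores = []
    for c in categories:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == c and p == c)
        fp = sum(1 for g, p in pairs if g != c and p == c)
        fn = sum(1 for g, p in pairs if g == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the category never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0
```

The same `char_ngrams` call works unchanged on English, Japanese, or Korean strings, since it needs no tokenizer or morphological analysis. For example, `char_ngrams("한겨레", 2)` yields the two bigrams `"한겨"` and `"겨레"`.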

Original language: English
Title of host publication: 7th International Conference on Information Technology and Application, ICITA 2011
Pages: 197-202
Number of pages: 6
Publication status: Published - 2011 Dec 1
Event: 7th International Conference on Information Technology and Application, ICITA 2011 - Sydney, NSW, Australia
Duration: 2011 Nov 21 to 2011 Nov 24

Publication series

Name: 7th International Conference on Information Technology and Application, ICITA 2011

Conference

Conference: 7th International Conference on Information Technology and Application, ICITA 2011
Country/Territory: Australia
City: Sydney, NSW
Period: 11/11/21 to 11/11/24

Keywords

  • Classification
  • N-gram
  • Newspaper
  • Text mining

ASJC Scopus subject areas

  • Computer Science (all)
