English and Taiwanese text categorization using N-gram based on Vector Space Model

Makoto Suzuki*, Naohide Yamagishi, Yi Ching Tsai, Takashi Ishida, Masayuki Goto

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

In this paper, we present a new mathematical model based on the Vector Space Model and consider its implications. We evaluate the proposed method, FRAM, in several experiments that classify newspaper articles from two data sets: the English Reuters-21578 data set, a benchmark for automatic text categorization, and the Taiwanese China Times 2005 data set. The results show that FRAM classifies English text accurately, with a micro-averaged F-measure of 94.5%, but reaches only 78.0% for Taiwanese. Although the proposed method is language-independent and provides a new perspective, improving its classification accuracy for Taiwanese remains future work.
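As a reading aid for two terms used in the abstract, the following is a minimal sketch of character N-gram extraction and the micro-averaged F-measure. It is not the authors' FRAM model, which the abstract does not specify. The function names (`char_ngrams`, `cosine`, `micro_f_measure`), the choice of N = 2, and the use of cosine similarity as the vector-space comparison are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=2):
    """Count overlapping character n-grams (hypothetical helper).

    Working on characters needs no word segmentation, which is one common
    way an N-gram approach can stay language-independent.
    """
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def micro_f_measure(per_category):
    """Micro-averaged F-measure from per-category (tp, fp, fn) counts.

    Counts are pooled over all categories before precision and recall
    are computed, so frequent categories weigh more than rare ones.
    """
    tp = sum(t for t, _, _ in per_category)
    fp = sum(f for _, f, _ in per_category)
    fn = sum(f for _, _, f in per_category)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

The same `char_ngrams` call works on English and on Chinese-script text, for example `char_ngrams("台北市")` yields the bigrams 台北 and 北市. The counts passed to `micro_f_measure` would come from a classifier's confusion counts; none of the figures reported in the abstract are reproduced here.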

Original language: English
Title of host publication: ISITA/ISSSTA 2010 - 2010 International Symposium on Information Theory and Its Applications
Pages: 106-111
Number of pages: 6
Publication status: Published - 2010 Dec 1
Event: 2010 20th International Symposium on Information Theory and Its Applications, ISITA 2010 and the 2010 20th International Symposium on Spread Spectrum Techniques and Applications, ISSSTA 2010 - Taichung, Taiwan, Province of China
Duration: 2010 Oct 17 → 2010 Oct 20

Publication series

Name: ISITA/ISSSTA 2010 - 2010 International Symposium on Information Theory and Its Applications

Conference

Conference: 2010 20th International Symposium on Information Theory and Its Applications, ISITA 2010 and the 2010 20th International Symposium on Spread Spectrum Techniques and Applications, ISSSTA 2010
Country/Territory: Taiwan, Province of China
City: Taichung
Period: 10/10/17 → 10/10/20

Keywords

  • Classification
  • N-gram
  • Newspaper
  • Text mining

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Information Systems
