TY - GEN
T1 - Korean text categorization using the character N-gram
AU - Suzuki, Makoto
AU - Yamagishi, Naohide
AU - Goto, Masayuki
PY - 2011/12/1
Y1 - 2011/12/1
N2 - We previously proposed the accumulation method, a language-independent text classification method based on character N-grams, and used it to classify English and Japanese text documents. The accumulation method does not depend on language structure because it uses character N-grams to form index terms. As long as text documents are encoded in Unicode, the accumulation method can classify them using the same algorithm. In the present paper, we improve the proposed method and classify Korean text documents, namely newspaper articles from the Korean Hankyoreh 2008 data set. The highest macro-averaged F-measure of the proposed method on this data set is 90.2%, a good result for Korean. In addition, we demonstrate an improvement in classification accuracy for English. Finally, we discuss the qualitative meaning of the accumulation method.
AB - We previously proposed the accumulation method, a language-independent text classification method based on character N-grams, and used it to classify English and Japanese text documents. The accumulation method does not depend on language structure because it uses character N-grams to form index terms. As long as text documents are encoded in Unicode, the accumulation method can classify them using the same algorithm. In the present paper, we improve the proposed method and classify Korean text documents, namely newspaper articles from the Korean Hankyoreh 2008 data set. The highest macro-averaged F-measure of the proposed method on this data set is 90.2%, a good result for Korean. In addition, we demonstrate an improvement in classification accuracy for English. Finally, we discuss the qualitative meaning of the accumulation method.
KW - Classification
KW - N-gram
KW - Newspaper
KW - Text mining
UR - http://www.scopus.com/inward/record.url?scp=84868149485&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84868149485&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:84868149485
SN - 9780980326741
T3 - 7th International Conference on Information Technology and Application, ICITA 2011
SP - 197
EP - 202
BT - 7th International Conference on Information Technology and Application, ICITA 2011
T2 - 7th International Conference on Information Technology and Application, ICITA 2011
Y2 - 21 November 2011 through 24 November 2011
ER -