A proposal of extended cosine measure for distance metric learning in text classification

Kenta Mikawa*, Takashi Ishida, Masayuki Goto

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

18 Citations (Scopus)

Abstract

This paper discusses a new similarity measure between documents in a vector space model from the viewpoint of distance metric learning. Documents are represented as points in the vector space using the frequencies of the words appearing in each document. A similarity measure between two documents is useful for recognizing their relationship and can be applied to classification or clustering of the data. Usually, the cosine similarity and the Euclidean distance are used to measure the similarity between points in Euclidean space. However, when the vector space model is applied to document analysis, these measures do not take into account the correlation among the words that appear in documents. Generally speaking, many words that appear in documents are correlated with one another depending on sentence structures, topics, and subjects. Therefore, to improve the performance of document classification and clustering, it is effective to build a suitable metric on the vector space that takes the correlation of words into consideration. This paper presents a new, effective method for acquiring a distance measure on the document vector space based on an extended cosine measure. In addition, a distance metric learning procedure is proposed to acquire the proper metric from the viewpoint of supervised learning. The effectiveness of the proposal is clarified by simulation experiments on text classification problems using customer reviews posted on a web site and newspaper articles.
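The abstract does not state the formula for the extended cosine measure, so the following is only an illustrative sketch of one common way to make a cosine-type similarity sensitive to word correlation: replacing the standard inner product with one induced by a symmetric positive definite matrix M. The function names, the toy vocabulary, and the matrix values are assumptions made for illustration and are not taken from the paper; in a metric learning setting M would be estimated from labeled data rather than set by hand.

```python
import math

def bilinear(x, M, y):
    """Return x^T M y for list-based vectors x, y and a square matrix M."""
    return sum(x[i] * M[i][j] * y[j]
               for i in range(len(x))
               for j in range(len(y)))

def generalized_cosine(x, y, M):
    """Cosine similarity under the inner product induced by M.

    M is assumed to be symmetric positive definite (an assumption of
    this sketch, not a definition quoted from the paper).
    """
    denom = math.sqrt(bilinear(x, M, x) * bilinear(y, M, y))
    if denom == 0.0:
        raise ValueError("similarity is undefined for a zero-norm vector")
    return bilinear(x, M, y) / denom

# Hypothetical three-word vocabulary: ["good", "great", "bad"].
doc_a = [1.0, 0.0, 0.0]  # contains only "good"
doc_b = [0.0, 1.0, 0.0]  # contains only "great"

# The identity matrix recovers the ordinary cosine similarity.
identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]

# Hand-picked matrix that treats "good" and "great" as correlated.
correlated = [[1.0, 0.8, 0.0],
              [0.8, 1.0, 0.0],
              [0.0, 0.0, 1.0]]

plain = generalized_cosine(doc_a, doc_b, identity)
extended = generalized_cosine(doc_a, doc_b, correlated)
print(plain, extended)
```

With the identity matrix the two documents share no words and are judged completely dissimilar, whereas the correlated matrix assigns them a positive similarity. This is the kind of effect the abstract attributes to taking word correlation into account.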

Original language: English
Title of host publication: 2011 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2011 - Conference Digest
Pages: 1741-1746
Number of pages: 6
Publication status: Published - 2011
Event: 2011 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2011 - Anchorage, AK, United States
Duration: 2011 Oct 9 - 2011 Oct 12

Publication series

Name: Conference Proceedings - IEEE International Conference on Systems, Man and Cybernetics
ISSN (Print): 1062-922X

Other

Other: 2011 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2011
Country/Territory: United States
City: Anchorage, AK
Period: 11/10/9 - 11/10/12

Keywords

  • extended cosine measure
  • metric learning
  • similarity measure
  • text mining
  • vector space model

ASJC Scopus subject areas

  • Electrical and Electronic Engineering
  • Control and Systems Engineering
  • Human-Computer Interaction
