TY - GEN
T1 - A proposal of extended cosine measure for distance metric learning in text classification
AU - Mikawa, Kenta
AU - Ishida, Takashi
AU - Goto, Masayuki
PY - 2011
Y1 - 2011
AB - This paper discusses a new similarity measure between documents in a vector space model from the viewpoint of distance metric learning. Documents are represented as points in the vector space using the frequencies of the words appearing in each document. A similarity measure between two documents is useful for recognizing their relationship and can be applied to classification or clustering of the data. Usually, the cosine similarity and the Euclidean distance are used to measure the similarity between points in Euclidean space. However, when the vector space model is applied to document analysis, these measures do not take into account the correlation among the words that appear in documents. Generally speaking, many words that appear in documents are correlated with one another, depending on sentence structures, topics, and subjects. Therefore, building a suitable metric on the vector space that takes word correlation into account is effective for improving the performance of document classification and clustering. This paper presents a new method for acquiring a distance measure on the document vector space based on an extended cosine measure. In addition, a distance metric learning procedure is proposed to acquire a proper metric from the viewpoint of supervised learning. The effectiveness of the proposal is demonstrated by simulation experiments on text classification problems using customer reviews posted on a web site and newspaper articles.
KW - extended cosine measure
KW - metric learning
KW - similarity measure
KW - text mining
KW - vector space model
UR - http://www.scopus.com/inward/record.url?scp=83755186800&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=83755186800&partnerID=8YFLogxK
U2 - 10.1109/ICSMC.2011.6083923
DO - 10.1109/ICSMC.2011.6083923
M3 - Conference contribution
AN - SCOPUS:83755186800
SN - 9781457706523
T3 - Conference Proceedings - IEEE International Conference on Systems, Man and Cybernetics
SP - 1741
EP - 1746
BT - 2011 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2011 - Conference Digest
T2 - 2011 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2011
Y2 - 9 October 2011 through 12 October 2011
ER -