TY - JOUR
T1 - Binary document classification based on fast flux discriminant with similarity measure on word set
AU - Okubo, Keisuke
AU - Kumoi, Gendo
AU - Goto, Masayuki
N1 - Funding Information:
The authors would like to thank the anonymous referees for their useful suggestions. The authors would also like to thank all members of Goto Laboratory, Waseda University, for their support of our research. A portion of this study was supported by JSPS KAKENHI Grant Numbers 26282090 and 26560167.
Publisher Copyright:
© 2019 KIIE.
PY - 2019
Y1 - 2019
N2 - Fast Flux Discriminant (FFD) is a high-performance nonlinear binary classifier that can construct a classification model accounting for interactions between variables. To capture these interactions, FFD applies histogram-based kernel smoothing to subspaces formed from combinations of variables. However, when creating subspaces, the original FFD must cover all variables, including combinations of variables with low interaction. As a result, the amount of calculation increases exponentially with the number of dimensions. In this study, we calculate the similarity between variables using KL divergence and, based on the obtained similarities, divide the variables into subspaces of similar variables. By using only combinations of variables that are likely to interact strongly, the proposed method aims to reduce the amount of calculation while maintaining classification accuracy. Simulation experiments with Japanese newspaper articles demonstrate the effectiveness of the proposed method.
AB - Fast Flux Discriminant (FFD) is a high-performance nonlinear binary classifier that can construct a classification model accounting for interactions between variables. To capture these interactions, FFD applies histogram-based kernel smoothing to subspaces formed from combinations of variables. However, when creating subspaces, the original FFD must cover all variables, including combinations of variables with low interaction. As a result, the amount of calculation increases exponentially with the number of dimensions. In this study, we calculate the similarity between variables using KL divergence and, based on the obtained similarities, divide the variables into subspaces of similar variables. By using only combinations of variables that are likely to interact strongly, the proposed method aims to reduce the amount of calculation while maintaining classification accuracy. Simulation experiments with Japanese newspaper articles demonstrate the effectiveness of the proposed method.
KW - Binary classification
KW - Interaction
KW - KL divergence
KW - Similarity
KW - Text data
UR - http://www.scopus.com/inward/record.url?scp=85069958648&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85069958648&partnerID=8YFLogxK
U2 - 10.7232/iems.2019.18.2.245
DO - 10.7232/iems.2019.18.2.245
M3 - Article
AN - SCOPUS:85069958648
SN - 1598-7248
VL - 18
SP - 245
EP - 251
JO - Industrial Engineering and Management Systems
JF - Industrial Engineering and Management Systems
IS - 2
ER -