TY - JOUR
T1 - Exploration into gray area
T2 - Toward efficient labeling for detecting malicious domain names
AU - Fukushi, Naoki
AU - Chiba, Daiki
AU - Akiyama, Mitsuaki
AU - Uchida, Masato
N1 - Funding Information: This work was supported in part by the Japan Society for the Promotion of Science through Grants-in-Aid for Scientific Research (C) (17K00135). Publisher Copyright: Copyright © 2020 The Institute of Electronics, Information and Communication Engineers.
PY - 2020
Y1 - 2020
AB - In this paper, we propose a method to reduce the labeling cost of acquiring training data for a malicious domain name detection system based on supervised machine learning. In conventional systems, training a classifier with high classification accuracy requires large quantities of benign and malicious domain names as training data. In general, malicious domain names are observed less frequently than benign domain names, so it is difficult to acquire a large number of malicious domain names without a dedicated labeling method. We propose a method based on active learning that labels data around the decision boundary of classification, i.e., in the gray area, and we show that the classification accuracy can be improved by using approximately 1% of the training data used by conventional systems. Another disadvantage of conventional systems is that if the classifier is trained with a small amount of training data, its generalization ability cannot be guaranteed. We propose a method based on ensemble learning that integrates multiple classifiers, and we show that the classification accuracy can be stabilized and improved. Combining the two proposed methods allows us to develop a new system for malicious domain name detection that achieves high classification accuracy and generalization ability while labeling only a small amount of training data.
KW - Active learning
KW - Data labeling
KW - Ensemble learning
KW - Malicious domain name
UR - http://www.scopus.com/inward/record.url?scp=85082749397&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85082749397&partnerID=8YFLogxK
U2 - 10.1587/transcom.2019NRP0005
DO - 10.1587/transcom.2019NRP0005
M3 - Article
AN - SCOPUS:85082749397
SN - 0916-8516
VL - 103
SP - 375
EP - 388
JO - IEICE Transactions on Communications
JF - IEICE Transactions on Communications
IS - 4
ER -