TY - GEN
T1 - Hybrid Phishing URL Detection Using Segmented Word Embedding
AU - Aung, Eint Sandi
AU - Yamana, Hayato
N1 - Funding Information:
Acknowledgement. This work was partially supported by JST SPRING Grant Number JPMJSP2128.
Publisher Copyright:
© 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2022
Y1 - 2022
N2 - Phishing is a type of cybercrime in which attackers steal sensitive information. This paper focuses on URL-based phishing detection, i.e., detecting phishing webpages by analyzing their URLs. Previously proposed methods have tackled this problem; however, insufficient word tokenization of URLs gives rise to unknown words, which degrades detection accuracy. To solve the unknown-word problem, we propose a new tokenization algorithm, called URL-Tokenizer, which integrates the BERT and WordSegment tokenizers, in addition to utilizing 24 NLP features. We then apply URL-Tokenizer to a DNN-CNN hybrid model to improve detection accuracy. Our experiment on the Ebbu2017 dataset confirmed that our word-DNN-CNN achieves an AUC of 99.89%, compared with 98.78% for the state-of-the-art DNN-BiLSTM.
KW - Phishing URL detection
KW - Word embedding
KW - Word segmentation
UR - http://www.scopus.com/inward/record.url?scp=85145009843&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85145009843&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-21047-1_46
DO - 10.1007/978-3-031-21047-1_46
M3 - Conference contribution
AN - SCOPUS:85145009843
SN - 9783031210464
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 507
EP - 518
BT - Information Integration and Web Intelligence - 24th International Conference, iiWAS 2022, Proceedings
A2 - Pardede, Eric
A2 - Delir Haghighi, Pari
A2 - Khalil, Ismail
A2 - Kotsis, Gabriele
PB - Springer Science and Business Media Deutschland GmbH
T2 - 24th International Conference on Information Integration and Web Intelligence, iiWAS 2022, held in conjunction with the 20th International Conference on Advances in Mobile Computing and Multimedia Intelligence, MoMM 2022
Y2 - 28 November 2022 through 30 November 2022
ER -