TY - GEN
T1 - Segmentation-based Phishing URL Detection
AU - Aung, Eint Sandi
AU - Yamana, Hayato
N1 - Funding Information:
This work was supported by JSPS KAKENHI (Grant Number 17KT0085).
Publisher Copyright:
© 2021 ACM.
PY - 2021/12/14
Y1 - 2021/12/14
N2 - Uniform resource locators (URLs), used for referencing web pages, play a vital role in cyber fraud because of their complicated structure; phishers, i.e., attackers, employ tricky bypassing techniques to deceive users. Thus, information extracted from URLs might indicate significant and meaningful patterns essential for phishing detection. To enhance the accuracy of URL-based phishing detection, we need an accurate word segmentation technique to split URLs correctly. However, in contrast to traditional word segmentation techniques used in natural language processing (NLP), URL segmentation requires meticulous attention, as tokenization, the process of turning meaningless data into meaningful data, is not as easy to apply as in NLP. In our work, we concentrate on URL segmentation and propose a novel tokenization method, named URL-Tokenizer, that combines the BERT tokenizer and the WordSegment tokenizer while adopting character-level and word-level segmentation simultaneously. Our experimental evaluations on phishing URL detection show that the proposed method achieves a high accuracy of 95.7% on a balanced dataset and 97.7% on an imbalanced dataset, whereas the baseline models achieve 85.4% on a balanced dataset and 85.1% on an imbalanced dataset.
AB - Uniform resource locators (URLs), used for referencing web pages, play a vital role in cyber fraud because of their complicated structure; phishers, i.e., attackers, employ tricky bypassing techniques to deceive users. Thus, information extracted from URLs might indicate significant and meaningful patterns essential for phishing detection. To enhance the accuracy of URL-based phishing detection, we need an accurate word segmentation technique to split URLs correctly. However, in contrast to traditional word segmentation techniques used in natural language processing (NLP), URL segmentation requires meticulous attention, as tokenization, the process of turning meaningless data into meaningful data, is not as easy to apply as in NLP. In our work, we concentrate on URL segmentation and propose a novel tokenization method, named URL-Tokenizer, that combines the BERT tokenizer and the WordSegment tokenizer while adopting character-level and word-level segmentation simultaneously. Our experimental evaluations on phishing URL detection show that the proposed method achieves a high accuracy of 95.7% on a balanced dataset and 97.7% on an imbalanced dataset, whereas the baseline models achieve 85.4% on a balanced dataset and 85.1% on an imbalanced dataset.
KW - Information extraction
KW - Phishing URL detection
KW - Word segmentation
UR - http://www.scopus.com/inward/record.url?scp=85128677541&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85128677541&partnerID=8YFLogxK
U2 - 10.1145/3486622.3493983
DO - 10.1145/3486622.3493983
M3 - Conference contribution
AN - SCOPUS:85128677541
T3 - ACM International Conference Proceeding Series
SP - 550
EP - 556
BT - Proceedings - 2021 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, WI-IAT 2021
PB - Association for Computing Machinery
T2 - 2021 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, WI-IAT 2021
Y2 - 14 December 2021 through 17 December 2021
ER -