Hybrid Phishing URL Detection Using Segmented Word Embedding

Eint Sandi Aung*, Hayato Yamana

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Phishing is a type of cybercrime in which attackers steal sensitive information. This paper focuses on URL-based phishing detection, i.e., detecting phishing webpages by analyzing the URL. Previously proposed methods have tackled this problem; however, insufficient word tokenization of URLs gives rise to unknown words, which degrades detection accuracy. To solve the unknown-word problem, we propose a new tokenization algorithm, called URL-Tokenizer, which integrates the BERT and WordSegment tokenizers, in addition to utilizing 24 NLP features. We then apply URL-Tokenizer to a DNN-CNN hybrid model to improve detection accuracy. Our experiment using the Ebbu2017 dataset confirmed that our word-DNN-CNN achieves an AUC of 99.89%, compared with 98.78% for the state-of-the-art DNN-BiLSTM.
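The unknown-word problem mentioned in the abstract can be illustrated with a small sketch. This is not the paper's URL-Tokenizer, which combines the BERT and WordSegment tokenizers; it is a toy dictionary-based segmenter, and the vocabulary, the penalty value, and the function names `segment` and `tokenize_url` are all assumptions made for illustration. It shows how a concatenated URL token such as "paypallogin", which a plain delimiter split would leave as one unknown word, might be broken into known words.

```python
import re

# Hypothetical toy vocabulary (an assumption); a real system would rely on
# a large lexicon or a pretrained tokenizer instead.
VOCAB = {
    "http", "https", "www", "com", "pay", "pal", "paypal",
    "login", "secure", "update", "verify", "account",
}

# Assumed per-character penalty for text that matches no vocabulary word.
UNKNOWN_CHAR_PENALTY = 2


def segment(token, vocab=VOCAB):
    """Split one delimiter-free token into the cheapest sequence of pieces.

    A known word costs 1; an unknown piece costs 1 plus a per-character
    penalty, so known words are preferred whenever they fit.
    """
    n = len(token)
    # best[i] holds (cost, pieces) for the prefix token[:i].
    best = [(0, [])] + [None] * n
    for i in range(1, n + 1):
        candidates = []
        for j in range(i):
            piece = token[j:i]
            prev_cost, prev_pieces = best[j]
            if piece in vocab:
                cost = prev_cost + 1
            else:
                cost = prev_cost + 1 + UNKNOWN_CHAR_PENALTY * len(piece)
            candidates.append((cost, prev_pieces + [piece]))
        best[i] = min(candidates, key=lambda c: c[0])
    return best[n][1]


def tokenize_url(url):
    """Lowercase the URL, split on non-alphanumerics, then segment each part."""
    parts = [p for p in re.split(r"[^a-z0-9]+", url.lower()) if p]
    words = []
    for part in parts:
        words.extend(segment(part))
    return words


if __name__ == "__main__":
    print(tokenize_url("http://paypallogin.secure-update.com/verifyaccount"))
```

Text that matches no vocabulary entry is kept as a single unknown piece rather than being split into characters, so a downstream embedding layer would still see it as one out-of-vocabulary token.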

Original language: English
Title of host publication: Information Integration and Web Intelligence - 24th International Conference, iiWAS 2022, Proceedings
Editors: Eric Pardede, Pari Delir Haghighi, Ismail Khalil, Gabriele Kotsis
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 507-518
Number of pages: 12
ISBN (Print): 9783031210464
Publication status: Published - 2022
Event: 24th International Conference on Information Integration and Web Intelligence, iiWAS 2022, held in conjunction with the 20th International Conference on Advances in Mobile Computing and Multimedia Intelligence, MoMM 2022 - Virtual, Online
Duration: 2022 Nov 28 to 2022 Nov 30

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13635 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 24th International Conference on Information Integration and Web Intelligence, iiWAS 2022, held in conjunction with the 20th International Conference on Advances in Mobile Computing and Multimedia Intelligence, MoMM 2022
City: Virtual, Online
Period: 22/11/28 to 22/11/30

Keywords

  • Phishing URL detection
  • Word embedding
  • Word segmentation

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)
