TY - GEN
T1 - URL-based phishing detection using the entropy of non-alphanumeric characters
AU - Aung, Eint Sandi
AU - Yamana, Hayato
N1 - Publisher Copyright: © 2019 Association for Computing Machinery.
PY - 2019/12/2
Y1 - 2019/12/2
N2 - Phishing is a type of personal information theft in which phishers lure users to steal sensitive information. Phishing detection mechanisms using various techniques have been developed. Our hypothesis is that phishers create fake websites with as little information as possible in a webpage, which makes detection difficult for content- and visual similarity-based methods that analyze the webpage content. To overcome this, we focus on the use of Uniform Resource Locators (URLs) to detect phishing. Since previous work extracts specific special-character features, we assume that non-alphanumeric (NAN) character distributions highly impact the performance of URL-based detection. We hence propose a new feature called the entropy of NAN characters for URL-based phishing detection. Experimental evaluation with balanced and imbalanced datasets shows 96% ROC AUC on the balanced dataset and 89% ROC AUC on the imbalanced dataset, an increase in ROC AUC of 5 to 6% over detection without our proposed feature.
AB - Phishing is a type of personal information theft in which phishers lure users to steal sensitive information. Phishing detection mechanisms using various techniques have been developed. Our hypothesis is that phishers create fake websites with as little information as possible in a webpage, which makes detection difficult for content- and visual similarity-based methods that analyze the webpage content. To overcome this, we focus on the use of Uniform Resource Locators (URLs) to detect phishing. Since previous work extracts specific special-character features, we assume that non-alphanumeric (NAN) character distributions highly impact the performance of URL-based detection. We hence propose a new feature called the entropy of NAN characters for URL-based phishing detection. Experimental evaluation with balanced and imbalanced datasets shows 96% ROC AUC on the balanced dataset and 89% ROC AUC on the imbalanced dataset, an increase in ROC AUC of 5 to 6% over detection without our proposed feature.
KW - Detection
KW - Phishing
KW - URL
KW - Webpage
UR - http://www.scopus.com/inward/record.url?scp=85123041829&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85123041829&partnerID=8YFLogxK
U2 - 10.1145/3366030.3366064
DO - 10.1145/3366030.3366064
M3 - Conference contribution
AN - SCOPUS:85123041829
T3 - ACM International Conference Proceeding Series
BT - 21st International Conference on Information Integration and Web-Based Applications and Services, iiWAS 2019 - Proceedings
A2 - Indrawan-Santiago, Maria
A2 - Pardede, Eric
A2 - Salvadori, Ivan Luiz
A2 - Steinbauer, Matthias
A2 - Khalil, Ismail
A2 - Anderst-Kotsis, Gabriele
PB - Association for Computing Machinery
T2 - 21st International Conference on Information Integration and Web-Based Applications and Services, iiWAS 2019
Y2 - 2 December 2019 through 4 December 2019
ER -