Automating URL blacklist generation with similarity search approach

Bo Sun, Mitsuaki Akiyama, Takeshi Yagi, Mitsuhiro Hatada, Tatsuya Mori

Research output: Contribution to journalArticlepeer-review

16 Citations (Scopus)


Modern web users may encounter a browser security threat called drive-by-download attacks when surfing on the Internet. Drive-by-download attacks make use of exploit codes to take control of user's web browser. Many web users do not take such underlying threats into account while clicking URLs. URL Blacklist is one of the practical approaches to thwarting browser-targeted attacks. However, URL Blacklist cannot cope with previously unseen malicious URLs. Therefore, to make a URL blacklist effective, it is crucial to keep the URLs updated. Given these observations, we propose a framework called automatic blacklist generator (AutoBLG) that automates the collection of new malicious URLs by starting from a given existing URL blacklist. The primary mechanism of AutoBLG is expanding the search space of web pages while reducing the amount of URLs to be analyzed by applying several pre-filters such as similarity search to accelerate the process of generating blacklists. AutoBLG consists of three primary components: URL expansion, URL filtration, and URL verification. Through extensive analysis using a high-performance web client honeypot, we demonstrate that AutoBLG can successfully discover new and previously unknown drive-by-download URLs from the vast web space.

Original languageEnglish
Pages (from-to)873-882
Number of pages10
JournalIEICE Transactions on Information and Systems
Issue number4
Publication statusPublished - 2016 Apr


  • Drive-by-download
  • Machine learning
  • Search space
  • URL blacklist
  • Web client honeypot

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Computer Vision and Pattern Recognition
  • Electrical and Electronic Engineering
  • Artificial Intelligence


Dive into the research topics of 'Automating URL blacklist generation with similarity search approach'. Together they form a unique fingerprint.

Cite this