TY - GEN
T1 - History-enhanced focused website segment crawler
AU - Suebchua, Tanaphol
AU - Manaskasemsak, Bundit
AU - Rungsawang, Arnon
AU - Yamana, Hayato
N1 - Publisher Copyright: © 2018 IEEE.
PY - 2018/4/19
Y1 - 2018/4/19
AB - The primary challenge in focused crawling research is how to efficiently utilize computing resources, e.g., bandwidth, disk space, and time, to find as many web pages related to a specific topic as possible. To meet this challenge, we previously introduced a machine-learning-based focused crawler that aims to crawl a group of relevant web pages located in the same directory path, called a website segment, and has achieved high efficiency so far. One of the limitations of our previous approach is that it may repeatedly visit a website that does not serve any relevant website segments, in the scenario where the website segments share the same linkage characteristics as the relevant ones in the training dataset. In this paper, we propose a 'history-enhanced focused website segment crawler' to solve the problem. The idea behind it is that the priority score of an unvisited website segment should be reduced if the crawler has consecutively downloaded many irrelevant web pages from the website. To implement this idea, we propose a new prediction feature, called the 'history feature', that is extracted from the recent crawling results, i.e., relevant and irrelevant web pages gathered from the target website. Our experiment shows that our newly proposed feature could improve the crawling efficiency of our focused crawler by a maximum of approximately 5%.
KW - Focused crawler
KW - Machine learning
KW - Topic-specific web crawler
KW - Vertical search engine
UR - http://www.scopus.com/inward/record.url?scp=85046998816&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85046998816&partnerID=8YFLogxK
U2 - 10.1109/ICOIN.2018.8343090
DO - 10.1109/ICOIN.2018.8343090
M3 - Conference contribution
AN - SCOPUS:85046998816
T3 - International Conference on Information Networking
SP - 80
EP - 85
BT - 32nd International Conference on Information Networking, ICOIN 2018
PB - IEEE Computer Society
T2 - 32nd International Conference on Information Networking, ICOIN 2018
Y2 - 10 January 2018 through 12 January 2018
ER -