A fast community based algorithm for generating web crawler seeds set

Daneshpajouh, S ; Sharif University of Technology | 2008

Type of Document: Article
Publisher: 2008
Abstract:
In this paper, we present a new and fast algorithm for generating the seeds set for web crawlers. A typical crawler normally starts from a fixed set, such as DMOZ links, and then continues crawling from the URLs found in those web pages. Crawlers are supposed to download more good pages in fewer iterations. Crawled pages are good if they have high PageRanks and come from different communities. We present a new algorithm with O(n) running time for generating a crawler's seeds set based on the HITS algorithm. Starting from the seeds set generated by our algorithm, a crawler can download high-quality web pages from different communities in fewer iterations.
Keywords:
Electronic commerce ; Government data processing ; Hypertext systems ; Information systems ; Seed ; Websites ; World wide web ; Communities ; Crawl quality metric ; Crawling ; HITS ; Hyperlink analysis ; Seed quality metric ; Web graph ; Quality control
Source: WEBIST 2008 - 4th International Conference on Web Information Systems and Technologies, Funchal, Madeira, 4 May 2008 through 7 May 2008 ; Volume 2 , 2008 , Pages 98-105 ; 9789898111265 (ISBN)
URL: http://sharif.ir/~ghodsi/papers/shervin-nasiri-webist2008.pdf
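
Illustrative sketch (not taken from the paper): the abstract says the seeds are chosen using HITS and should come from different communities, but this record does not give the procedure. The Python below is one possible reading under stated assumptions: run a plain HITS power iteration, take the page with the highest hub score as a seed, drop that page and everything it links to as a crude stand-in for "its community", and repeat. The function names `hits` and `pick_seeds`, the removal rule, and the toy link graph are all hypothetical, and the sketch makes no attempt to reproduce the O(n) running time claimed in the abstract.

```python
import math
from collections import defaultdict


def hits(edges, iterations=50):
    """Plain HITS power iteration over a list of (source, target) links."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = defaultdict(list)
    in_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)
        in_links[dst].append(src)

    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {n: sum(hub[s] for s in in_links[n]) for n in nodes}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub score: sum of authority scores of the pages linked to.
        hub = {n: sum(auth[d] for d in out_links[n]) for n in nodes}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth


def pick_seeds(edges, num_seeds):
    """Assumed selection rule: top hub, then remove it and its targets."""
    remaining = list(edges)
    seeds = []
    while remaining and len(seeds) < num_seeds:
        hub, _ = hits(remaining)
        # Ties are broken alphabetically because the keys are sorted first.
        best = max(sorted(hub), key=hub.get)
        seeds.append(best)
        # Crude "community": the seed plus every page it links to.
        covered = {best} | {d for s, d in remaining if s == best}
        remaining = [
            (s, d) for s, d in remaining
            if s not in covered and d not in covered
        ]
    return seeds


# Toy graph with two disconnected groups of pages (made-up data).
edges = [
    ("a1", "x"), ("a1", "y"), ("a1", "z"),
    ("a2", "x"), ("a2", "y"),
    ("b1", "p"), ("b1", "q"),
]
print(pick_seeds(edges, 2))
```

On this toy graph the first seed comes from the larger group and the second from the other group, which is the behaviour the abstract asks for: seeds spread across communities instead of clustered in one.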