|
CobWeb (da Silva et al., 1999)
CobWeb uses a central "scheduler" and a series of distributed "collectors". The collectors parse the downloaded Web pages and send the discovered URLs to the scheduler, which in turn assigns them to the collectors. The scheduler enforces a breadth-first search order with a politeness policy to avoid overloading Web servers. The crawler is written in Perl.
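The scheduler's role can be illustrated with a minimal Python sketch (CobWeb itself is written in Perl, and this is not its code). The class name `Scheduler`, the per-host `delay` parameter, and the choice of a fixed delay as the politeness rule are all assumptions made for illustration.

```python
from collections import deque
from urllib.parse import urlsplit


class Scheduler:
    """Hypothetical central scheduler: FIFO frontier plus a per-host delay."""

    def __init__(self, seeds, delay=1.0):
        self.frontier = deque(seeds)   # FIFO order gives breadth-first crawling
        self.seen = set(seeds)         # every URL ever scheduled
        self.next_allowed = {}         # host -> earliest time of the next fetch
        self.delay = delay             # assumed politeness interval, in seconds

    def add(self, urls):
        """Called by collectors with the URLs they discovered while parsing."""
        for url in urls:
            if url not in self.seen:
                self.seen.add(url)
                self.frontier.append(url)

    def next_url(self, now):
        """Return the oldest queued URL whose host may be contacted at `now`."""
        for i, url in enumerate(self.frontier):
            host = urlsplit(url).netloc
            if self.next_allowed.get(host, 0.0) <= now:
                del self.frontier[i]
                self.next_allowed[host] = now + self.delay
                return url
        return None  # every queued host is still cooling down
```

Passing the clock value in as `now` keeps the sketch deterministic; a real scheduler would more likely read the time itself.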
|
Useful information...
02-14-2008 11:12
|
FAST Crawler (Risvik and Michelsen, 2002)
FAST Crawler is the crawler used by the FAST search engine, and a general description of its architecture is available. It is a distributed architecture in which each machine holds a "document scheduler" that maintains a queue of documents to be downloaded by a "document processor", which stores them in a local storage subsystem. Each crawler communicates with the other crawlers via a "distributor" module that exchanges hyperlink information.
|
World Wide Web Worm (McBryan, 1994)
World Wide Web Worm was a crawler used to build a simple index of document titles and URLs. The index could be searched with the Unix grep command.
|
Google Crawler (Brin and Page, 1998)
Google Crawler is described in some detail, but the reference covers only an early version of its architecture, which was written in C++ and Python. The crawler was integrated with the indexing process, because text parsing was done both for full-text indexing and for URL extraction. A URL server sent lists of URLs to be fetched by several crawling processes. During parsing, the URLs found were passed to a URL server that checked whether each URL had been seen before. If not, the URL was added to the queue of the URL server.
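As a rough illustration of how one parsing pass can feed both the index and the URL server, here is a small Python sketch. It is not Google's code: the names `PageParser`, `URLServer`, and `process`, and the dictionary used as an index, are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """One parsing pass that collects both index terms and outgoing links."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.words = []   # terms for the full-text index
        self.links = []   # absolute URLs to hand to the URL server

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        self.words.extend(data.lower().split())


class URLServer:
    """Keeps the fetch queue and drops URLs that were seen before."""

    def __init__(self):
        self.seen = set()
        self.queue = []

    def offer(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)


def process(url, html, url_server, index):
    """Parse a fetched page once; update the index and the URL server."""
    parser = PageParser(url)
    parser.feed(html)
    parser.close()
    for word in parser.words:
        index.setdefault(word, set()).add(url)   # word -> pages containing it
    for link in parser.links:
        url_server.offer(link)
```

Calling `process(url, html, server, index)` for each downloaded page leaves `server.queue` holding only URLs that had not been offered before.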
|
Mercator (Heydon and Najork, 1999; Najork and Heydon, 2001)
Mercator is a distributed, modular web crawler written in Java. Its modularity arises from the use of interchangeable "protocol modules" and "processing modules". Protocol modules determine how Web pages are acquired (e.g., by HTTP), and processing modules determine how Web pages are processed. The standard processing module just parses the pages and extracts new URLs, but other processing modules can be used to index the text of the pages or to gather statistics from the Web.
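The idea of interchangeable modules can be sketched in a few lines of Python. This is not Mercator's Java API: the `Crawler` class, its method names, the in-memory `FAKE_WEB` stand-in for HTTP, and the regex-based link extractor are all assumptions chosen to keep the example self-contained.

```python
import re
from urllib.parse import urlsplit

# Hypothetical stand-in for the network, so the sketch runs offline.
FAKE_WEB = {"http://example.com/": '<a href="http://example.com/about">About</a>'}


def fake_http(url):
    """Protocol module: returns the content for a URL (here from FAKE_WEB)."""
    return FAKE_WEB.get(url, "")


def extract_links(url, content):
    """Processing module: naive link extraction with a regular expression."""
    return re.findall(r'href="([^"]+)"', content)


class SizeStats:
    """Processing module that gathers statistics and discovers no URLs."""

    def __init__(self):
        self.total_chars = 0

    def __call__(self, url, content):
        self.total_chars += len(content)
        return []


class Crawler:
    def __init__(self):
        self.protocol_modules = {}     # URL scheme -> fetch function
        self.processing_modules = []   # callables (url, content) -> new URLs

    def register_protocol(self, scheme, fetch):
        self.protocol_modules[scheme] = fetch

    def add_processor(self, processor):
        self.processing_modules.append(processor)

    def handle(self, url):
        """Fetch with the module for the URL's scheme, then run every processor."""
        fetch = self.protocol_modules[urlsplit(url).scheme]
        content = fetch(url)
        discovered = []
        for processor in self.processing_modules:
            discovered.extend(processor(url, content))
        return discovered
```

Swapping `fake_http` for a real fetcher, or adding an indexing processor, requires no change to `Crawler.handle`, which is the point of the modular design.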
|
HotCrawler
HotCrawler is a crawler written in C and PHP. It crawls websites by visiting the URLs listed in its database, adds new URLs to its queue as it finds them, and runs separately from the search engine. If a URL has already been crawled during the queue session, it is added to the last queue session created. HotCrawler consists of two separate programs: one downloads pages and saves copies of them in a database, and the other determines when to next visit each page, based on many factors.
|
RBSE (Eichmann, 1994)
RBSE was the first published web crawler. It was based on two programs: the first, "spider", maintained a queue in a relational database, and the second, "mite", was a modified www ASCII browser that downloaded the pages from the Web.
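A crawl queue kept in a relational database might look like the following Python sketch. RBSE's actual schema is not described here, so the table layout, the function names, and the use of SQLite as the database are assumptions for illustration only.

```python
import sqlite3


def open_queue(path=":memory:"):
    """Create (if needed) a hypothetical queue table and return the connection."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS queue ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " url TEXT UNIQUE NOT NULL,"
        " fetched INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def enqueue(db, url):
    """Add a URL; the UNIQUE constraint makes repeated inserts a no-op."""
    db.execute("INSERT OR IGNORE INTO queue (url) VALUES (?)", (url,))
    db.commit()


def dequeue(db):
    """Return the oldest unfetched URL and mark it fetched, or None if empty."""
    row = db.execute(
        "SELECT id, url FROM queue WHERE fetched = 0 ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    db.execute("UPDATE queue SET fetched = 1 WHERE id = ?", (row[0],))
    db.commit()
    return row[1]
```

In this arrangement the queue-keeping program would call `enqueue`, while the downloading program would call `dequeue` and fetch whatever URL it returns.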
|
PolyBot (Shkapenyuk and Suel, 2002)
PolyBot is a distributed crawler written in C++ and Python, composed of a "crawl manager", one or more "downloaders", and one or more "DNS resolvers". Collected URLs are added to a queue on disk and processed later, in batch mode, to check whether they have already been seen. The politeness policy considers both third- and second-level domains (e.g., www.example.com and www2.example.com are third-level domains) because third-level domains are usually hosted by the same Web server.
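Both ideas can be sketched briefly in Python. This is not PolyBot's code: `politeness_key` naively takes the last two host labels as the second-level domain (which is wrong for suffixes such as co.uk), and `batch_new_urls` assumes the seen-URL check is done as a merge over sorted lists, which is one plausible way to run it in batch mode.

```python
from urllib.parse import urlsplit


def politeness_key(url):
    """Group hosts by a naive second-level domain (last two labels).

    www.example.com and www2.example.com both map to example.com, so they
    share one politeness budget. Assumption: no public-suffix handling.
    """
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def batch_new_urls(pending, seen_sorted):
    """Return the URLs in `pending` that are absent from `seen_sorted`.

    `seen_sorted` must be sorted; the pending batch is sorted once and the
    two sequences are then walked together in a single pass.
    """
    new = []
    i = 0
    for url in sorted(set(pending)):
        while i < len(seen_sorted) and seen_sorted[i] < url:
            i += 1
        if i == len(seen_sorted) or seen_sorted[i] != url:
            new.append(url)
    return new
```

Deferring the check to a batch means each collected URL costs no random disk lookup at discovery time; the price is that duplicates sit in the on-disk queue until the next merge.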
|
Labrador
Labrador is a closed-source web crawler that works with the open-source Terrier search engine.
|
Ubicrawler (Boldi et al., 2004)
Ubicrawler is a distributed crawler written in Java that has no central process. It is composed of a number of identical "agents", and the assignment function is calculated using consistent hashing of the host names. There is zero overlap, meaning that no page is crawled twice unless a crawling agent crashes (in which case another agent must re-crawl the pages of the failed agent). The crawler is designed to achieve high scalability and to tolerate failures.
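Consistent hashing of host names can be sketched as a hash ring in Python. This illustrates the general technique rather than Ubicrawler's Java implementation; the `HashRing` class, the use of SHA-1, and the `replicas` count of virtual points per agent are assumptions.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a 64-bit point on the ring."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, agents, replicas=100):
        self.replicas = replicas   # virtual points per agent, for balance
        self._ring = []            # sorted list of (point, agent)
        for agent in agents:
            self.add(agent)

    def add(self, agent):
        for r in range(self.replicas):
            bisect.insort(self._ring, (_hash(f"{agent}#{r}"), agent))

    def remove(self, agent):
        """Drop a crashed agent; only its hosts get reassigned."""
        self._ring = [(p, a) for p, a in self._ring if a != agent]

    def agent_for(self, host):
        """Return the agent owning the first ring point at or after the host's hash."""
        if not self._ring:
            raise LookupError("no agents available")
        i = bisect.bisect_right(self._ring, (_hash(host), ""))
        return self._ring[i % len(self._ring)][1]
```

Because every agent can compute `agent_for(host)` locally, no central process is needed, and removing a failed agent moves only that agent's hosts to the survivors.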
|
WebFountain (Edwards et al., 2001)
WebFountain is a distributed, modular crawler similar to Mercator but written in C++. It features a "controller" machine that coordinates a series of "ant" machines. After pages have been downloaded repeatedly, a change rate is inferred for each page, and a non-linear programming method is used to solve the system of equations that maximizes freshness. The authors recommend using this crawling order in the early stages of the crawl and then switching to a uniform crawling order, in which all pages are visited with the same frequency.
|
WebCrawler (Pinkerton, 1994)
WebCrawler was used to build the first publicly available full-text index of a subset of the Web. It used lib-WWW to download pages and another program to parse and order URLs for breadth-first exploration of the Web graph. It also included a real-time crawler that followed links based on the similarity of the anchor text to the provided query.
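Ranking links by anchor-text similarity can be sketched as follows. The similarity measure WebCrawler used is not stated here, so cosine similarity over term counts is an assumption, as are the function names `terms`, `cosine`, and `rank_links`.

```python
import math
import re
from collections import Counter


def terms(text):
    """Lower-case the text and count its alphanumeric terms."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_links(query, links):
    """Order (anchor_text, url) pairs by similarity to the query.

    Links whose anchor text shares no term with the query are dropped.
    """
    q = terms(query)
    scored = [(cosine(q, terms(anchor)), url) for anchor, url in links]
    scored.sort(key=lambda pair: -pair[0])
    return [url for score, url in scored if score > 0]
```

A query-driven crawler would follow the first URLs returned by `rank_links` before the others, instead of following links in plain breadth-first order.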
|
Spinn3r
Spinn3r is a crawler used to build Tailrank. It is written in Java, and the majority of its architecture is open source. Spinn3r is mostly oriented toward crawling the blogosphere.