arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE or MongoDB/RavenDB/Hadoop

Compare / Develop Your Own?

Feature                                           arachnode.net  codeplex.com  sourceforge.net  code.google.com  Developing
                                                                 crawler       crawler          crawler          your own
Actively developed                                Yes            No            Yes              Yes              ?
AJAX/JavaScript/DOM integration/interaction       Yes            No            No               No               ?
Asynchronous content processing                   Yes            No            No               No               ?
Console application                               Yes            Yes           Yes              No               ?
Customization friendly                            Yes            Yes           Yes              Yes              ?
.doc/.pdf/.ppt/.xls indexing                      Yes            No            Yes              No               ?
Distributed crawling/multiple machines            Yes            No            Yes              No               ?
Dynamic data administration                       Yes            No            No               No               ?
Extensible architecture                           Yes            Yes           Yes              Yes              ?
Forms application                                 Yes            No            Yes              No               ?
Full lucene.net integration                       Yes            No            No               No               ?
Graphical user interface                          Yes            No            Yes              No               ?
GZip compression/decompression                    Yes            No            Yes              Yes              ?
High-performance multi-threaded crawling engine   Yes            Yes           Yes              Yes              ?
HtmlAgilityPack integration                       Yes            Yes           Yes              Yes              ?
Intelligent download throttling                   Yes            No            No               No               ?
International language support                    Yes            Yes           No               No               ?
Link analysis and reporting                       Yes            No            No               No               ?
Local debugging proxy                             Yes            No            No               No               ?
Low memory/processor footprint                    Yes            Yes           Yes              No               ?
Multi-crawler, multi-database capabilities        Yes            No            No               No               ?
Performance counters                              Yes            No            No               No               ?
Start/stop/pause/continue/save crawls             Yes            No            Yes              No               ?
Storage direct to SQL 2005/2008 and/or to disk    Yes            No            No               No               ?
Supports continuous crawling/indexing/searching   Yes            No            No               No               ?
Unlimited forum/email support                     Yes            No            No               Yes              ?
Tested                                            Yes            ?             ?                ?                ?
Web and webservice interface                      Yes            No            No               No               ?
Total cost                                        $99/499        Free          Free             Free             $5,000+

Actively developed: Is the source being developed on a regular schedule, or are updates few and far between? arachnode.net development is largely driven by customer requests and bug reports.

AJAX/JavaScript/DOM integration/interaction: Is both headed and headless JavaScript rendering provided in a true multi-process environment, allowing each crawl thread to process dynamic web page content simultaneously? Most JavaScript rendering engines use a single process, thereby limiting concurrent web page downloads to two.

Asynchronous content processing: When a web page is downloaded, are helper threads used to process the content, thereby allowing another web page to be downloaded simultaneously?
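As a minimal sketch of this idea (written in Python for brevity, and not taken from the arachnode.net source), a download loop can hand each page to a pool of helper threads so the next download starts before the previous page has been processed. The `download` and `process` functions are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor


def download(url):
    # Hypothetical stand-in for a real HTTP request.
    return f"<html>content of {url}</html>"


def process(url, html):
    # Hypothetical stand-in for parsing, link extraction and indexing.
    return url, len(html)


def crawl(urls, helper_threads=4):
    """Download pages one after another while helper threads process them."""
    with ThreadPoolExecutor(max_workers=helper_threads) as helpers:
        futures = []
        for url in urls:
            html = download(url)  # the crawl thread downloads
            futures.append(helpers.submit(process, url, html))  # helpers process
        # Collect the processing results in submission order.
        return [future.result() for future in futures]


results = crawl(["http://example.com/a", "http://example.com/b"])
```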

Console application: Does the source provide an example of how to use the crawler/class library as a command line application?

Customization friendly: Does the source provide a way to easily customize the crawl pipeline?

.doc/.pdf/.ppt/.xls indexing: Is functionality provided that indexes Office and .pdf documents, including the more recent variants of these formats?

Distributed crawling/multiple machines: Can crawling resources be spread across multiple machines and multiple databases?
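One common way to spread a crawl across machines is to assign each host to a machine by hashing its name, so all pages of a site are handled in one place. The sketch below illustrates that general technique only; how arachnode.net actually distributes work is not described here, and `machine_for` is a hypothetical helper.

```python
import hashlib
from urllib.parse import urlsplit


def machine_for(url, machine_count):
    """Map a URL to a machine index using a stable hash of its host name."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer; the result is stable across runs.
    return int.from_bytes(digest[:8], "big") % machine_count


first = machine_for("http://example.com/page1", 4)
second = machine_for("http://example.com/page2", 4)
```

Both URLs share a host, so they map to the same machine index.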

Dynamic data administration: Can crawl data be manipulated from an administrative web page or is SQL Server Management Studio or another management tool required?

Extensible architecture: Can the code be used in a variety of applications including GUI and service implementations?

Forms application: Is a Windows Forms (or equivalent) project included to create and manage crawl data?

Full lucene.net integration: Are continuous indexing and search supported, including automatic optimization of lucene.net indexes?

Graphical user interface: Is a GUI provided for managing crawl data?

GZip compression/decompression: Is GZip supported when downloading, as well as when storing and retrieving crawl data?
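Both halves of this question can be sketched with Python's standard `gzip` module: requesting and decoding a compressed response, and compressing content before it is stored. The function names are illustrative assumptions, not arachnode.net APIs.

```python
import gzip


def request_headers():
    # Ask the web server to send a GZip-compressed response body.
    return {"Accept-Encoding": "gzip"}


def decode_response(body: bytes, content_encoding: str) -> bytes:
    # Decompress only when the server reports a GZip-compressed body.
    if content_encoding.lower() == "gzip":
        return gzip.decompress(body)
    return body


def store(html: str) -> bytes:
    # Compress page content before writing it to a database or to disk.
    return gzip.compress(html.encode("utf-8"))


def retrieve(blob: bytes) -> str:
    # Reverse of store(): decompress, then decode back to text.
    return gzip.decompress(blob).decode("utf-8")


page = "<html>" + "<p>repeated markup</p>" * 200 + "</html>"
blob = store(page)
```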

High-performance multi-threaded crawling engine: Does the base crawl rate, without customization, equal that of a simple set of threads, each containing a WebClient, pulling from a synchronized queue?
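The baseline described here can be sketched as follows. This is an illustration in Python, not arachnode.net code: `fetch` is a hypothetical stand-in for a WebClient download, and `queue.Queue` plays the role of the synchronized queue.

```python
import queue
import threading


def fetch(url):
    # Hypothetical stand-in for a WebClient-style download.
    return f"body of {url}"


def baseline_crawl(urls, thread_count=4):
    """Run a fixed set of threads that pull URLs from one synchronized queue."""
    work = queue.Queue()
    for url in urls:
        work.put(url)

    results = {}
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                url = work.get_nowait()
            except queue.Empty:
                return  # no work left for this thread
            body = fetch(url)
            with results_lock:
                results[url] = body

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


pages = baseline_crawl([f"http://example.com/{n}" for n in range(10)])
```

Timing this loop against a crawler on the same URL list gives the comparison the question asks for.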

HtmlAgilityPack integration: Is the popular HtmlAgilityPack included and customization examples provided?

Intelligent download throttling: Can the crawler be configured to automatically download data as fast as a webserver will serve the data but no faster?  If web requests are canceled, are they marked as such and requested later?
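A throttle of this kind might look like the sketch below. It assumes one simple policy (ease off slightly after each success, back off sharply after a cancellation); the policy arachnode.net uses is not stated here, and the class and method names are hypothetical.

```python
from collections import deque


class HostThrottle:
    """Adjusts the pause between requests to one host based on how it responds."""

    def __init__(self, delay=1.0, min_delay=0.1, max_delay=60.0):
        self.delay = delay
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.retry_later = deque()  # canceled requests, marked for a later attempt

    def succeeded(self):
        # The server kept up: shorten the pause a little.
        self.delay = max(self.min_delay, self.delay * 0.9)

    def canceled(self, url):
        # The server fell behind: back off sharply and remember the request.
        self.delay = min(self.max_delay, self.delay * 2.0)
        self.retry_later.append(url)


throttle = HostThrottle(delay=1.0)
throttle.canceled("http://example.com/slow")  # pause doubles
throttle.succeeded()                          # pause shrinks by ten percent
```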

International language support: Does the crawler correctly detect the content type and decode the character set(s)? If not, do Cyrillic and double-byte character sets look like boxes when debugging?
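A minimal sketch of charset handling, assuming the character set is declared in the Content-Type header (real pages may also declare it in a meta tag or not at all, which this sketch does not cover). The helper names are illustrative.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset() or default


def decode_body(body: bytes, content_type: str) -> str:
    """Decode a response body using the declared charset, with a safe fallback."""
    charset = charset_from_content_type(content_type)
    try:
        return body.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset: fall back rather than fail the crawl.
        return body.decode("utf-8", errors="replace")


cyrillic = decode_body("Привет".encode("cp1251"), "text/html; charset=windows-1251")
japanese = decode_body("日本語".encode("shift_jis"), "text/html; charset=Shift_JIS")
```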

Local debugging proxy: Can all WebRequests, whether in-process or out-of-process, be trapped, examined, cancelled or modified?

Link analysis and reporting: Is discovered content stored in a way that supports sourcing and popularity analysis? Are hyperlinks and their discoveries stored using a single-instance store methodology?
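One reading of a single-instance store is that each hyperlink is stored exactly once, while every discovery of it is recorded as a reference to that one row. The in-memory sketch below illustrates that reading only; it is not arachnode.net's schema, and `LinkStore` is a hypothetical name.

```python
from collections import Counter


class LinkStore:
    """Stores each URL once and records every discovery as a pair of IDs."""

    def __init__(self):
        self.ids = {}          # url -> id; a URL is never stored twice
        self.discoveries = []  # (source_id, target_id) per discovered link

    def _id(self, url):
        return self.ids.setdefault(url, len(self.ids))

    def record(self, source_url, target_url):
        self.discoveries.append((self._id(source_url), self._id(target_url)))

    def popularity(self):
        """Count how many times each URL was discovered as a link target."""
        urls = {url_id: url for url, url_id in self.ids.items()}
        counts = Counter(target for _, target in self.discoveries)
        return {urls[url_id]: count for url_id, count in counts.items()}

    def sources_of(self, target_url):
        """List the pages on which a URL was discovered."""
        urls = {url_id: url for url, url_id in self.ids.items()}
        target = self.ids.get(target_url)
        return [urls[s] for s, t in self.discoveries if t == target]


store = LinkStore()
store.record("http://a.example/", "http://c.example/")
store.record("http://b.example/", "http://c.example/")
store.record("http://a.example/", "http://b.example/")
```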

Multi-crawler, multi-database capabilities: Is the crawl setup flexible across multiple machines?

Performance counters: Are performance counters included to allow monitoring of crawl rate and cache consumption?

Start/stop/pause/continue/save crawls: Does the crawler allow you to stop a crawl, save to disk and then resume crawling from where you last stopped?
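Saving and resuming comes down to persisting the crawl state, at minimum the queue of pending URLs and the set of URLs already visited. A minimal sketch, assuming a JSON file as the storage format (an illustration, not arachnode.net's format):

```python
import json
import os
import tempfile
from collections import deque


def save_crawl(path, frontier, visited):
    """Write the pending URLs (in order) and the visited URLs to disk."""
    state = {"frontier": list(frontier), "visited": sorted(visited)}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(state, f)


def resume_crawl(path):
    """Load a saved crawl and return (frontier, visited)."""
    with open(path, encoding="utf-8") as f:
        state = json.load(f)
    return deque(state["frontier"]), set(state["visited"])


path = os.path.join(tempfile.mkdtemp(), "crawl_state.json")
save_crawl(
    path,
    deque(["http://example.com/b", "http://example.com/c"]),
    {"http://example.com/a"},
)
frontier, visited = resume_crawl(path)
```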

Storage direct to SQL 2005/2008 and/or to disk: Is a way to store crawl results provided, and is that storage queryable and manageable?

Supports continuous crawling/indexing/searching: Can the crawler crawl and update the index while the search facilities search the index without interruption, locking or blocking?

Unlimited forum/email support: Are the authors available and willing to answer your questions, and is there a wealth of knowledge in the form of forum and blog content?

Tested: This goes far beyond writing unit tests for your code.  Has the crawler been verified against crawler test/stress sites such as http://wvtesting2.com and http://test.arachnode.net?  Has the code been in the public forum for a significant length of time to discover bugs, exceptions and edge cases beyond what you or your team could construct?

Web and webservice interface: Is there a web interface and a web service to manage and query crawled content?

copyright 2004-2017, arachnode.net LLC