| Feature | arachnode.net | codeplex.com crawler | sourceforge.net crawler | code.google.com crawler | developing your own |
| --- | --- | --- | --- | --- | --- |
| Actively developed | Yes | No | Yes | Yes | ? |
| AJAX/JavaScript/DOM integration/interaction | Yes | No | No | No | ? |
| Asynchronous content processing | Yes | No | No | No | ? |
| Console application | Yes | Yes | Yes | No | ? |
| Customization friendly | Yes | Yes | Yes | Yes | ? |
| .doc/.pdf/.ppt/.xls indexing | Yes | No | Yes | No | ? |
| Distributed crawling/multiple machines | Yes | No | Yes | No | ? |
| Dynamic data administration | Yes | No | No | No | ? |
| Extensible architecture | Yes | Yes | Yes | Yes | ? |
| Forms application | Yes | No | Yes | No | ? |
| Full lucene.net integration | Yes | No | No | No | ? |
| Graphical user interface | Yes | No | Yes | No | ? |
| GZip compression/decompression | Yes | No | Yes | Yes | ? |
| High-performance multi-threaded crawling engine | Yes | Yes | Yes | Yes | ? |
| HtmlAgilityPack integration | Yes | Yes | Yes | Yes | ? |
| Intelligent download throttling | Yes | No | No | No | ? |
| International language support | Yes | Yes | No | No | ? |
| Link analysis and reporting | Yes | No | No | No | ? |
| Local debugging proxy | Yes | No | No | No | ? |
| Low memory/processor footprint | Yes | Yes | Yes | No | ? |
| Multi-crawler, multi-database capabilities | Yes | No | No | No | ? |
| Performance counters | Yes | No | No | No | ? |
| Start/stop/pause/continue/save crawls | Yes | No | Yes | No | ? |
| Storage direct to SQL 2005/2008 and/or to disk | Yes | No | No | No | ? |
| Supports continuous crawling/indexing/searching | Yes | No | No | No | ? |
| Unlimited forum/email support | Yes | No | No | Yes | ? |
| Tested | Yes | ? | ? | ? | ? |
| Web and webservice interface | Yes | No | No | No | ? |
| Total cost | $99 / $499 | Free | Free | Free | $5,000+ |
:: Common Crawler Challenges
Actively developed: Is the source being developed on a regular schedule or are the updates few and far between? Development of arachnode.net is largely driven by customer requests and bug reports.
AJAX/JavaScript/DOM integration/interaction: Is both headed and headless JavaScript rendering provided in a true multi-process environment, allowing each crawl thread to process dynamic web page content simultaneously? Most JavaScript rendering engines use a single process, thereby limiting concurrent web page downloads to two.
Asynchronous content processing: When a web page is downloaded, are helper threads used to process the content, thereby allowing another web page to be downloaded simultaneously?
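The idea behind this question can be sketched in a few lines. The following is a minimal Python illustration, not arachnode.net code: `fake_download` and `process` are hypothetical stand-ins for a real HTTP request and for parsing/indexing work, and the helper threads come from the standard library's `ThreadPoolExecutor`.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_download(url):
    # Hypothetical stand-in for a real HTTP request (no network access here).
    return f"<html><title>{url}</title></html>"


def process(content):
    # Hypothetical stand-in for parsing/indexing done off the download thread.
    return len(content)


def process_while_downloading(urls, helper_count=2):
    """Download pages one by one while helper threads process earlier pages."""
    with ThreadPoolExecutor(max_workers=helper_count) as helpers:
        pending = {}
        for url in urls:
            content = fake_download(url)
            # Hand the content to a helper and move on to the next download
            # immediately; the download loop never waits for processing.
            pending[url] = helpers.submit(process, content)
        # Collect the processing results once every download has been issued.
        return {url: future.result() for url, future in pending.items()}
```

The download loop only blocks when results are collected at the end, so a slow `process` call does not hold up the next download.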
Console application: Does the source provide an example of how to use the crawler/class library as a command-line application?
Customization friendly: Does the source provide a way to easily customize the crawl pipeline?
.doc/.pdf/.ppt/.xls indexing: Is functionality provided to index Office and .pdf documents, including the more recent variants of these formats?
Distributed crawling/multiple machines: Can crawling resources be spread across multiple machines and multiple databases?
Dynamic data administration: Can crawl data be manipulated from an administrative web page or is SQL Server Management Studio or another management tool required?
Extensible architecture: Can the code be used in a variety of applications including GUI and service implementations?
Forms application: Is a Windows Forms (or equivalent) project included to create and manage crawl data?
Full lucene.net integration: Is continuous indexing and search supported, including automatic optimization of lucene.net indexes?
Graphical user interface: Is a GUI provided for managing crawl data?
GZip compression/decompression: Is GZip supported when downloading as well as storing and retrieving stored crawl data?
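Both halves of this question (download and storage) can be illustrated with Python's standard `gzip` module. This is a sketch under assumptions: the function names are hypothetical, and `headers` is assumed to be a plain dictionary of HTTP response headers.

```python
import gzip


def decode_body(headers: dict, body: bytes) -> bytes:
    """Decompress a downloaded body if the server sent it gzip-encoded."""
    if headers.get("Content-Encoding", "").lower() == "gzip":
        return gzip.decompress(body)
    return body


def store(content: str) -> bytes:
    """Compress page content before writing it to the crawl store."""
    return gzip.compress(content.encode("utf-8"))


def retrieve(blob: bytes) -> str:
    """Decompress page content read back from the crawl store."""
    return gzip.decompress(blob).decode("utf-8")
```

A crawler would typically also send `Accept-Encoding: gzip` with each request so that the server knows compressed responses are welcome.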
High-performance multi-threaded crawling engine: Does the base crawl rate without customization match that of a simple set of threads, each using a WebClient to pull from a synchronized queue?
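For reference, the baseline described here looks roughly like the following. This is a Python analogue rather than the .NET `WebClient` version: `fetch` is a hypothetical callable that takes a URL and returns its body, and `queue.Queue` plays the role of the synchronized queue.

```python
import queue
import threading


def baseline_crawl(urls, fetch, thread_count=4):
    """Fetch every URL using plain worker threads and a synchronized queue."""
    pending = queue.Queue()
    for url in urls:
        pending.put(url)

    pages = {}
    pages_lock = threading.Lock()

    def worker():
        while True:
            try:
                url = pending.get_nowait()
            except queue.Empty:
                return  # queue drained: this worker is finished
            body = fetch(url)
            with pages_lock:
                pages[url] = body

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return pages
```

Timing this loop against a crawler's default configuration on the same URL list gives the comparison the question asks for.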
HtmlAgilityPack integration: Is the popular HtmlAgilityPack included and customization examples provided?
Intelligent download throttling: Can the crawler be configured to automatically download data as fast as a webserver will serve the data but no faster? If web requests are canceled, are they marked as such and requested later?
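One possible shape for such throttling is sketched below in Python. It is an illustration only, not how arachnode.net implements it: the `Throttle` class, the doubling/halving policy, and the `fetch` callable (assumed to return `(elapsed_seconds, body)` and to raise `TimeoutError` when a request is cancelled) are all assumptions.

```python
import time


class Throttle:
    """Per-host delay that backs off on slow or cancelled requests."""

    def __init__(self, target_seconds=1.0, min_delay=0.0, max_delay=30.0):
        self.target_seconds = target_seconds
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.delay = min_delay

    def record(self, elapsed_seconds):
        if elapsed_seconds > self.target_seconds:
            # Server is struggling: double the wait (starting from 0.1 s).
            self.delay = min(self.max_delay, max(self.delay * 2, 0.1))
        else:
            # Server is keeping up: halve the wait.
            self.delay = max(self.min_delay, self.delay / 2)


def crawl_pass(urls, fetch, throttle, sleep=time.sleep):
    """Fetch each URL once; return fetched bodies and URLs to request later."""
    done, retry_later = {}, []
    for url in urls:
        sleep(throttle.delay)
        try:
            elapsed, body = fetch(url)
        except TimeoutError:
            # Mark the request as cancelled and slow down before continuing.
            retry_later.append(url)
            throttle.record(float("inf"))
            continue
        throttle.record(elapsed)
        done[url] = body
    return done, retry_later
```

The `retry_later` list can be fed back into a later `crawl_pass` call, which covers the second half of the question: cancelled requests are marked and requested again.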
International language support: Does the crawler detect the content-type and decode the character set(s) properly? If not, do Cyrillic and double-byte character sets look like boxes when debugging?
Local debugging proxy: Can all WebRequests, whether in-process or out-of-process, be trapped, examined, cancelled or modified?
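As an illustration of the decoding step, the Python sketch below reads the `charset` parameter from a `Content-Type` header and decodes the body with it. The function name and the UTF-8 fallback are assumptions; a production crawler would usually also inspect `<meta>` tags and byte-order marks.

```python
from email.message import Message


def decode_page(content_type: str, body: bytes, default: str = "utf-8") -> str:
    """Decode a response body using the charset named in Content-Type."""
    header = Message()
    header["Content-Type"] = content_type
    charset = header.get_content_charset() or default
    try:
        return body.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset: fall back rather than fail the crawl.
        return body.decode(default, errors="replace")
```

Decoding with the wrong character set is what produces the boxes or garbled text mentioned in the question.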
Link analysis and reporting: Is discovered content stored in a way that supports sourcing and popularity data? Are hyperlinks and their discoveries stored using a single-instance store methodology?
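A single-instance store for hyperlinks can be sketched as two tables: one row per distinct URL, and one row per (source, target) discovery. The Python example below uses an in-memory SQLite database as a stand-in for SQL Server; the table and function names are hypothetical and do not reflect the arachnode.net schema.

```python
import sqlite3


def open_store():
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE hyperlinks (
            id  INTEGER PRIMARY KEY,
            url TEXT UNIQUE NOT NULL          -- each URL is stored once
        );
        CREATE TABLE discoveries (
            source_id INTEGER NOT NULL REFERENCES hyperlinks(id),
            target_id INTEGER NOT NULL REFERENCES hyperlinks(id),
            PRIMARY KEY (source_id, target_id)
        );
    """)
    return db


def link_id(db, url):
    # Insert the URL only if it is new, then return its single stored id.
    db.execute("INSERT OR IGNORE INTO hyperlinks (url) VALUES (?)", (url,))
    return db.execute(
        "SELECT id FROM hyperlinks WHERE url = ?", (url,)
    ).fetchone()[0]


def record_discovery(db, source_url, target_url):
    db.execute(
        "INSERT OR IGNORE INTO discoveries VALUES (?, ?)",
        (link_id(db, source_url), link_id(db, target_url)),
    )


def popularity(db):
    """Return (url, inbound discovery count) pairs, most popular first."""
    return db.execute("""
        SELECT h.url, COUNT(*) FROM discoveries d
        JOIN hyperlinks h ON h.id = d.target_id
        GROUP BY h.url ORDER BY COUNT(*) DESC, h.url
    """).fetchall()
```

Because every URL appears once, the `discoveries` table answers both "where was this link found?" (sourcing) and "how many pages point to it?" (popularity).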
Multi-crawler, multi-database capabilities: Is the crawl setup flexible across multiple machines?
Performance counters: Are performance counters included to allow monitoring of crawl rate and cache consumption?
Start/stop/pause/continue/save crawls: Does the crawler allow you to stop a crawl, save to disk and then resume crawling from where you last stopped?
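Stopping, saving and resuming comes down to persisting the crawl frontier and the set of visited URLs. The Python sketch below is one minimal way to do that with a JSON file; the `CrawlState` class and the `get_links` callable (assumed to return the links found on a URL) are hypothetical.

```python
import json
from collections import deque


class CrawlState:
    """Frontier plus visited set, serializable to and from disk."""

    def __init__(self, frontier=(), visited=()):
        self.frontier = deque(frontier)
        self.visited = set(visited)

    def step(self, get_links):
        """Crawl the next URL in the frontier and queue its unseen links."""
        url = self.frontier.popleft()
        self.visited.add(url)
        for link in get_links(url):
            if link not in self.visited and link not in self.frontier:
                self.frontier.append(link)
        return url

    def save(self, path):
        state = {"frontier": list(self.frontier), "visited": sorted(self.visited)}
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(state, handle)

    @classmethod
    def load(cls, path):
        with open(path, encoding="utf-8") as handle:
            state = json.load(handle)
        return cls(state["frontier"], state["visited"])
```

Calling `save` after any `step` and `load` in a later process resumes the crawl from the same point, with no URL crawled twice.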
Storage direct to SQL 2005/2008 and/or to disk: Is a way to store crawl results provided, and is the storage location queryable and manageable?
Supports continuous crawling/indexing/searching: Can the crawler crawl and update the index while the search facilities search the index without interruption, locking or blocking?
Unlimited forum/email support: Are the authors available and willing to answer your questions, and is there a wealth of knowledge in the form of forum and blog content?
Tested: This goes far beyond writing unit tests for your code. Has the crawler been verified against crawler test/stress sites such as http://wvtesting2.com and http://test.arachnode.net? Has the code been in the public forum for a significant length of time to discover bugs, exceptions and edge cases beyond what you or your team could construct?
Web and webservice interface: Is there a web and webservice interface to manage and query crawled content?