There are a number of choices for .NET crawlers. A few of them are extremely well architected, following TDD practices and solid patterns for extensibility. While most are excellent starts, none but arachnode.net provides sliding-window caching to RAM and to disk, mechanisms for data storage, and suitability for use beyond small hobby crawls.
Almost all .NET crawler projects progress as far as 'get the web page, parse the hyperlinks, repeat' and not much farther. Why? In many ways, crawling the internet is a solved problem. Developing a scalable web crawler that is ACID compliant and provides data storage, performance counters, and console, GUI, web and service applications requires a significant investment of time and money beyond the point where most other crawlers stop.
"Crawling the web is easy; crawling it efficiently is not."
Nutch (Java) was in development for 7 years before it reached Version 1.0. Hundreds of people contributed to Nutch, which had backing from Yahoo! executives, and it ultimately stored data in a proprietary format which did not scale (or operate) beyond a few million documents. Heritrix encourages users to skip Version 2.0 entirely. arachnode.net is 9 years old and is developed on almost every day. There have been a few rough patches, a few breaking changes and more than a few (a lot, really) nights, holidays and weekends spent coding, communicating with users, fixing bugs, updating marketing and documentation, and ultimately losing sleep to provide a solid, thorough and complete .NET web crawler.
Therefore... questions to ask of new crawler projects:
1.) Is the crawler you are considering caching only to RAM? If it is then...
- ...when you run out of RAM, the crawl (and your application) stops. There is no accurate way to predict how many pages you will discover at any given depth of a crawl, and therefore how much RAM you will need to complete the crawl.
- ...you will be unable to interrupt/resume a crawl. Large crawls frequently require modification after the crawl has started.
- ...do cached references expire to make room for new ones, or does the cache simply grow until memory is exhausted?
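To make the RAM-only objection concrete, here is a minimal, language-agnostic sketch (in Python, not arachnode.net's actual C# implementation) of sliding-window caching: the most recently used pages stay in RAM, and older entries spill to disk instead of halting the crawl. The class and method names are illustrative only.

```python
import hashlib
import os
import pickle
import tempfile
from collections import OrderedDict

class SlidingWindowCache:
    """Keep the most recently used pages in RAM; spill older entries to
    disk so the crawl does not stop when memory fills up."""

    def __init__(self, max_in_ram, spill_dir=None):
        self.max_in_ram = max_in_ram
        self.ram = OrderedDict()  # key -> value, in LRU order
        self.spill_dir = spill_dir or tempfile.mkdtemp(prefix="crawlcache_")

    def _spill_path(self, key):
        # Hash the key so any URL becomes a valid, collision-resistant file name.
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
        return os.path.join(self.spill_dir, digest + ".bin")

    def put(self, key, value):
        self.ram[key] = value
        self.ram.move_to_end(key)
        while len(self.ram) > self.max_in_ram:
            # Evict the least recently used entry to disk, not to oblivion.
            old_key, old_value = self.ram.popitem(last=False)
            with open(self._spill_path(old_key), "wb") as f:
                pickle.dump(old_value, f)

    def get(self, key):
        if key in self.ram:
            self.ram.move_to_end(key)  # refresh recency
            return self.ram[key]
        path = self._spill_path(key)
        if os.path.exists(path):  # fault the entry back in from disk
            with open(path, "rb") as f:
                value = pickle.load(f)
            self.put(key, value)
            return value
        return None
```

With a cap of two entries, inserting a third spills the oldest to disk, and a later `get` faults it back in transparently; the crawl slows rather than dies when RAM runs out.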
2.) Is the crawler you are considering actually saving what you crawl? If it isn't then...
- ...how will you determine where the crawl has been to filter where it should go?
- ...how much time will it take to design an efficient and scalable database design?
- ...how much time will it take you to figure out what Windows will accept for storage paths?
- ...do you actually want to examine what you have crawled?
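Answering "where has the crawl been?" requires durable storage, not just an in-memory set. Below is a hedged sketch (Python with SQLite standing in for a real database design; the names `normalize` and `SeenUrls` are mine, not arachnode.net's) of a persistent visited-URL filter that survives interruption and restart.

```python
import sqlite3
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Canonicalize a URL so trivial variants dedupe to one entry:
    lowercase scheme/host, default path, fragment dropped."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))

class SeenUrls:
    """Persistent record of where the crawl has been. Backed by a real
    file-based database in practice; ':memory:' here for brevity."""

    def __init__(self, db_path=":memory:"):
        self.db = sqlite3.connect(db_path)
        self.db.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)")

    def check_and_add(self, url):
        """Return True if the URL is new (and record it), False if already seen."""
        try:
            self.db.execute("INSERT INTO seen VALUES (?)", (normalize(url),))
            self.db.commit()
            return True
        except sqlite3.IntegrityError:  # PRIMARY KEY violation: already crawled
            return False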
3.) Does the crawler provide a mechanism to measure performance other than console output? If it doesn't...
- ...how will you determine when changes to the crawler have positively/negatively affected performance?
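The point of performance counters is comparable numbers, not scrolling console text. A minimal sketch of the idea (illustrative Python; arachnode.net itself uses Windows performance counters) is a thread-safe set of named counters you can sample before and after a change:

```python
import threading
import time

class CrawlCounters:
    """Thread-safe named counters a crawler can sample to measure
    throughput, instead of eyeballing console output."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = {}
        self._start = time.monotonic()

    def increment(self, name, by=1):
        with self._lock:
            self._counts[name] = self._counts.get(name, 0) + by

    def rate(self, name):
        """Events per second since the counters were created."""
        elapsed = time.monotonic() - self._start
        with self._lock:
            return self._counts.get(name, 0) / elapsed if elapsed else 0.0

    def snapshot(self):
        """Point-in-time copy, safe to log or diff between runs."""
        with self._lock:
            return dict(self._counts)
```

Run the same seed list before and after a code change, diff the snapshots and rates, and you have an objective answer to "did that help?".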
4.) Is the crawler you are considering using a priority queue with round-robin domain shuffling, or is it just a FIFO operation? If these aren't implemented...
- ...the likelihood of being blocked increases significantly.
- ...the crawl rate will slow when all threads latch onto one domain.
5.) Is the crawler you are considering a re-implementation of arachnode.net/AN.Next? If it is then...
- ...why encounter new bugs from an incomplete re-write?
- ...why spend your time and money waiting for the features of arachnode.net/AN.Next to be implemented in your fledgling crawler?
Please seriously consider how much effort is required to produce a viable web crawler. I had absolutely no idea how much effort was required to create and maintain arachnode.net/AN.Next. Given the choice, I would have chosen an easier and less time-intensive project. It really is something to consider.
Sun, Mar 24 2013 5:04 PM