arachnode.net
NCrawler

NCrawler is an extremely well written application, following intelligent programming standards and organizational practices.  Esben Carlsen, the author, is undoubtedly a gifted programmer and has demonstrated a level of programming proficiency well above that of the majority of code I have worked with in my professional programming career.  In fact, I was so impressed by NCrawler's pipeline organization that I adopted the notion of attaching steps via delegates in arachnode.net, and it served as the inspiration for how AN.Next's plugin architecture operates.

Things I like about NCrawler:

1.) The pipeline architecture.

  • The simple attachment of plugins (pipeline steps) in order of execution is an elegant way to wire a pipeline, as opposed to reading from a .config file, and the use of a params[] array is intelligent design.  A sketch of the pattern follows.
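
A minimal sketch of the params-based wiring pattern (the type names below are hypothetical, illustrating the idea rather than NCrawler's actual API):

    using System;
    using System.Collections.Generic;

    public interface IPipelineStep
    {
        void Process(CrawlContext context);
    }

    public class CrawlContext
    {
        public Uri AbsoluteUri { get; set; }
        public byte[] Data { get; set; }
    }

    public class Crawler
    {
        private readonly List<IPipelineStep> _pipeline;

        // params lets callers list steps inline, in execution order,
        // instead of declaring them in a .config file.
        public Crawler(Uri startUri, params IPipelineStep[] pipeline)
        {
            StartUri = startUri;
            _pipeline = new List<IPipelineStep>(pipeline);
        }

        public Uri StartUri { get; private set; }

        public void ProcessDownloadedPage(CrawlContext context)
        {
            foreach (IPipelineStep step in _pipeline)
            {
                step.Process(context); // steps run in the order supplied
            }
        }
    }

    // Usage (ParseStep and StoreStep would be IPipelineStep implementations):
    // new Crawler(new Uri("http://example.com"), new ParseStep(), new StoreStep());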

2.) The property bag.

  • arachnode.net is similar in that a hierarchy is provided, as well as a 'Tag/Object' property.  AN.Next is closer to NCrawler's property bag in how it organizes pipeline properties.  Overall, I enjoy the similarity; a sketch of the idea follows.
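
A minimal property-bag sketch (illustrative only; not NCrawler's or AN's exact class):

    using System.Collections.Generic;

    // Pipeline steps read and write loosely-typed, per-page state here as a
    // page moves through the pipeline, plus a general-purpose 'Tag' slot.
    public class PropertyBag
    {
        private readonly Dictionary<string, object> _properties =
            new Dictionary<string, object>();

        public object this[string key]
        {
            get
            {
                object value;
                return _properties.TryGetValue(key, out value) ? value : null;
            }
            set { _properties[key] = value; }
        }

        // Catch-all slot, similar to the 'Tag/Object' property mentioned above.
        public object Tag { get; set; }
    }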

No other .NET crawler comes close to NCrawler in terms of completeness, code quality and how little scaffolding remains before it could become a viable large-scale crawler.  And yet, NCrawler is far behind arachnode.net.

However, there are a number of things that should discourage you from implementing NCrawler in your projects.

While I am biased, and do want your business, I want to save you the time of discovering what NCrawler is missing on your own.  Developing a web crawler is almost entirely an exercise in the unknown, and in patience.  You won't know what you don't know until you stumble upon it or someone else points it out to you, and either way you'll have to wait for those illuminations.  Esben, the author of NCrawler, took a little over two years to get NCrawler to where it is today, at least as far as the commits show, and it is likely that at least a year was spent on NCrawler before the first check-in.  The total time required to bring NCrawler to a point where it presents as a solid framework, and to work out its quirks, should be a solid factor in deciding that you probably shouldn't try to write a crawler yourself, but rather spend $99/$249 and receive support for a product which is mature and works.

If your choice to use NCrawler comes down to $99/$249 vs. $FREE, I would rather you pay me what you are able than spend time implementing any of the fixes below.  Your time is worth money, and while NCrawler is a GREAT start, it ultimately isn't ready for large-scale production use.

1.) It's a dead project.  There haven't been any code updates in over a year and most support requests are unanswered.

  • I have over 15,000 posts in response to users' questions, and most questions are answered within the day.  I am available by phone at (650) 76A-NODE (762-6633), and 8-10 hours a day via Skype at arachnodedotnet.

2.) It's slow.  With matching configurations among arachnode.net, AN.Next and NCrawler, running SimpleCrawlDemo.Run() yielded a 40% decrease in performance for NCrawler.

  • SimpleCrawlDemo doesn't use disk-backed storage and is entirely RAM-based, while arachnode.net and AN.Next use both, and both were faster even after incurring the overhead of caching to disk.
  • The crawl threads in NCrawler do not have a backing/helper thread as in arachnode.net.  Once a crawl thread in NCrawler has finished downloading and begins whatever processing is wired into the pipeline steps, crawling effectively stops for that thread.  If you have long-running operations (1-2 seconds is a long-running operation) in an NCrawler plugin, your anticipated 10 threads often reduce to '0' threads crawling.  As downloading the byte[] array for a WebPage isn't a computationally expensive operation (but may take 1-2 seconds), it makes sense to pass the data to a helper thread to finish while the downloading thread downloads another AbsoluteUri.  With NCrawler you would have to implement your own async methodology; arachnode.net implements an async helper thread for you, as sketched below.
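
A sketch of that handoff, assuming a BlockingCollection-based helper thread (an assumed design, not arachnode.net's actual source): download threads enqueue finished pages and immediately return to downloading, while a dedicated helper thread absorbs the long-running processing.

    using System;
    using System.Collections.Concurrent;
    using System.Threading;

    public class DownloadedPage
    {
        public Uri AbsoluteUri { get; set; }
        public byte[] Data { get; set; }
    }

    public class CrawlProcessor : IDisposable
    {
        private readonly BlockingCollection<DownloadedPage> _queue =
            new BlockingCollection<DownloadedPage>();
        private readonly Thread _helperThread;

        public CrawlProcessor(Action<DownloadedPage> process)
        {
            _helperThread = new Thread(() =>
            {
                // Consume pages as they arrive; blocks while the queue is empty.
                foreach (DownloadedPage page in _queue.GetConsumingEnumerable())
                {
                    process(page); // the 1-2 second work happens off the crawl thread
                }
            });
            _helperThread.IsBackground = true;
            _helperThread.Start();
        }

        // Called by a download thread; returns immediately.
        public void Enqueue(DownloadedPage page)
        {
            _queue.Add(page);
        }

        public void Dispose()
        {
            _queue.CompleteAdding(); // drain, then let the helper thread exit
            _helperThread.Join();
        }
    }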

3.) There is no way to stop a crawl and save its state.  Once a crawl is cancelled, there is no way to resume where it left off.

  • In a multi-million page crawl, if you need to stop the crawl for any reason, you'll have to start over with NCrawler.  Imagine crawling for a day, needing to make a change, and having to start completely over.  A sketch of resumable state follows.
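
A sketch of resumable crawl state, assuming a simple tab-separated file format (illustrative; not any product's actual format): persist the pending frontier and the visited set, then reload both instead of restarting from the seed.

    using System.Collections.Generic;
    using System.IO;

    public class CrawlState
    {
        public Queue<string> Frontier = new Queue<string>();
        public HashSet<string> Visited = new HashSet<string>();

        public void Save(string path)
        {
            using (StreamWriter writer = new StreamWriter(path))
            {
                foreach (string visited in Visited) writer.WriteLine("V\t" + visited);
                foreach (string pending in Frontier) writer.WriteLine("P\t" + pending);
            }
        }

        public static CrawlState Load(string path)
        {
            CrawlState state = new CrawlState();
            foreach (string line in File.ReadLines(path))
            {
                string[] parts = line.Split('\t');
                if (parts[0] == "V") state.Visited.Add(parts[1]);
                else state.Frontier.Enqueue(parts[1]);
            }
            return state;
        }
    }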

4.) There is no distinction between depth-first crawling and breadth-first crawling.  The freshest content is almost always found at or near the root of a site, and if you are limited by time and want the 'best' content, breadth-first crawling is preferred.  NCrawler crawls based on what it finds first, without distinction.

  • http://en.wikipedia.org/wiki/Breadth-first_search
  • https://www.google.com/search?q=breadth-first+crawling
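
The distinction comes down to the frontier data structure: a FIFO queue yields breadth-first order, a LIFO stack yields depth-first.  A sketch (not any particular crawler's scheduler):

    using System;
    using System.Collections.Generic;

    public class BreadthFirstFrontier
    {
        private readonly Queue<Uri> _queue = new Queue<Uri>();

        public void Add(Uri absoluteUri) { _queue.Enqueue(absoluteUri); }

        // Pages discovered at depth N are all crawled before depth N + 1,
        // keeping the crawl at or near the root of a site.
        public Uri Next() { return _queue.Dequeue(); }
    }

    public class DepthFirstFrontier
    {
        private readonly Stack<Uri> _stack = new Stack<Uri>();

        public void Add(Uri absoluteUri) { _stack.Push(absoluteUri); }

        // The most recently discovered (deepest) link is crawled first.
        public Uri Next() { return _stack.Pop(); }
    }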

5.) NCrawler doesn't close the WebRequest/WebResponse streams.

  • .NET doesn't close these by default, and thus memory consumption increases without bound.  The fix is to dispose both the response and its stream, as sketched below.
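
A sketch of the disposal pattern a download loop needs; both the response and its stream are released even when an exception is thrown:

    using System;
    using System.IO;
    using System.Net;

    public static class PageDownloader
    {
        public static byte[] Download(Uri absoluteUri)
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(absoluteUri);

            using (WebResponse response = request.GetResponse())
            using (Stream responseStream = response.GetResponseStream())
            using (MemoryStream memoryStream = new MemoryStream())
            {
                responseStream.CopyTo(memoryStream); // .NET 4+
                return memoryStream.ToArray();
            } // the connection is released here, not when the GC gets around to it
        }
    }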

6.) No way to set maximum memory usage and no sliding-window cache.

  • A Crawler should be able to limit its RAM consumption, expire stale cache entries to make room for fresh ones, and overflow its cache to disk.  A sketch of the first two follows.
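
A sketch of the RAM cap and sliding window using the BCL's MemoryCache (.NET 4; reference System.Runtime.Caching.dll); overflowing to disk would still require custom code:

    using System;
    using System.Collections.Specialized;
    using System.Runtime.Caching;

    public class CrawlCache
    {
        private readonly MemoryCache _cache;

        public CrawlCache(int memoryLimitMegabytes)
        {
            NameValueCollection config = new NameValueCollection();
            // Hard ceiling on cache RAM; the cache evicts entries to stay under it.
            config.Add("cacheMemoryLimitMegabytes", memoryLimitMegabytes.ToString());
            _cache = new MemoryCache("CrawlCache", config);
        }

        public void Add(string absoluteUri, byte[] data)
        {
            // Sliding window: entries untouched for 10 minutes expire,
            // making room for fresh ones.
            CacheItemPolicy policy = new CacheItemPolicy
            {
                SlidingExpiration = TimeSpan.FromMinutes(10)
            };
            _cache.Set(absoluteUri, data, policy);
        }

        public byte[] Get(string absoluteUri)
        {
            return _cache.Get(absoluteUri) as byte[];
        }
    }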

7.) No performance counters.

  • There isn't a way to determine how fast NCrawler is crawling, or what impact your changes have had on performance.  A sketch of custom counters follows.
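
A sketch of custom Windows performance counters for crawl throughput; the category and counter names below are hypothetical:

    using System.Diagnostics;

    public static class CrawlCounters
    {
        private static PerformanceCounter _pagesPerSecond;

        public static void Initialize()
        {
            const string category = "Crawler";

            // Creating a category requires administrative rights, so this is
            // typically done once at install time.
            if (!PerformanceCounterCategory.Exists(category))
            {
                CounterCreationDataCollection counters = new CounterCreationDataCollection();
                counters.Add(new CounterCreationData(
                    "Pages/sec",
                    "Pages downloaded per second.",
                    PerformanceCounterType.RateOfCountsPerSecond32));

                PerformanceCounterCategory.Create(
                    category,
                    "Crawler throughput counters.",
                    PerformanceCounterCategoryType.SingleInstance,
                    counters);
            }

            _pagesPerSecond = new PerformanceCounter(category, "Pages/sec", false);
        }

        // Call once per downloaded page; perfmon.exe charts the rate live,
        // before and after each change you make.
        public static void PageDownloaded()
        {
            _pagesPerSecond.Increment();
        }
    }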

8.) There isn't a way to restrict a crawl to a domain and/or subdomain.

  • If you wanted to crawl 'facebook.com' with NCrawler, you'd have to write the domain parsing yourself.  System.Uri is NOT adequate to determine the Host/Domain/Subdomain of an AbsoluteUri.
  • AN provides explicit mapping, Host/Domain/Subdomain parsing and restriction, and handles cases like 'ebay.co.uk'.  ('ebay' is NOT the subdomain...)  See the sketch below.
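
A sketch of why this needs a public-suffix list; a deliberately tiny hard-coded list stands in for the real one (publicsuffix.org):

    using System;

    public static class DomainParser
    {
        // 'co.uk' is a public suffix, so the registrable domain of
        // www.ebay.co.uk is ebay.co.uk; System.Uri alone can't tell you that.
        private static readonly string[] Suffixes = { "co.uk", "com", "uk", "net" };

        public static string GetRegistrableDomain(Uri absoluteUri)
        {
            string host = absoluteUri.Host;           // e.g. "www.ebay.co.uk"
            string bestSuffix = null;

            foreach (string suffix in Suffixes)
            {
                if (host.EndsWith("." + suffix, StringComparison.OrdinalIgnoreCase) &&
                    (bestSuffix == null || suffix.Length > bestSuffix.Length))
                {
                    bestSuffix = suffix;              // longest matching suffix wins
                }
            }

            if (bestSuffix == null) return host;

            // Keep one label to the left of the suffix: "ebay" + "." + "co.uk".
            string remainder = host.Substring(0, host.Length - bestSuffix.Length - 1);
            string[] labels = remainder.Split('.');
            return labels[labels.Length - 1] + "." + bestSuffix;
        }
    }

    // GetRegistrableDomain(new Uri("http://www.ebay.co.uk/")) returns "ebay.co.uk".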

9.) NCrawler uses a scheduled ThreadPool, which is slower.

  • A Crawler should use dedicated threads, one for each intended simultaneous download.  There is no need for a crawl process to wait for an available thread; spin up and re-use exactly what you need, as sketched below.
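
A sketch of dedicated, long-lived crawl threads fed from a shared frontier (an assumed design, not any product's actual source):

    using System;
    using System.Collections.Concurrent;
    using System.Threading;

    public class CrawlThreads
    {
        private readonly BlockingCollection<Uri> _frontier =
            new BlockingCollection<Uri>();
        private readonly Thread[] _threads;

        public CrawlThreads(int simultaneousDownloads, Action<Uri> download)
        {
            _threads = new Thread[simultaneousDownloads];

            for (int i = 0; i < simultaneousDownloads; i++)
            {
                _threads[i] = new Thread(() =>
                {
                    // Each thread loops for the life of the crawl; no waiting
                    // for the ThreadPool to grant a worker.
                    foreach (Uri absoluteUri in _frontier.GetConsumingEnumerable())
                    {
                        download(absoluteUri);
                    }
                });
                _threads[i].IsBackground = true;
                _threads[i].Start();
            }
        }

        public void Add(Uri absoluteUri) { _frontier.Add(absoluteUri); }
    }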

10.) There is no way to supply NCrawler with a list of AbsoluteUris to crawl.

  • If you know the distinct list of AbsoluteUris you'd like to crawl, you can disable disk caching with AN and crawl at a faster rate.  NCrawler takes only a single AbsoluteUri as a starting parameter; it shouldn't be necessary to instantiate a new Crawler instance for each AbsoluteUri you wish to crawl, and there is no way to feed additional AbsoluteUris to a running crawl.

11.) NCrawler provides a mechanism to filter images, files and other content types, but does so via extension association.

  • This is incorrect: a .jpg may be downloaded from an AbsoluteUri ending in .htm.
  • Incorrect filtering means you WILL download images and files, even if you don't intend to.  The reliable signal is the Content-Type response header, as sketched below.
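
A sketch of filtering on the Content-Type response header instead of the URI's extension; the server, not the file name, says what is actually coming back:

    using System;
    using System.Net;

    public static class ContentTypeFilter
    {
        public static bool IsHtml(Uri absoluteUri)
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(absoluteUri);
            request.Method = "HEAD"; // headers only, no body; some servers
                                     // reject HEAD, so falling back to GET
                                     // with an early abort is common

            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            {
                // A URI ending in .htm can return image/jpeg, and vice versa.
                return response.ContentType != null &&
                       response.ContentType.StartsWith("text/html",
                           StringComparison.OrdinalIgnoreCase);
            }
        }
    }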

12.) No storage of downloaded data.

  • If you wanted to actually save what you downloaded with NCrawler, you'd have to implement the storage yourself.
  • There is no logging of exceptions, either.

13.) No prioritization of AbsoluteUris.

  • There are many cases where you'd like to crawl by priority, such that a link to google.com is crawled before altavista.com.  NCrawler does not provide functionality to prioritize crawling; a sketch of a prioritized frontier follows.
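
A sketch of a prioritized frontier; .NET has no built-in priority queue, so priority-keyed buckets stand in:

    using System;
    using System.Collections.Generic;

    public class PrioritizedFrontier
    {
        // Lower numbers are crawled first, so a seed list can rank
        // google.com ahead of altavista.com.
        private readonly SortedDictionary<int, Queue<Uri>> _buckets =
            new SortedDictionary<int, Queue<Uri>>();

        public void Add(Uri absoluteUri, int priority)
        {
            Queue<Uri> bucket;
            if (!_buckets.TryGetValue(priority, out bucket))
            {
                bucket = new Queue<Uri>();
                _buckets.Add(priority, bucket);
            }
            bucket.Enqueue(absoluteUri);
        }

        public Uri Next()
        {
            foreach (KeyValuePair<int, Queue<Uri>> pair in _buckets)
            {
                if (pair.Value.Count > 0) return pair.Value.Dequeue();
            }
            return null; // frontier is empty
        }
    }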

14.) No support for JavaScript.

  • This isn't trivial, at all.  To achieve true multi-process HTML rendering you'll need to implement cross-process messaging.  Using the .NET WebBrowser control isn't advisable, as it doesn't support more than two simultaneous renderings and is EXTREMELY expensive.  arachnode.net uses mshtml.dll, which powers SHDocVw.dll, which in turn powers the WebBrowser control, but is 6-10 times faster and uses 1/10th the CPU time.

 


Posted Mon, Dec 3 2012 9:20 AM by arachnode.net

copyright 2004-2017, arachnode.net LLC