arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL 2005/2008/CE

Multiple Instances


Top 200 Contributor
1 Posts
FromEmail posted on Mon, Jun 4 2012 9:34 AM

On a single machine, inside a Windows service, should we create multiple AN Crawler instances to crawl multiple websites? Suppose we want to crawl 10 big websites (Amazon, Bestbuy, ...) and also control each website separately (e.g. start, stop, and pause). What is the best way to do this?

All Replies

Top 10 Contributor
1,694 Posts

Are there 10, or some low number, to be exact? The easiest way to manage a low number of separate websites is to use distinct instances/databases. There isn't a facility to support start/stop/pause for domains within AN now, and while you could accomplish this via a plugin, management of the data is much easier if it is separated from the start.

Look at ApplicationSettings.UniqueIdentifier. As each Crawler instance will share the ASP.Net cache, ensure that each instance has a unique key so that the crawlers don't crawl the same content.

If you set RestrictCrawlTo.Host(/Domain) for the CrawlRequests, then the Crawlers won't overlap in their Discoveries. If you don't plan to allow non-Domain content (for example, an image on the BestBuy site originating from Amazon.com; see RestrictDiscoveriesTo.Host(/Domain)), then you don't need to set the ApplicationSettings.UniqueIdentifier.

The easiest way to know whether you'll need to set the ApplicationSettings.UniqueIdentifier is to examine your data in the BestBuy.com instance: if you see content from any other domain, then, yes, you'll need to set it.

Make sense?

Mike
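To make the "one instance per website" idea concrete, here is a minimal, language-neutral sketch in Python. It is not arachnode.net code and does not use its API: `SiteCrawler`, `CrawlerManager`, `offer`, and the database naming scheme are all hypothetical names invented for illustration. It only models the three points made above, under those assumptions: each site gets its own instance and database, each instance carries a unique key, and each instance rejects URLs outside its own domain. The suffix check is a simplification; real domain matching would need more care.

```python
from dataclasses import dataclass, field
from enum import Enum
from urllib.parse import urlsplit


class State(Enum):
    STOPPED = "stopped"
    RUNNING = "running"
    PAUSED = "paused"


@dataclass
class SiteCrawler:
    """Hypothetical stand-in for one crawler instance with its own database."""
    domain: str             # the only domain this instance may crawl
    unique_identifier: str  # per-instance key, so shared-cache entries don't collide
    database: str           # each instance writes to a separate database
    state: State = State.STOPPED
    discovered: list = field(default_factory=list)

    def start(self):
        # Starts a stopped instance or resumes a paused one.
        self.state = State.RUNNING

    def pause(self):
        # Only a running instance can be paused.
        if self.state is State.RUNNING:
            self.state = State.PAUSED

    def stop(self):
        self.state = State.STOPPED

    def allows(self, url):
        """Domain restriction: accept the domain itself and its subdomains."""
        hostname = (urlsplit(url).hostname or "").lower()
        return hostname == self.domain or hostname.endswith("." + self.domain)

    def offer(self, url):
        """Record a URL only if this instance is running and the URL is in scope."""
        if self.state is State.RUNNING and self.allows(url):
            self.discovered.append(url)
            return True
        return False


class CrawlerManager:
    """Keeps one independently controllable instance per website."""

    def __init__(self):
        self._crawlers = {}

    def add(self, domain):
        key = domain.lower()
        if key in self._crawlers:
            raise ValueError(f"an instance for {key} already exists")
        crawler = SiteCrawler(
            domain=key,
            unique_identifier=f"crawler:{key}",
            database=f"crawl_{key.replace('.', '_')}",
        )
        self._crawlers[key] = crawler
        return crawler

    def get(self, domain):
        return self._crawlers[domain.lower()]


manager = CrawlerManager()
amazon = manager.add("amazon.com")
bestbuy = manager.add("bestbuy.com")

amazon.start()
bestbuy.start()
bestbuy.pause()  # pausing one site leaves the other running

amazon.offer("http://www.amazon.com/books")  # accepted: same domain, running
amazon.offer("http://www.bestbuy.com/tv")    # rejected: foreign domain
bestbuy.offer("http://www.bestbuy.com/tv")   # rejected: instance is paused
```

Because every instance owns its state, key, and database, start/stop/pause per website falls out of the structure without any cross-site bookkeeping, which is the main argument for separating the data from the start.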

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet


copyright 2004-2013, arachnode.net LLC