arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL 2005/2008/CE

About scalability


munishgoyal posted on Fri, Oct 15 2010 2:35 AM

 

Hi,

I have two questions about crawl scalability:

1. DNS lookup is generally the major bottleneck in crawling. To increase overall crawl throughput (URLs per second), how does arachnode.net speed up DNS lookups and caching? Are there any configuration options for parallelizing DNS requests or for DNS caching?

2. What web request politeness policy is followed, i.e. how many requests per second are sent to the same server or domain? Is it configurable on a per-server or per-domain basis? (I want to avoid getting the crawl machine's IP address banned while crawling.)

Thanks,

Munish

All Replies


Munish:

1.) Your question is a bit perplexing to me.  Once an A record or CNAME record has been resolved to an IP address, the result is stored locally in your DNS resolver cache.  So, once you know where host 'xyz.com' is located, you do not need to query DNS again for this record until the cached entry expires.  The resolver cache can be inspected and flushed with IPConfig, and how long entries are kept (the TTL, Time to Live) can be adjusted per machine through Registry settings: http://www.google.com/search?q=dns+resolve+cache&rls=com.microsoft:en-us:IE-SearchBox&ie=UTF-8&oe=UTF-8&sourceid=ie7&rlz=1I7ADBF_en#sclient=psy&hl=en&rls=com.microsoft:en-us%3AIE-SearchBox&rlz=1I7ADBF_en&source=hp&q=dns+resolver+cache&aq=f&aqi=&aql=&oq=&gs_rfai=&pbx=1&fp=f53e9f76dc5c0a18  Perhaps you know something that I don't.  Please tell me what you are thinking...
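
To illustrate the caching behaviour described in 1.), here is a minimal sketch in Python. It is not arachnode.net code: the class name `CachingResolver`, the fixed `ttl_seconds` value, and the injectable `lookup` and `clock` parameters are all assumptions made for the example. A real resolver cache honours the TTL carried by each DNS record rather than one fixed value.

```python
import socket
import time


class CachingResolver:
    """Hypothetical in-process DNS cache; not part of arachnode.net."""

    def __init__(self, ttl_seconds=300.0, lookup=socket.gethostbyname,
                 clock=time.monotonic):
        # Assumption: one fixed TTL for every host. Real DNS records
        # carry their own TTL values.
        self.ttl_seconds = ttl_seconds
        self._lookup = lookup      # function that performs the real DNS query
        self._clock = clock        # injectable so expiry can be tested
        self._cache = {}           # host -> (ip_address, expiry_time)

    def resolve(self, host):
        now = self._clock()
        entry = self._cache.get(host)
        if entry is not None and entry[1] > now:
            return entry[0]        # cache hit: no DNS query is made
        ip_address = self._lookup(host)  # cache miss or expired entry
        self._cache[host] = (ip_address, now + self.ttl_seconds)
        return ip_address
```

Calling `CachingResolver().resolve("xyz.com")` twice in a row would query DNS only on the first call; the second call is answered from the dictionary until the entry expires.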

2.) There is a CrawlRule entitled 'Frequency.cs' that controls how often a site may be contacted.  The setting is global to all sites, but you could easily examine the code and extend it to limit the frequency on a per-domain or per-host basis.
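
As a rough idea of what a per-host limit could look like, here is a minimal sketch in Python. It does not reproduce the logic in Frequency.cs, and every name in it (`PerHostPoliteness`, `set_delay`, `seconds_to_wait`, the one-second default) is hypothetical. It only shows the general technique: remember, per host, the earliest time the next request is allowed.

```python
import time
from urllib.parse import urlsplit


class PerHostPoliteness:
    """Hypothetical per-host request spacing; not the logic in Frequency.cs."""

    def __init__(self, min_delay_seconds=1.0, clock=time.monotonic):
        self.default_delay = min_delay_seconds  # assumed default spacing
        self._clock = clock                     # injectable for testing
        self._delays = {}        # host -> per-host override in seconds
        self._next_allowed = {}  # host -> earliest time of the next request

    def set_delay(self, host, seconds):
        """Override the minimum delay for one host."""
        self._delays[host.lower()] = seconds

    def seconds_to_wait(self, url):
        """Return 0.0 and reserve the slot if url may be fetched now,
        otherwise return the remaining wait in seconds."""
        host = (urlsplit(url).hostname or "").lower()
        now = self._clock()
        ready_at = self._next_allowed.get(host, now)
        if ready_at > now:
            return ready_at - now
        delay = self._delays.get(host, self.default_delay)
        self._next_allowed[host] = now + delay
        return 0.0
```

A crawl loop would call `seconds_to_wait(url)` before each request and either fetch immediately on `0.0` or push the URL back onto the queue for later. Keying on the host name rather than on a global timer is what keeps one slow site from throttling the whole crawl.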

Thank you for your questions.

Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet


copyright 2004-2013, arachnode.net LLC