Hi,
I have two questions about crawl scalability:
1. DNS lookup is generally the major bottleneck in crawling. To increase overall crawl throughput (URLs per second), how does arachnode address DNS lookup and caching speed? Are there any configuration options for parallelizing DNS requests or for DNS caching?
2. What web request politeness policy is followed, i.e. how many requests per second are sent to the same server or domain? Is it configurable, and can it be set on a per-server or per-domain basis? (Basically, I want to avoid getting the crawl machine's IP banned while crawling.)
Thanks
Munish
Munish:
1.) Your question is a bit perplexing to me. Once an A record or CNAME has been resolved to an IP address, the result is stored locally in your machine's DNS resolver cache. So, once you know where host 'xyz.com' is located, you do not need to query DNS again for that record until its TTL (Time to Live) expires. The cache TTL can also be configured per machine through Registry settings, and the cache can be inspected or flushed with ipconfig: http://www.google.com/search?q=dns+resolve+cache&rls=com.microsoft:en-us:IE-SearchBox&ie=UTF-8&oe=UTF-8&sourceid=ie7&rlz=1I7ADBF_en#sclient=psy&hl=en&rls=com.microsoft:en-us%3AIE-SearchBox&rlz=1I7ADBF_en&source=hp&q=dns+resolver+cache&aq=f&aqi=&aql=&oq=&gs_rfai=&pbx=1&fp=f53e9f76dc5c0a18 Perhaps you know something that I don't. Please tell me what you are thinking...
2.) There is a CrawlRule named 'Frequency.cs' that controls how often a site may be contacted. The setting is global and applies to all sites, but you could easily examine the code and extend it to limit the frequency on a per-domain or per-host basis.
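To illustrate the per-host idea from point 2, here is a minimal sketch of a politeness limiter. It is written in Python for brevity and is not arachnode.net code: the class HostFrequencyLimiter, its methods, and the interval values are all hypothetical names chosen for this example, and it assumes a single-threaded crawl loop.

```python
import time
from urllib.parse import urlsplit


class HostFrequencyLimiter:
    """Hypothetical per-host politeness limiter (not part of arachnode.net)."""

    def __init__(self, default_interval=1.0, per_host=None, clock=time.monotonic):
        # Minimum number of seconds between two requests to the same host.
        self.default_interval = default_interval
        # Optional overrides, e.g. {"slow.example": 10.0}.
        self.per_host = dict(per_host or {})
        # The clock is injectable so the logic can be tested without sleeping.
        self.clock = clock
        # host -> earliest time at which the next request is allowed.
        self._next_allowed = {}

    @staticmethod
    def _host(url):
        # urlsplit lowercases the hostname, so XYZ.com and xyz.com match.
        return urlsplit(url).hostname or ""

    def seconds_until_allowed(self, url):
        """Return how long the caller should wait before requesting url."""
        now = self.clock()
        next_time = self._next_allowed.get(self._host(url), now)
        return max(0.0, next_time - now)

    def record_request(self, url):
        """Note that a request to url's host is being made right now."""
        host = self._host(url)
        interval = self.per_host.get(host, self.default_interval)
        self._next_allowed[host] = self.clock() + interval
```

In a crawl loop you would call seconds_until_allowed(url) before each request, sleep (or requeue the URL) for that long, and then call record_request(url). A multi-threaded crawler would additionally need a lock around the dictionary.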
Thank you for your questions.
Mike
For best service when you require assistance:
Skype: arachnodedotnet