Crawl peer and database peer are useful components for splitting crawling tasks. I have a few questions about them:
1.) See what the limitations of crawling on a single machine are - there are positives/negatives in all things in life, and cross-process/machine communication has its positives/negatives as well. Set a time limit for the crawl. Use one crawl process with, say, 20 threads. What is the WebPage count?
2.) Next, try 5 machines, 4 threads each. Set a time limit for the crawl. What is the WebPage count?
3.) Next, try 5 machines, 4 threads each, but use distinct/separate databases. Set a time limit for the crawl. What is the WebPage count?
4.) Realize there are two mindsets in crawling: crawling (discovery) and information retrieval. The first stage, crawling, generates a nice list of links; the second, information retrieval, doesn't need to worry about Discoveries or cross-process/machine communication. As a point, envision what it would take for Google, when crawling, to inform every other machine involved in the crawl. Google doesn't operate like this - they have one set of machines used for discovering new links and another set used for revisiting; there is little (no) need for the machines crawling yahoo.com to let the machines crawling arachnode.net know of yahoo.com's progress. So, with this conveyed, look at EngineActions. If you have one set of machines dedicated to crawling, then you'll probably want to enable cross-process/machine communication for that set. For the 'information retrieval'/re-crawl machines, point GetCrawlRequests(...) at a list (DB) of CrawlRequests, serialize, and delete before ending the transaction - you won't need to worry about cross-process/machine synchronization, and you can actually turn off inserting Discoveries (ApplicationSettings.InsertDiscoveries). Read Cache.cs - the whole thing.
5.) No two crawling scenarios are exactly alike - AN does its best to be a one-size-fits-all crawling engine, but some thinking/planning does need to be applied when multiple machines are employed.
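To make the results of experiments 1-3 comparable, each crawl needs a fixed time budget and a fixed thread count. This is not AN's actual engine (which is C#); it's a minimal Python sketch of a time-boxed, multi-threaded crawl that returns the page count, where `fetch`, `timed_crawl`, and the seed URL are all hypothetical names:

```python
import queue
import threading
import time

def timed_crawl(seed_urls, fetch, thread_count, time_limit_seconds):
    """Crawl with a fixed number of threads until the deadline passes;
    return how many pages were fetched (the 'WebPage count')."""
    frontier = queue.Queue()
    for url in seed_urls:
        frontier.put(url)
    seen = set(seed_urls)
    lock = threading.Lock()
    pages_crawled = [0]
    deadline = time.monotonic() + time_limit_seconds

    def worker():
        while time.monotonic() < deadline:
            try:
                url = frontier.get(timeout=0.05)
            except queue.Empty:
                return  # frontier drained - nothing left to do
            links = fetch(url)  # fetch the page, return discovered links
            with lock:
                pages_crawled[0] += 1
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        frontier.put(link)

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return pages_crawled[0]
```

Running the same `timed_crawl` with 20 threads on one process, then with 4 threads on each of 5 machines, gives directly comparable WebPage counts for questions 1 and 2.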
I will follow up with more information, but please let me know your results...
Thanks for the detailed explanation! I will try the single/multiple process cases and let you know.
One more question: to increase the number of threads, should I just modify "MaximumNumberOfCrawlThreads", or do I need to add corresponding "127.0.0.1" entries to ProxyServers.txt?
I am writing up some instructions with screenshots using a virtual machine to simulate cross-machine caching.
Did you finish the documents for setting up on multiple machines? Is there any information I can have on setting up multiple machines?
I dug into the code (I use something a bit different for my scale-out crawl scenarios) and decided to polish up a few things...
I am working on checking in what I have - this new code addresses a LOT of conditions...
Look for something tomorrow... (there are still a couple of bugs/conditions to solve...)
Just one more condition to solve, so it seems... perhaps tomorrow is the day.
Looking forward to the info. Thank you so much for all the hard work.
Did the documentation ever get finished?
Yes, and this code is available with the Commercial License.