arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE, MongoDB, RavenDB, or Hadoop

Completely Open Source @ GitHub

crawl peer and database peer questions

Answered (Verified) This post has 1 verified answer | 9 Replies | 3 Followers

Top 25 Contributor
19 Posts
moormoon posted on Tue, Apr 28 2015 3:08 AM

Crawl peer and database peer are useful components for splitting crawling tasks. I have a few questions about them:

 

  1. To start multiple crawler peers, do I just need to start multiple crawling processes? It seems that running multiple crawling processes on the same machine will cause conflicts in the downloaded folders unless each process is given a different downloaded folder; is that true? If so, how do I set different downloaded folders without changing the configuration database, i.e., so the different processes still share the same configuration database?
  2. If they are running on different machines, do I just need to call crawlerPeers.Add(new CrawlerPeer(IPAddress.Parse("Paradise"), [port])) for each process?
  3. If I want all crawling processes to collaborate on crawling the same domain (e.g., yahoo.com), should I use a single database and point all crawling processes at it? The crawling processes will then communicate with each other's caches and avoid crawling duplicate web pages, right?
  4. Each crawler can have only one database peer, right? And ConnectionStrings.config is the only place to specify the database connection string, right?
  5. When should I use "DatabasePeer databasePeer = new DatabasePeer(ApplicationSettings.ConnectionString);"?

 

Thanks,

Alex

Verified Answer

Top 10 Contributor
1,905 Posts
Verified by arachnode.net

 

  1. One Crawl Process / One Domain Set / Discoveries From other Domains (Not/)Allowed : No potential conflicts.
    One Crawl Process / Multiple Domains Set / Discoveries From other Domains (Not/)Allowed : No potential conflicts.
    Multiple Crawl Processes / One Domain Set / Discoveries From other Domains Not Allowed : Potential (rare) conflicts.
    Multiple Crawl Processes / One Domain Set / Discoveries From other Domains Allowed : Potential (rare) conflicts.

    OK, so, if you are crawling the same domain set from multiple crawl processes, consider (in fact, do this) using one crawl process.  There is no benefit to using multiple crawl processes to crawl the same domain set.

    Why only rare conflicts?  Cross-process communications aren't synchronized (a global Mutex would have to be involved), and synchronizing them slows every process down tremendously.  As in every case involving thread and process synchronization, having to block all threads/processes beyond the traditional 'lock' statement essentially turns the whole system into a serialized procession.  The time taken to download a few duplicate pages is far eclipsed by the time it would take to synchronize everything.

    Multiple Crawl Processes / Multiple Domains Set / Discoveries From other Domains Not Allowed : Potential (rarer) conflicts.
    Multiple Crawl Processes / Multiple Domains Set / Discoveries From other Domains Allowed : Potential (rarer) conflicts.

    Because the crawling is spread across multiple domains, the likelihood of overlapping I/O decreases.

    DownloadedFolders can be set in ApplicationSettings.cs - read Console\Program.cs in its entirety; it answers all questions about how configuration may be set from code instead of from the database (the AssignApplicationSettingsFor* methods, specifically).  A sketch of a per-process configuration appears after this list.

    Remember, the Discoveries synchronize everything, so the Directories are accounted for.  If you want a common store for the Directories, use network shares or a DFS.
  2. Yes.  The purpose of CrawlerPeers is to facilitate cross-process (ideally, cross-machine) communication when using separate databases.

    (5 machines / Database A) - (5 machines / Database B) - by default, in all crawl configurations, each machine first checks its local RAM cache to see if a Discovery is present; if it is not, it checks the database.  If you have assigned CrawlerPeers, then before checking the database the process checks its local RAM cache for a CrawlerPeer-specific cache key; if no match is found, each CrawlerPeer is sent a copy of the cache key.
  3. Yes - if all crawling processes point to the same database then there is no need to use the CrawlerPeers.
  4. Database Peers are part of the Enterprise license and are not checked into the SVN source that you have access to under your Personal/Commercial license.
  5. Same as #4.
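
To make items 1 and 2 concrete, here is a minimal, hedged sketch of configuring one crawl process from code.  It assumes a downloaded-folder property on ApplicationSettings (the exact member name is in ApplicationSettings.cs) and the CrawlerPeer(IPAddress, port) shape quoted in the question; the peer addresses and port are hypothetical.

    using System.Collections.Generic;
    using System.Net;

    // Hedged sketch only - the property and collection names are assumptions to be
    // verified against ApplicationSettings.cs and Console\Program.cs
    // (AssignApplicationSettingsFor*); they are not copied from the AN source.

    // 1.) Give this process its own download location without touching the shared
    //     configuration database.  (Exact property name may differ.)
    ApplicationSettings.DownloadedFolder = @"C:\AN\Downloads\Process1";

    // 2.) Register the other crawl processes/machines as CrawlerPeers so their
    //     caches are consulted before falling back to the database.
    //     Lookup order per Discovery: local RAM cache -> CrawlerPeer cache key -> database.
    var crawlerPeers = new List<CrawlerPeer>
    {
        new CrawlerPeer(IPAddress.Parse("192.168.1.11"), 1234), // hypothetical peer/port
        new CrawlerPeer(IPAddress.Parse("192.168.1.12"), 1234)  // hypothetical peer/port
    };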

Some hints/tips:

1.) See what the limitations of crawling on a single machine are - there are positives/negatives to all things, and cross-process/machine communication is no exception.  Set a time limit for the crawl.  Use one crawl process with, say, 20 threads.  What is the WebPage count?

2.) Next, try 5 machines, 4 threads each.  Set a time limit for the crawl.  What is the WebPage count?

3.) Next, try 5 machines, 4 threads each, but use distinct/separate databases.  Set a time limit for the crawl.  What is the WebPage count?

4.) Realize that there are two mindsets in crawling: crawling and information retrieval.  The first stage, crawling, generates a nice list of links; the second, information retrieval, doesn't require you to worry about Discoveries or cross-process/machine communication.  As a point, envision what it would take for Google, while crawling, to inform every other machine involved in the crawl.  Google doesn't operate like this - they have a set of machines used for discovering new links and a set of machines used for revisiting - and there is little (no) need for the machines crawling yahoo.com to let the machines crawling arachnode.net know about yahoo.com's progress.  With that conveyed, look at EngineActions.  If you have one set of machines dedicated to crawling, you'll probably want to enable cross-process/machine communication.  For the 'information retrieval'/re-crawl stage, point GetCrawlRequests(...) at a list (DB) of CrawlRequests, SERIALIZE, and delete before ending the transaction - you won't need to worry about cross-process/machine synchronization, and you can actually turn off inserting Discoveries (ApplicationSettings.InsertDiscoveries).  Read Cache.cs, the whole thing.  A hedged sketch of the batch claim follows below.
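
Here is a minimal, hedged illustration of that "serialize and delete before ending the transaction" pattern, written with plain ADO.NET rather than AN's own GetCrawlRequests(...).  The CrawlRequestsQueue table and AbsoluteUri column are hypothetical stand-ins, not AN schema.

    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;

    // Hedged sketch of the batch-claim pattern from hint 4.  The table and column
    // names are hypothetical; the point is the isolation level and deleting the
    // claimed rows before commit so no other crawl process can re-claim them.
    static List<string> ClaimCrawlRequests(string connectionString, int batchSize)
    {
        var absoluteUris = new List<string>();

        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();

            // SERIALIZABLE keeps two crawl processes from claiming the same batch.
            using (var transaction = connection.BeginTransaction(IsolationLevel.Serializable))
            using (var command = new SqlCommand(
                @"DELETE TOP (@batchSize) FROM dbo.CrawlRequestsQueue
                  OUTPUT DELETED.AbsoluteUri;", connection, transaction))
            {
                command.Parameters.AddWithValue("@batchSize", batchSize);

                // The OUTPUT clause returns the claimed rows as a result set.
                using (var reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        absoluteUris.Add(reader.GetString(0));
                    }
                }

                transaction.Commit();
            }
        }

        return absoluteUris;
    }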

5.) No two crawling scenarios are exactly alike - AN does its best to be a one-size-fits-all crawling engine, but some thinking/planning does need to be applied to AN's technology when multiple machines are employed.

I will follow up with more information, but please let me know your results...

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies

Top 25 Contributor
19 Posts

Thanks for the detailed explanation! I will try the single/multiple process cases and let you know.

One more question: to increase the number of threads, should I just modify "MaximumNumberOfCrawlThreads", or do I need to add corresponding "127.0.0.1" entries to ProxyServers.txt?

Top 10 Contributor
1,905 Posts

Of course!

Either works.
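
For the first option, a quick hedged sketch - this assumes the setting is exposed on ApplicationSettings like the other settings mentioned in this thread (ConnectionString, InsertDiscoveries); verify the exact member in ApplicationSettings.cs:

    // Hedged sketch: the property location is an assumption - confirm it in
    // ApplicationSettings.cs before relying on it.
    ApplicationSettings.MaximumNumberOfCrawlThreads = 20;

    // Alternatively, add "127.0.0.1" entries to ProxyServers.txt, as described above.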

I am writing up some instructions with screenshots using a virtual machine to simulate cross-machine caching.

Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 25 Contributor
23 Posts
JCrawl replied on Tue, May 12 2015 3:54 AM

Did you finish the documents for setting up on multiple machines? Is there any information I can have on setting this up?

Thank you

 

Top 10 Contributor
1,905 Posts

I dug into the code (I use something a bit different for my scale-out crawl scenarios) and decided to polish up a few things...

I am working on checking in what I have - this new code addresses a LOT of conditions...

  • AlwaysOn Crawlers.
  • Crawlers joining a Crawl.
  • Shuttling CrawlRequests to other Crawlers when a Crawler is shut down using CTRL+C.
  • Consuming TCP/IP buffers so a machine can join a Crawl even if it wasn't explicitly listening.

Look for something, tomorrow...  (there are still a couple of bugs/conditions to solve...)
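
In the meantime, here is a rough, hedged sketch of what the CTRL+C shuttling above could look like inside a crawl process; the pending-work queue and ForwardToPeer helper are hypothetical placeholders, not arachnode.net API.

    using System;
    using System.Collections.Concurrent;

    // Hedged sketch only: PendingAbsoluteUris and ForwardToPeer are hypothetical
    // stand-ins for whatever the engine exposes; they are not AN API.
    class ShutdownShuttle
    {
        // Work this process has claimed but not yet completed.
        static readonly ConcurrentQueue<string> PendingAbsoluteUris = new ConcurrentQueue<string>();

        static void Main()
        {
            Console.CancelKeyPress += (sender, e) =>
            {
                e.Cancel = true; // keep the process alive long enough to hand off work

                // Drain the local queue and hand each item to another crawler
                // (or write it back to the CrawlRequests table) before exiting.
                while (PendingAbsoluteUris.TryDequeue(out string absoluteUri))
                {
                    ForwardToPeer(absoluteUri);
                }

                Environment.Exit(0);
            };

            // ... the normal crawl loop would run here ...
        }

        static void ForwardToPeer(string absoluteUri)
        {
            // Placeholder: a real implementation would use the CrawlerPeer channel
            // or re-insert the request into the database.
            Console.WriteLine("Re-queueing " + absoluteUri + " for another crawler...");
        }
    }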

Thanks,
Mike 

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
1,905 Posts

Just one more condition to solve, so it seems...  perhaps tomorrow is the day.

Thanks,
Mike 

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 25 Contributor
23 Posts
JCrawl replied on Thu, May 14 2015 9:51 PM

Looking forward to the info. Thank you so much for all the hard work.

 

Top 25 Contributor
23 Posts

Hey Mike 

Did the documentation ever get finished?

 

Thanks

Top 10 Contributor
1,905 Posts

Yes, and this code is available under the Commercial License.

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet


copyright 2004-2017, arachnode.net LLC