arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL 2005/2008/CE

Limiting the number of concurrent requests per domain

Top 75 Contributor
5 Posts
Alan posted on Wed, May 9 2012 11:44 AM

Just want to verify that this is correct:

Politeness.MaximumActiveHttpWebRequests is the value that specifies how many concurrent requests can be made to a host.

To enable this limiting, you have to enable the Concurrency crawl rule with a priority of 1 and a CrawlRuleType of 1 (PreRequest). That rule in turn just sets the Politeness.MaximumActiveHttpWebRequests property to its hard-coded value, which defaults to 3.

The only way to set the max active requests number is to change the hard coded value in the Concurrency class.

Thanks!

EDIT:

Not sure if this is working. I've got 5 sites I want to crawl, and I've added them to the crawl requests programmatically. When I run the spider with 10 threads, I see all 10 crawling pages from the same domain; however, only 4 are triggering the CrawlRequestThrottled event. The other 6 are making requests. I set MaximumActiveHttpWebRequests to 3, so I'm not sure why 6 threads are making requests on the same domain.

Also... What's the best way to make AN tie up only as many threads on any one domain as I've specified in MaximumActiveHttpWebRequests? What I'm seeing is that with MaximumActiveHttpWebRequests set to 3 and 10 threads going, more than 3 will either be making requests to the same domain or trying to and being throttled.

So if all 10 threads are trying to hit the same domain and the max is 3 then those other 7 are being wasted waiting on the throttle timeout.

All Replies

Top 10 Contributor
1,694 Posts

Sorry.  Didn't see this one until now...  I will answer in a few hours.

Thanks,
Mike 

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
1,694 Posts

Hmm... trying to repro.

Top 10 Contributor
1,694 Posts

What method are you using to determine which thread is accessing which domain?

Mike

Top 75 Contributor
5 Posts
Alan replied on Thu, May 10 2012 5:23 PM

I'm using a modified version of the GraphicalUserInterface project so I can see what each thread has done.

Now that I've looked at the code more, I see the Concurrency class only limits the total number of active requests across all threads. I had originally thought it was limiting on a per host basis. I'll make my own version of that class to limit by host.
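A per-host limit of the kind described above could be sketched roughly as follows. This is only an illustration under stated assumptions: PerHostRequestLimiter, TryEnter, and Exit are hypothetical names, not part of arachnode.net, and how such a class would be wired into a crawl rule is not shown here.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical helper (not an arachnode.net class): caps the number of
// in-flight requests per host instead of across all threads.
public sealed class PerHostRequestLimiter
{
    private readonly int _maximumActivePerHost;

    // One semaphore per host, created lazily the first time the host is seen.
    private readonly ConcurrentDictionary<string, SemaphoreSlim> _slots =
        new ConcurrentDictionary<string, SemaphoreSlim>(StringComparer.OrdinalIgnoreCase);

    public PerHostRequestLimiter(int maximumActivePerHost)
    {
        if (maximumActivePerHost < 1)
        {
            throw new ArgumentOutOfRangeException("maximumActivePerHost");
        }

        _maximumActivePerHost = maximumActivePerHost;
    }

    // Returns true if a slot for the host was taken. Returns false without
    // blocking when the host is already at its limit, so the caller can
    // treat the request as throttled and move on to other work.
    public bool TryEnter(Uri uri)
    {
        SemaphoreSlim slot = _slots.GetOrAdd(
            uri.Host,
            _ => new SemaphoreSlim(_maximumActivePerHost, _maximumActivePerHost));

        return slot.Wait(0);
    }

    // Must be called exactly once for every successful TryEnter.
    public void Exit(Uri uri)
    {
        SemaphoreSlim slot;
        if (_slots.TryGetValue(uri.Host, out slot))
        {
            slot.Release();
        }
    }
}
```

Because TryEnter does not block, a thread that is refused a slot is free to pick up a URL for another host rather than waiting on a throttle timeout.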

It would be cool if there was a way to set thread affinity to a particular host (and other parameters perhaps) so that a thread would only select URLs to request for that host or if there were none for that host, it would revert back to processing URLs from any host.

That way if you've reached your daily limit for requests to a particular host, you wouldn't have threads continue to process through URLs for it, only to see it's disallowed and add it back to the queue.

Is there a way I can make a plugin that overrides how AN queries the database to get the URLs to request for a thread or would that require changing the core code?

Top 10 Contributor
1,694 Posts

I'll take a look at the concurrency code for the issue you brought up.  It SHOULD be per domain...  [EDIT: I think I see the bug...]

You can change this.  Set the following to false:

    ApplicationSettings.CreateCrawlRequestsFromDatabaseCrawlRequests = true;
    ApplicationSettings.CreateCrawlRequestsFromDatabaseFiles = false;
    ApplicationSettings.CreateCrawlRequestsFromDatabaseHyperLinks = false;
    ApplicationSettings.CreateCrawlRequestsFromDatabaseImages = false;
    ApplicationSettings.CreateCrawlRequestsFromDatabaseWebPages = false;

Then, use an EngineAction to populate the Engine however you'd like.

Top 10 Contributor
1,694 Posts

BTW, the Politeness object is per-Host.  You can change specific values in the Concurrency.cs Plugin.

Mike

Top 10 Contributor
1,694 Posts

OK, you've found a good one.  It's funny how I will code a feature that works perfectly for one customer, and another will come along, use the feature in a slightly different way and discover a bug...

OK, you were probably seeing a few things.

1.) The Politeness object wasn't being set for CrawlRequests created to satisfy all of the Discoveries in a WebPage.  So, an Image request for a WebPage didn't count against the maximum concurrent counts.  Once a Crawl thread asked for additional CrawlRequests, then the throttling would be correct again.  I moved the code into PolitenessManager.cs, which fixes this issue.

2.) The Politeness object is PER HOST, not per DOMAIN.  So, if you are crawling tmz.com, media.tmz.com won't be included in the tmz.com counts.  Look in PolitenessManager.cs :: ManagePolitenesses(...); (first lines...)

Both items are fixed and checked in.
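The host versus domain distinction in item 2 can be illustrated with a small sketch. It is not the code in PolitenessManager.cs: PolitenessKeys, HostKey, and NaiveDomainKey are hypothetical names, and the domain key simply keeps the last two labels of the host name, which is an assumption that breaks for suffixes such as co.uk (a real implementation would need a public suffix list).

```csharp
using System;

// Hypothetical helpers (not arachnode.net code) showing two possible
// keys for grouping requests when throttling.
public static class PolitenessKeys
{
    // Per-host key: media.tmz.com and tmz.com are counted separately.
    public static string HostKey(Uri uri)
    {
        return uri.Host.ToLowerInvariant();
    }

    // Naive per-domain key: keeps only the last two labels of a DNS name,
    // so media.tmz.com and tmz.com share one key. Assumption: the
    // registrable domain is always two labels, which is wrong for
    // suffixes such as co.uk.
    public static string NaiveDomainKey(Uri uri)
    {
        string host = uri.Host.ToLowerInvariant();

        // Leave IP addresses and other non-DNS hosts untouched.
        if (uri.HostNameType != UriHostNameType.Dns)
        {
            return host;
        }

        string[] labels = host.Split('.');
        if (labels.Length <= 2)
        {
            return host;
        }

        return labels[labels.Length - 2] + "." + labels[labels.Length - 1];
    }
}
```

With the host key, a limit of 3 applies to media.tmz.com and tmz.com independently; with the domain key, both share the same count.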

I added a mod to the GUI that allows better visualization of the throttling.  Create one crawl request for one domain, then 30 seconds later, create 4 additional CRs for 4 separate domains.  Modify PolitenessManager.cs to use the DOMAIN for the Politeness key.  Set the Max Concurrent to '1'.  You'll see one of the 5 threads chewing on the first domain, and throttling will occur...  then, about a minute later you'll see the other threads pick up the additional domains.  (the Engine round-robins based on priority across all domains...)

Look at the 'CrawlRequests Processed/s' performance counter.  The values here should be roughly equal with and without throttling.

Thanks,
Mike

Top 10 Contributor
1,694 Posts

The affinity is effectively achieved (or the same net effect) by the shuffling that the engine does.  Each thread should receive a near-equal sampling of each domain's CRs.

This way, when you aren't throttling, one thread doesn't choke up on one slow domain.  When other threads run out of things to do, they will help each other and crawl what the other threads can't get to.
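The net effect of spreading work across hosts can be pictured with a simple round-robin interleave. This is only a sketch of the general idea, not the engine's actual shuffling: RoundRobinScheduler and InterleaveByHost are hypothetical names, and the sketch ignores the priorities the engine also takes into account.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not arachnode.net code): reorders URLs so that
// consecutive entries come from different hosts where possible.
public static class RoundRobinScheduler
{
    public static List<Uri> InterleaveByHost(IEnumerable<Uri> uris)
    {
        // One queue per host, in the order each host was first seen.
        List<Queue<Uri>> queues = uris
            .GroupBy(u => u.Host, StringComparer.OrdinalIgnoreCase)
            .Select(g => new Queue<Uri>(g))
            .ToList();

        var ordered = new List<Uri>();
        bool tookAny = true;

        // Take one URL from each host per pass until every queue is empty.
        while (tookAny)
        {
            tookAny = false;
            foreach (Queue<Uri> queue in queues)
            {
                if (queue.Count > 0)
                {
                    ordered.Add(queue.Dequeue());
                    tookAny = true;
                }
            }
        }

        return ordered;
    }
}
```

With an ordering like this, a run of URLs for one slow or throttled host does not monopolize the front of the work list, so threads tend to sample every host about equally.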

 

Top 75 Contributor
5 Posts
Alan replied on Fri, May 11 2012 3:59 PM

Awesome, thanks for the detailed response and the fix. I'll grab the update now and check it out. Appreciate the great support!

Top 10 Contributor
1,694 Posts

Of course!

It's what I am here for.  :)


copyright 2004-2013, arachnode.net LLC