arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL 2005/2008/CE

Performance tuning the crawler

Nemesis posted on Tue, Sep 14 2010 3:05 PM

I have a very specific implementation of arachnode.net: I'm only using the crawler portion of the solution, and I only need the data from the WebPages table. Given this highly specialized usage, how would you suggest I tune / configure arachnode.net for the best performance? Thanks.

Verified Answer

Verified by arachnode.net

These are my suggestions:

ApplicationSettings.AssignEmailAddressDiscoveries = false; //turn off.  the slowest of the regular expressions.
ApplicationSettings.AssignFileAndImageDiscoveries = false; //turn off if you aren't using them.
ApplicationSettings.AssignHyperLinkDiscoveries = true;
ApplicationSettings.ClassifyAbsoluteUris = false; //turn off if you do not plan to use the reporting stored procedures.
ApplicationSettings.DesiredMaximumMemoryUsageInMegabytes = 512; //the higher, the better.  just be sure that you don't run AN into virtual memory territory.  allow room for SQL and AN to use actual RAM.
ApplicationSettings.EnableConsoleOutput = false; //as with all console output, this slows crawling.
ApplicationSettings.ExtractImageMetaData = false; //turn off if you aren't using it.
ApplicationSettings.ExtractWebPageMetaData = false; //one of the slower functions.  ensure this is turned off if not in use.
ApplicationSettings.HttpWebRequestRetries = 5; //the number of times to retry pages that a web server won't serve you because you are crawling too fast.  higher settings = more accurate crawls, but at the expense of time.
ApplicationSettings.InsertDisallowedAbsoluteUriDiscoveries = false; //once you are comfortable with your crawl routine, you no longer need to log these.  you need to be extremely comfortable with the data you collect, and the data you expect to collect, to turn this off, else your crawl may be confusing to you.
ApplicationSettings.InsertDisallowedAbsoluteUris = true; //this isn't needed unless you plan to execute the reporting stored procedures.
ApplicationSettings.InsertExceptions = true; //once you are comfortable with your crawl routine, you no longer need to log exceptions.  you need to be extremely comfortable with the data you collect, and the data you expect to collect, to turn this off, else your crawl may be confusing to you.
ApplicationSettings.MaximumNumberOfCrawlThreads = 10; //as many as your DISK can handle.
ApplicationSettings.OutputConsoleToLogs = false;

 

Most likely, the bottleneck in your system is your disk.  Striped volumes and/or SSDs are HIGHLY recommended.  Of course, whether your disk can keep up depends on the options you set, whether you cache to disk or not, and how fast the sites you crawl respond.  Compare the bytes downloaded per second against the transfer rate of your disk to get a better idea of how much disk capacity you have left.
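As a rough illustration of that comparison, here is a minimal sketch. The class and method names are hypothetical and not part of arachnode.net, and both rates are numbers you would have to measure yourself (for example with Performance Monitor).

```csharp
using System;

// Hypothetical helper, not part of arachnode.net.
public static class DiskHeadroom
{
    // Returns the fraction of the disk's sustained transfer rate that the
    // downloaded content alone would consume if all of it were written to disk.
    public static double Utilization(double downloadedBytesPerSecond, double diskBytesPerSecond)
    {
        if (diskBytesPerSecond <= 0)
        {
            throw new ArgumentOutOfRangeException("diskBytesPerSecond");
        }

        return downloadedBytesPerSecond / diskBytesPerSecond;
    }
}
```

For example, DiskHeadroom.Utilization(12e6, 80e6) gives 0.15: a crawl downloading 12 MB/s onto a disk that sustains 80 MB/s uses about 15% of the disk for the raw content. The database likely adds write overhead of its own, so treat the ratio as a floor on the real disk load rather than an exact figure.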

The latest version of the code changed the way in which Discoveries and CrawlRequests are cached.  Previously, items were cached to RAM and then to disk, and everything cached in RAM stayed there permanently.  Now, Discoveries are released from RAM when unused.

You will find the best performance if you can keep all of your Discoveries and CrawlRequests in RAM.

To do so, and replicate the previous version's caching, take advantage of these switches.

public static int DiscoverySlidingExpirationInSeconds { get; set; } //higher values keep Discoveries in RAM longer.
public static bool InsertDiscoveries { get; set; } //do, or do not, cache Discoveries to disk.  you will need enough RAM to store your Discoveries or you may crawl in a loop.  not (much of) a problem if you crawl on timed cycles.
public static bool InsertCrawlRequests { get; set; } //when you are out of RAM, cache CrawlRequests to disk, or not.  if you don't care to cache CRs to disk, you can improve performance, but at the expense of crawl accuracy.
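One possible combination of the three switches is sketched below. It assumes they live on ApplicationSettings like the settings earlier in this post (the declarations above don't show the containing class), and the values are illustrative only, not tested recommendations.

```csharp
// Configuration fragment; assumes these properties are members of arachnode.net's
// ApplicationSettings class, which the declarations above do not confirm.

// Illustrative value: keep unused Discoveries in RAM for up to a day.
ApplicationSettings.DiscoverySlidingExpirationInSeconds = 86400;

// Leave disk caching on as a safety net.  Setting these to false skips the
// disk writes, at the cost of the looping and accuracy risks described above.
ApplicationSettings.InsertDiscoveries = true;
ApplicationSettings.InsertCrawlRequests = true;
```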

AN will only cache CrawlRequests to disk if less than 1/2 of the RAM requested is available.  This can be changed in MemoryManager.
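The half-of-RAM rule can be sketched as follows. This is only one reading of the sentence above, not the actual MemoryManager code; the class, method, and parameter names are hypothetical.

```csharp
// Hypothetical sketch of the threshold described above; not arachnode.net's MemoryManager.
public static class CrawlRequestSpillRule
{
    // desiredMaximumMegabytes corresponds to DesiredMaximumMemoryUsageInMegabytes.
    // usedMegabytes is the memory the crawler currently uses.
    public static bool ShouldCacheToDisk(long desiredMaximumMegabytes, long usedMegabytes)
    {
        long availableMegabytes = desiredMaximumMegabytes - usedMegabytes;

        // Cache to disk only when less than half of the requested RAM is still available.
        return availableMegabytes * 2 < desiredMaximumMegabytes;
    }
}
```

With the 512 MB setting from earlier, a crawler using 300 MB has 212 MB available, which is under the 256 MB halfway mark, so CrawlRequests would go to disk. At 200 MB used, 312 MB remains and they would stay in RAM.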

How does your config compare to the one above?

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies

Nemesis replied on Wed, Sep 15 2010 12:02 AM

Thanks - I'll update and run with these and see how things go.  My settings were different (as shown below) so I believe these will help a lot.

My Settings:

  • AssignEmailAddressDiscoveries: true - changed to false
  • AssignFileAndImageDiscoveries: true - changed to false
  • AssignHyperLinkDiscoveries: true
  • ClassifyAbsoluteUris: true - changed to false
  • DesiredMaximumMemoryUsageInMegabytes: 1024 - not running into virtual memory territory yet
  • EnableConsoleOutput: false
  • ExtractImageMetaData: false
  • ExtractWebPageMetaData: false
  • HttpWebRequestRetries: 5
  • InsertDisallowedAbsoluteUriDiscoveries: true - will change to false in testing
  • InsertDisallowedAbsoluteUris: true - will change to false in testing
  • InsertExceptions: true - will change to false in testing
  • MaximumNumberOfCrawlThreads: 10
  • OutputConsoleToLogs: false

I'll run with these and keep an eye on disk performance... 

copyright 2004-2013, arachnode.net LLC