I have a very specific implementation of Arachnode.net in that I'm only using the crawler portion of the solution and I only need the data from the WebPages table. Given this highly specialized usage how would you suggest I tune / configure Arachnode for the best performance? Thanks.
These are my suggestions:
ApplicationSettings.DesiredMaximumMemoryUsageInMegabytes = 512; // The higher, the better. Just be sure that you don't run AN into virtual memory territory. Allow room for both SQL and AN to use actual RAM.
ApplicationSettings.ExtractImageMetaData = false; // Turn off if you aren't using image metadata.
ApplicationSettings.ExtractWebPageMetaData = false; // One of the slower functions. Ensure this is turned off if not in use.
ApplicationSettings.HttpWebRequestRetries = 5; // The number of times to retry pages that a web server won't serve you because you are crawling too fast. Higher settings = more accurate crawls, but at the expense of time.
Most likely, the bottleneck in your system is your disk. Stripes and/or SSDs are HIGHLY recommended. Of course, whether your disk can keep up depends on the options you set, whether you cache to disk or not, and how fast the sites you crawl respond. Check the bytes downloaded per second against the transfer rate of your disk to get a better idea of how much headroom your disk has.
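The comparison above is simple arithmetic. Here is a minimal sketch of it, assuming you have already measured both numbers yourself. The class and method names (DiskHeadroom, Utilization) and the sample figures are hypothetical and not part of AN.

```csharp
using System;

public static class DiskHeadroom
{
    // Fraction of the disk's sustained transfer rate that the crawl's
    // download rate would consume if every downloaded byte were written once.
    public static double Utilization(double downloadedBytesPerSecond, double diskBytesPerSecond)
    {
        if (diskBytesPerSecond <= 0)
            throw new ArgumentOutOfRangeException(nameof(diskBytesPerSecond));
        return downloadedBytesPerSecond / diskBytesPerSecond;
    }

    public static void Main()
    {
        // Hypothetical measurements: 12 MB/s downloaded, 80 MB/s sustained disk rate.
        double crawlBytesPerSecond = 12.0 * 1024 * 1024;
        double diskBytesPerSecond = 80.0 * 1024 * 1024;

        double utilization = Utilization(crawlBytesPerSecond, diskBytesPerSecond);
        Console.WriteLine("Disk utilization from downloads alone: " + utilization.ToString("P0"));
    }
}
```

A value approaching 1.0 suggests the disk is likely to be the limiting factor. Treat the result as a rough lower bound, since the database will probably write more than just the raw downloaded bytes.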
The latest version of the code changed the way in which Discoveries and CrawlRequests are cached. Rather than caching to RAM and then to disk, with all cached items remaining in RAM permanently, Discoveries are now released from RAM when unused.
You will find the best performance if you can keep all of your Discoveries and CrawlRequests in RAM.
To do so, and replicate the previous version's caching, take advantage of these switches.
public static int DiscoverySlidingExpirationInSeconds { get; set; } // Higher values keep Discoveries in RAM longer.
public static bool InsertDiscoveries { get; set; } // Cache Discoveries to disk, or not. If you don't, you will need enough RAM to store your Discoveries or you may crawl in a loop. Not (much of) a problem if you crawl on timed cycles.
public static bool InsertCrawlRequests { get; set; } // When you are out of RAM, cache CrawlRequests to disk, or not. If you don't care to cache CRs to disk, you can improve performance, but at the expense of crawl accuracy.
AN will only cache CrawlRequests to disk if less than 1/2 of the RAM requested is available. This can be changed in MemoryManager.
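As a rough illustration, the switches above might be set like this to favor keeping everything in RAM. This is a sketch only: it assumes the properties live on ApplicationSettings like the settings earlier in this post, and the values are made up for the example, not recommended defaults.

```csharp
// Assumption: these switches are exposed on ApplicationSettings.
// All values below are illustrative only.

// Keep Discoveries in RAM for a long time (here, one day) before they are released.
ApplicationSettings.DiscoverySlidingExpirationInSeconds = 86400;

// Do not cache Discoveries to disk. Only safe if RAM can hold all of your
// Discoveries; otherwise you may crawl in a loop.
ApplicationSettings.InsertDiscoveries = false;

// Do not cache CrawlRequests to disk when RAM runs low.
// Faster, but at the expense of crawl accuracy.
ApplicationSettings.InsertCrawlRequests = false;
```

If RAM is tight, leaving both Insert switches set to true trades some speed for a more accurate crawl.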
How does your config compare to the one above?
For best service when you require assistance:
Skype: arachnodedotnet
Thanks - I'll update and run with these and see how things go. My settings were different (as shown below) so I believe these will help a lot.
My Settings:
I'll run with these and keep an eye on disk performance...