Performance Benchmark, etc.

Yesterday I ran a performance test, starting a crawl at http://msn.com with a Depth of 4, using 30 threads.  (A Depth of 4 means process and store all content from the first page, follow all links from that page, and repeat for 4 more levels of pages.)
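For readers new to the project, here is a minimal, single-threaded sketch of what a depth-limited crawl like that does conceptually.  It is not the arachnode.net implementation (the real crawl uses 30 threads, SQL Server storage, and Lucene.NET indexing); the class and member names are illustrative only.

using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

class DepthLimitedCrawlSketch
{
    static async Task Main()
    {
        var client = new HttpClient();
        var seen = new HashSet<string>();
        var queue = new Queue<(string Url, int Depth)>();
        queue.Enqueue(("http://msn.com", 0));   // level 0: the seed page

        while (queue.Count > 0)
        {
            var (url, depth) = queue.Dequeue();
            if (!seen.Add(url) || depth > 4)     // a Depth of 4 = the seed page plus 4 more levels
                continue;

            string html;
            try { html = await client.GetStringAsync(url); }
            catch { continue; }                  // skip pages that fail to download

            // "Process and store all content" would happen here (database, Lucene.NET index, ...).
            Console.WriteLine($"[{depth}] {url} ({html.Length} bytes)");

            // Follow every absolute link on the page and queue it one level deeper.
            foreach (Match match in Regex.Matches(html, "href=\"(https?://[^\"]+)\""))
                queue.Enqueue((match.Groups[1].Value, depth + 1));
        }
    }
}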

In just 18 hours, arachnode.net gathered 28GB of data and crawled 225,000 complete WebPages.

This equates to roughly 3.5 complete pages processed and stored per second (225,000 pages / 64,800 seconds ≈ 3.47), continuously, using only a gigabyte of RAM for Discovery caching.

There were approximately 800,000 CrawlRequests left to process when I stopped the Crawl.

As an interesting comparison, I started a crawl with nutch at http://msn.com with a Depth of 4, and nutch returned 320 WebPages in about 5 minutes.  I know that nutch performs Url filtering, but I can't imagine that it could filter 1,000,000 possible Uris down to 320 in just a few minutes... especially without actually crawling to find the Uris.  It's as though nutch picked a very small subset of Uris to crawl from the first page and called it good.

On the flip side of this observation, perhaps I should extend arachnode.net's Uri filtering beyond Uris with named anchors and repeating patterns (/images/image/images/, etc.).  Does arachnode.net need to crawl Uris that have '?' in their Address?
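For the record, here is a rough sketch of the kind of Uri filter described above.  The rules and names are my own illustration of the idea, not arachnode.net's actual filtering code.

using System;
using System.Text.RegularExpressions;

static class UriFilterSketch
{
    // Catches a repeating path segment, e.g. /images/image/images/ or /a/a/.
    private static readonly Regex RepeatingSegments =
        new Regex(@"/([^/]+)/(?:[^/]+/)?\1/", RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static bool ShouldCrawl(Uri uri, bool allowQueryStrings)
    {
        if (!string.IsNullOrEmpty(uri.Fragment))                      // named anchors (#section)
            return false;

        if (RepeatingSegments.IsMatch(uri.AbsolutePath))              // repeating path patterns
            return false;

        if (!allowQueryStrings && !string.IsNullOrEmpty(uri.Query))   // '?' in the Address
            return false;

        return true;
    }
}

// e.g. ShouldCrawl(new Uri("http://example.com/images/image/images/a.gif"), false) returns false.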

As this post is tagged with 'performance', this tip is worth sharing: set the Recovery Model of the database to 'Simple'.
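For anyone applying that tip programmatically, something like the following works.  The database name (arachnode_net here) and the connection string are assumptions about your own installation; a one-line ALTER DATABASE in Management Studio does the same job.

using System.Data.SqlClient;

class SetSimpleRecoveryModel
{
    static void Main()
    {
        using (var connection = new SqlConnection("Server=.;Database=master;Integrated Security=true"))
        using (var command = new SqlCommand("ALTER DATABASE [arachnode_net] SET RECOVERY SIMPLE;", connection))
        {
            connection.Open();
            // Simple recovery keeps the transaction log from ballooning under heavy crawl inserts.
            command.ExecuteNonQuery();
        }
    }
}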


Posted Wed, Dec 31 2008 11:36 AM by arachnode.net

Comments

polfilm wrote re: Performance Benchmark, etc.
on Wed, Jan 7 2009 9:36 AM

I've just finished my 6-hour, depth-4, 30-thread run of 500 level-0 links and gathered 28,000 pages and 350,000 hyperlinks. Smoooooooking !!!

arachnode.net wrote re: Performance Benchmark, etc.
on Wed, Jan 7 2009 11:05 AM

The default installation of arachnode.net has the ManageLuceneDotNetIndexes.cs CrawlAction enabled.  Turn this off and you should see the collection rate triple, or thereabouts.

Of course, all of your pages won't be searchable from the Web project... :D (but arachnode.net will crawl a lot faster...)

It looks like your crawl rate is about 1.29 pages a second.  Not too bad with Lucene running.  (This is the slowest part of the application...)

If you only crawl WebPages, turning off all parsing, you should see the WebPages Inserted/s Performance Counter burst above the number of threads once in a while.  If you were crawling only WebPages with 30 threads, you should expect to see around 18 WebPages inserted per second with an occasional burst to 45 per second or so.
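If you want to watch that counter from code instead of Performance Monitor, a sketch like this one works.  The category name ("arachnode.net") is an assumption on my part; check perfmon for the exact category and counter names in your installation.

using System;
using System.Diagnostics;
using System.Threading;

class WatchWebPagesInserted
{
    static void Main()
    {
        using (var counter = new PerformanceCounter("arachnode.net", "WebPages Inserted/s", readOnly: true))
        {
            while (true)
            {
                // Rate counters need at least a second between samples to report a meaningful value.
                Thread.Sleep(1000);
                Console.WriteLine("{0:T}  {1:F1} WebPages inserted/s", DateTime.Now, counter.NextValue());
            }
        }
    }
}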

I should also note that my test machine uses 15K U320 disks, which are the biggest limiting factor for performance... just two in a mirror, though.

Thanks!
