arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE, or MongoDB/RavenDB/Hadoop


Question on Commercial and developer license

munishgoyal1 posted on Thu, Sep 2 2010 12:03 PM

Hi Mike,

If I want to modify the source code according to my needs, can I do that with the developer/personal license?

After development I want to use it to gather data that is consumed by my own application. arachnode.net will run on one machine.

My application has some paid users. Will I be able to do that with a per-machine commercial license?

 

I have one requirement: to fetch pages at very high speed.

I have plenty of CPU and network bandwidth. On a single server, what limit on pages crawled per second can I expect with arachnode.net at maximum throughput?

Can I disable indexing and other such limiting tasks? I want to do my own type of processing after fetching the raw page data. Can I do that? And after doing all sorts of optimization, how many pages per second on average can I fetch?

 

Thanks

Munish

 

 

All Replies

munishgoyal1 replied:

Typo correction: On a single server... how much network bandwidth, CPU, and other resources can I expect to utilize with arachnode.net running at full capacity?

Mike Anderson (arachnode.net) replied:

You can modify the code to do pretty much whatever you want to do.

I have customers that crawl at a million pages an hour.  Typically, the rate ranges from 1x with everything turned on to roughly 200x with everything turned off; in other words, disabling everything can yield up to 200 times the crawl rate.  There is a lot that AN can do, and of course, the more actions AN performs, the slower the crawl rate.  How fast you crawl depends on what you crawl, and what you do with the data once collected.

You can turn just about everything off that you do not need.

My crawling machine at home crawls at a rate of one page per second, per thread, bursting up to two pages per second, per thread.  This number can vary widely: some customers crawl at 277.78 pages per second using cloud-based solutions.  My limit is due to my ISP.
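
As a rough back-of-the-envelope check (not an official benchmark), those figures relate as follows; the thread count in this sketch is an illustrative assumption, not an arachnode.net default:

    // Back-of-the-envelope crawl throughput estimate, in C#.
    // The thread count is an illustrative assumption, not an arachnode.net setting.
    using System;

    class CrawlRateEstimate
    {
        static void Main()
        {
            const double pagesPerSecondPerThread = 1.0; // ~1 page/sec/thread, bursting to ~2
            const int threads = 25;                     // assumed number of crawl threads

            double pagesPerSecond = pagesPerSecondPerThread * threads;
            double pagesPerHour = pagesPerSecond * 3600;

            Console.WriteLine($"{threads} threads x {pagesPerSecondPerThread} page/sec = {pagesPerSecond} pages/sec");
            Console.WriteLine($"= {pagesPerHour:N0} pages/hour");

            // The 'million pages an hour' figure quoted above works out to:
            Console.WriteLine($"1,000,000 pages/hour = {1000000.0 / 3600:F2} pages/sec"); // ~277.78
        }
    }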

 

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Mike Anderson (arachnode.net) replied:

You can max out your entire machine, if you wish to do so.


munishgoyal1 replied:

Hi Mike,

You said, "I have customers that crawl at a million pages an hour." Did you mean by using a distributed crawl?

How much hardware do I need to achieve this scale? Does the out-of-the-box AN solution scale further than that?

Or can code modifications easily be made to make it scale further?

In my experience, DNS lookup normally slows the crawl rate. How is that handled in AN?

Basically, I am looking for scale and optimized throughput, and only for the crawl at the moment. Indexing, etc., I can plan to do offline later.

 

Based on this evaluation I plan to start with AN.

Thanks

Munish

 

Mike Anderson (arachnode.net) replied:

The customers that I support that achieve this crawl rate use Amazon's and Rackspace's cloud servers.  They split the crawler instances from the database and allocate a large chunk of RAM to the SQL machine, which serves as the primary Discovery cache.

AN itself is extremely efficient and IS limited by the hardware.  You can dial/scale AN far beyond what your network/disk/processor/memory resources will support.

It's tough for me to say how much hardware you will need because I don't know which settings you will need.  In relative terms, with all settings turned on AN crawls at a rate of 10x; with all settings turned off, at a rate of 100x.

Scalability: There are multiple levels of caching in place with AN.

1.) Each crawl instance maintains an MRU (most-recently-used) cache.

2.) SQL Server maintains a cache (Buffer Manager) that primarily serves Discoveries.  (The pieces of information that determine where AN has been and where it needs to go.)

The order of caching behavior proceeds as follows: if a crawler instance cannot find a Discovery in its local MRU cache, it checks the database.  If the Discovery is not found in the DB, the Discovery request is recorded for the future benefit of itself and other connected crawlers.  As all crawlers may report to the same database, a Discovery that misses the local MRU cache will be retrieved from the SQL Server Buffer Manager first (assuming it is located there), then from the HDD's disk cache (assuming it is located there), and finally from the Discoveries table itself.  Thus, it is important to ensure that your SQL Server has adequate RAM.  This can be measured by the 'Buffer Manager Cache Hit Ratio' performance counter.  In my crawling configurations I typically see a 99% hit ratio, which means that only 1% of the time do I need to retrieve the record from disk when a Discovery is asked for.  Of course, this assumes adequate RAM for SQL, and depends on the crawl load.
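
A minimal sketch of that lookup order, in C#; the class and member names here are hypothetical illustrations, not the actual arachnode.net API:

    // Sketch of the Discovery lookup order described above (MRU cache -> database).
    // The types and method names are hypothetical, not the arachnode.net API.
    using System.Collections.Generic;

    class DiscoveryCache
    {
        private readonly int _capacity;
        private readonly Dictionary<string, LinkedListNode<string>> _map = new Dictionary<string, LinkedListNode<string>>();
        private readonly LinkedList<string> _recency = new LinkedList<string>(); // most recent at the front

        public DiscoveryCache(int capacity) { _capacity = capacity; }

        public bool Contains(string absoluteUri)
        {
            if (!_map.TryGetValue(absoluteUri, out var node))
                return false;
            // Move to the front: most recently used.
            _recency.Remove(node);
            _recency.AddFirst(node);
            return true;
        }

        public void Add(string absoluteUri)
        {
            if (_map.ContainsKey(absoluteUri)) return;
            if (_map.Count >= _capacity)
            {
                // Evict the least recently used entry.
                var last = _recency.Last;
                _recency.RemoveLast();
                _map.Remove(last.Value);
            }
            _map[absoluteUri] = _recency.AddFirst(absoluteUri);
        }
    }

    class CrawlerInstance
    {
        private readonly DiscoveryCache _mruCache = new DiscoveryCache(100000);

        // Hypothetical stand-ins for the database round trip; in arachnode.net the
        // Discoveries table is served primarily from SQL Server's Buffer Manager (RAM).
        private bool ExistsInDatabase(string absoluteUri) => false;
        private void RecordInDatabase(string absoluteUri) { /* INSERT so other connected crawlers benefit */ }

        public bool HasBeenDiscovered(string absoluteUri)
        {
            if (_mruCache.Contains(absoluteUri))
                return true;                      // 1.) local MRU cache

            if (ExistsInDatabase(absoluteUri))    // 2.) SQL Server (Buffer Manager RAM, then disk)
            {
                _mruCache.Add(absoluteUri);
                return true;
            }

            RecordInDatabase(absoluteUri);        // not found: record it for this and other crawlers
            _mruCache.Add(absoluteUri);
            return false;
        }
    }

For step 2, SQL Server's 'Buffer cache hit ratio' counter (under the SQLServer:Buffer Manager performance object) is the standard way to verify that Discoveries are being served from RAM rather than disk.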

For those interested in the DNS lookup response, see here: http://arachnode.net/forums/t/1461.aspx
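
The linked thread covers AN's specific handling. As a generic illustration (not the arachnode.net implementation) of why per-request DNS lookups hurt and how a crawler typically mitigates them, resolving each host once and caching the result:

    // Generic DNS caching sketch for a crawler -- not the arachnode.net implementation.
    using System;
    using System.Collections.Concurrent;
    using System.Net;
    using System.Threading.Tasks;

    static class DnsCache
    {
        // Hostname -> resolved addresses, so each host is looked up only once.
        private static readonly ConcurrentDictionary<string, Task<IPAddress[]>> _cache =
            new ConcurrentDictionary<string, Task<IPAddress[]>>(StringComparer.OrdinalIgnoreCase);

        public static Task<IPAddress[]> ResolveAsync(string host)
        {
            // GetOrAdd ensures concurrent crawl threads share one in-flight lookup per host.
            return _cache.GetOrAdd(host, h => Dns.GetHostAddressesAsync(h));
        }
    }

    class Program
    {
        static async Task Main()
        {
            var uri = new Uri("http://arachnode.net/");
            IPAddress[] addresses = await DnsCache.ResolveAsync(uri.Host); // first call hits DNS
            IPAddress[] cached = await DnsCache.ResolveAsync(uri.Host);    // subsequent calls reuse the cached task
            Console.WriteLine($"{uri.Host} -> {addresses[0]} ({cached.Length} cached address(es))");
        }
    }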

These two videos are worth watching:

https://www.youtube.com/watch?v=GYnMihdtOmw

https://www.youtube.com/watch?v=aLeXrMUbWG4

This is worth reading:

http://arachnode.net/blogs/arachnode_net/archive/2010/04/29/troubleshooting-crawl-result-differences-between-different-crawl-environments.aspx

To BEST (albeit not concretely) answer your question about 'how much hardware do I need': I can crawl 1,000,000 WebPages served by the Test.aspx site in the Web project, with all AN settings enabled, in just under 24 hours.  Of course, doing so removes the limitations imposed by my ISP and by other webservers and networks.  My hardware is a Lenovo W500 laptop running Windows 7 Ultimate with 4GB of DDR3 RAM and a secondary 7200 RPM HDD serving the database.  Determine your crawl rate for the Test.aspx scenario and factor your results against your proposed hardware/network solution/implementation.
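
To make that 'factor your results' step concrete, a tiny sketch of the arithmetic; the reference figure comes from the post above, while the measured rate and target page count are placeholder assumptions you would replace with your own Test.aspx results:

    // Rough extrapolation from a Test.aspx benchmark to a target workload.
    // The reference figure (1,000,000 pages in just under 24 hours) is from the post above;
    // the other numbers are placeholders for your own measurements.
    using System;

    class Extrapolate
    {
        static void Main()
        {
            double referencePages = 1_000_000;
            double referenceHours = 24;                                      // "just under 24 hours"
            double referenceRate = referencePages / (referenceHours * 3600); // ~11.6 pages/sec

            double yourMeasuredRate = 25;        // pages/sec from your own Test.aspx run (assumed)
            double targetPages = 10_000_000;     // pages you intend to crawl (assumed)

            double estimatedHours = targetPages / yourMeasuredRate / 3600;

            Console.WriteLine($"Reference rate: {referenceRate:F1} pages/sec");
            Console.WriteLine($"Estimated time for {targetPages:N0} pages at {yourMeasuredRate} pages/sec: {estimatedHours:F1} hours");
        }
    }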

Finally: AN was created as a general purpose crawler, with many scenarios in mind.  The large scale customers I support have required code modifications, and by their grace their modifications have been included in the core AN product; thus, AN has retained the 'general purpose' nature I have strived to create.  If you are intent on large scale crawling, as I believe you are, let us continue this conversation, as your endeavors merit further discussion.  From my experience, the question of 'how fast can I go' cannot be answered in a forum post or two.  I need to know what you plan to crawl, how you plan to crawl, and where you plan to crawl from.  :)

Sincerely,
Mike Anderson, arachnode.net

FWIW: When I elected to promote AN to a first-class project, it was as a result of dissatisfaction with NUTCH.  I determined that result for result, I could crawl 2-3 times faster than the default NUTCH implementation.



copyright 2004-2017, arachnode.net LLC