arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE or MongoDB/RavenDB/Hadoop

Making the crawler work faster


Top 25 Contributor
25 Posts
samarhtc posted on Wed, Nov 21 2012 9:12 AM

Hi

I'm in a situation where I have to crawl some very large websites. The resulting DB is approximately 120 GB. That is actually OK, but the time it takes to crawl one website is way too long. I'm crawling with 100 threads, but it still takes days.

So I was wondering how to minimize the crawl time. All I need from the crawl is the raw web page data, meaning the HTML source only: no images or anything else.

I was thinking that if I somehow disabled the encoding of the Source column within the WebPages table, maybe that would speed up the process, since AN would then not have to encode/decode the web pages; it would simply insert the HTML into the Source column (or a new column next to it, Source_2!) as nvarchar.

What do you think? Any suggestion would be appreciated.


All Replies

Top 10 Contributor
1,905 Posts

This is an involved topic. The first thing I would suggest is to NOT insert the source into the DB. Save it to disk instead. MUCH FASTER.
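
If you want to try that, it is a configuration change rather than a code change. The property names in the sketch below are assumptions from memory, so verify them against ApplicationSettings.cs and the cfg.Configuration table in your build:

    // Assumed property names -- verify against your version's ApplicationSettings.cs.
    ApplicationSettings.InsertWebPages = true;                // keep the WebPages metadata rows
    ApplicationSettings.InsertWebPageSource = false;          // but don't write the Source column
    ApplicationSettings.SaveDiscoveredWebPagesToDisk = true;  // write the raw HTML to disk instead
    ApplicationSettings.DownloadedWebPagesDirectory = @"C:\arachnode.net\DownloadedWebPages";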

Take a look at dbo.arachnode_usp_arachnode.net_OPTIMIZE to get a better idea of how fragmented the DB is.  You can turn off indexes you don't need as well.
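
If you'd rather check fragmentation directly, something like the following works against any SQL Server database via the standard sys.dm_db_index_physical_stats DMV; the connection string is a placeholder for your own:

    using System;
    using System.Data.SqlClient;

    class FragmentationCheck
    {
        static void Main()
        {
            // Placeholder connection string -- point it at your arachnode.net catalog.
            const string connectionString = @"Data Source=.;Initial Catalog=arachnode.net;Integrated Security=True";

            const string query = @"
                SELECT OBJECT_NAME(ips.object_id) AS TableName,
                       i.name                     AS IndexName,
                       ips.avg_fragmentation_in_percent
                FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') ips
                JOIN sys.indexes i
                  ON i.object_id = ips.object_id AND i.index_id = ips.index_id
                WHERE ips.avg_fragmentation_in_percent > 10
                ORDER BY ips.avg_fragmentation_in_percent DESC;";

            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand(query, connection))
            {
                connection.Open();
                using (SqlDataReader reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        // Print the most fragmented indexes first.
                        Console.WriteLine("{0}.{1}: {2:F1}% fragmented",
                            reader[0], reader[1], reader[2]);
                    }
                }
            }
        }
    }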

If you don't care about the encoding, yes, simply comment out the EncodingManager.cs portions where the encoding is determined and assign the DecodedHtml to Encoding.Default.GetString(Data).
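
The change looks roughly like the sketch below. DecodedHtml and Data are the members named above; DetectEncoding is a hypothetical stand-in for whatever detection logic your version of EncodingManager.cs actually contains:

    using System.Text;

    // In EncodingManager.cs -- a sketch, not the literal file contents.
    // Before: detect the page's charset, then decode with it.
    //   Encoding encoding = DetectEncoding(Data);   // hypothetical detection step -- comment this out
    //   DecodedHtml = encoding.GetString(Data);

    // After: skip detection and decode with the machine's default codepage.
    DecodedHtml = Encoding.Default.GetString(Data);

The caveat is that pages served in a codepage other than the machine default will then decode incorrectly, which may be acceptable if all you need is the raw markup.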

How long would downloading 120GB take on your connection now?
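
As a rough sanity check: at a sustained 100 Mbps (~12.5 MB/s), 120 GB is about 2.7 hours of pure transfer; at 10 Mbps it is well over a day, before any parsing or storage cost.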

You may be throttled by your ISP, or the site itself.

I'm headed out for the weekend, but I can reply later this evening.

Send me a private email and I will take a look at the actual site.  We can benchmark how long it takes to crawl, say, at a depth of three.

Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 25 Contributor
25 Posts

Hi Mike

Thanks a lot for your answer. 

Let me try your suggestion regarding the EncodingManager first; maybe that will be enough.

Sincerely

 

