arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE, MongoDB, RavenDB, or Hadoop


Lost Crawl Requests?

Answered (Verified) This post has 1 verified answer | 8 Replies | 3 Followers

Top 25 Contributor
27 Posts
egecko posted on Tue, Jun 9 2015 6:26 PM

So, here's the scenario: we had AN crawling overnight, and when I returned in the morning there were 9M crawl requests queued in the table, as expected.  We have AN's memory bounds tightened down to a desired maximum memory of 1 megabyte, so this was not an unreasonable number of requests to have queued in the table.  I stopped AN and ran the optimization sproc to help reduce fragmentation in the database.  After activating the AN service, it sucked in all 9M crawl requests; however, shortly afterwards I realized I needed to perform some other maintenance on the server and decided to stop AN again, shortly after start-up.  After AN stopped I checked the CrawlRequests table and expected to see the ~9M crawl requests dumped back to the database, however only ~3K requests ended up back in the table. :(

Wish I could share more about what happened to those rows, but anything I could venture would be a complete guess and speculation.  I'll let you know more if I figure anything out, obviously it's not good to lose those crawl requests though. =\
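For reference, the queue depth before and after a Service stop can be snapshotted with a plain count against the CrawlRequests table.  A minimal sketch, assuming the default [dbo].[CrawlRequests] table name and a placeholder connection string:

// Minimal sketch: snapshot how many CrawlRequests are queued; run once before
// stopping the Service and once after it has fully stopped.
// The connection string is a placeholder; adjust it for your environment.
using System;
using System.Data.SqlClient;

class CrawlRequestCount
{
    static void Main()
    {
        const string connectionString =
            "Server=localhost;Database=arachnode;Integrated Security=true;"; // placeholder

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "SELECT COUNT_BIG(*) FROM [dbo].[CrawlRequests] WITH (NOLOCK);", connection))
        {
            connection.Open();

            long queued = (long)command.ExecuteScalar();

            Console.WriteLine("{0:N0} CrawlRequests queued at {1:u}", queued, DateTime.UtcNow);
        }
    }
}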


All Replies

Top 25 Contributor
27 Posts

By the way, AN was only running for less than two minutes, so it's not really plausible that it ripped through that many requests while it was running.

Top 10 Contributor
1,905 Posts

Why 1MB?  There is solid benefit to Discovery caching.  :)  If you don't want to cache CrawlRequests, set this:

OK, so, any exceptions in the Exceptions table?

Any rows in the DisallowedAbsoluteUris table?

Any rows in the Discoveries table before you started crawling?

Any plugins performing filtering?

Any other modifications you have made to the code?
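For what it's worth, the table checks above can be scripted if you need to repeat them.  A minimal sketch, assuming the default table names mentioned in this thread and a placeholder connection string:

// Minimal sketch of the diagnostic checks above: row counts for the tables
// that usually explain "missing" CrawlRequests.
// The connection string is a placeholder; adjust it for your environment.
using System;
using System.Data.SqlClient;

class CrawlDiagnostics
{
    static void Main()
    {
        const string connectionString =
            "Server=localhost;Database=arachnode;Integrated Security=true;"; // placeholder

        string[] tables = { "Exceptions", "DisallowedAbsoluteUris", "Discoveries" };

        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();

            foreach (string table in tables)
            {
                using (var command = new SqlCommand(
                    "SELECT COUNT_BIG(*) FROM [dbo].[" + table + "];", connection))
                {
                    Console.WriteLine("{0,-22}: {1:N0} rows", table, (long)command.ExecuteScalar());
                }
            }
        }
    }
}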

Mike

 


Top 25 Contributor
27 Posts
egecko replied on Tue, Jun 9 2015 11:28 PM

Why 1MB?  There is solid benefit to Discovery caching.  :)  

It's not so much that we don't want the benefit of discovery caching; it's more a preference to persist the crawl requests to the database instead of keeping them in RAM until a graceful shutdown.  I haven't had a chance to tinker with the code that handles this yet, but it's on the to-do list.  In an ideal world we'd have both. =]

OK, so, any exceptions in the Exceptions table?

Nope.

Any rows in the DisallowedAbsoluteUris table?

Nope, it's a solid zero (0). :)

Any rows in the Discoveries table before you started crawling?

Yes, I suppose a portion of the crawl requests could already have been discovered and then gotten disposed of that way (that'd be the other side of the double-edged sword of setting the discovery RAM low?).  There's a quick way to check that; see the sketch at the end of this post.

Any plugins performing filtering?

Nope.

Any other modifications you have made to the code?

Nope, not yet.  Just learning the internals now and working on tweaking the database performance.  We finally got everything moved to a crazy beefed-up server and minimized SQL Server's logging delay, so that bottleneck is gone at least.  We're still having some issues scaling it up; the best I've gotten it to perform is approximately 2.5 - 3.0 crawl requests processed per second.
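To test the "they were already discovered" theory above, one option is to count how many of the queued CrawlRequests already exist in the Discoveries table.  A minimal sketch; the AbsoluteUri column names are an assumption, so verify them against your actual schema, and the connection string is a placeholder:

// Minimal sketch: how many queued CrawlRequests already exist as Discoveries?
// Column names are assumed (AbsoluteUri on both tables); verify against your schema.
using System;
using System.Data.SqlClient;

class DiscoveryOverlap
{
    static void Main()
    {
        const string connectionString =
            "Server=localhost;Database=arachnode;Integrated Security=true;"; // placeholder

        const string sql = @"
            SELECT COUNT_BIG(*)
            FROM [dbo].[CrawlRequests] AS cr
            WHERE EXISTS (SELECT 1
                          FROM [dbo].[Discoveries] AS d
                          WHERE d.AbsoluteUri = cr.AbsoluteUri);";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            connection.Open();

            Console.WriteLine("{0:N0} queued CrawlRequests already exist as Discoveries.",
                (long)command.ExecuteScalar());
        }
    }
}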

Top 10 Contributor
1,905 Posts
I tested that AN works with low RAM, perhaps down to 50MB; I'm not sure I tested it at 1MB.  I think notepad.exe uses more than that. ;)  I believe AN uses about 70MB in the out-of-the-box configuration.

If you don't give AN any RAM to use then every single operation has to be checked at the DB.  I suggest giving AN at least 1GB.  Not giving AN RAM to work with is a good way to slow crawling.

If your CRs were deleted, you probably had most/all of the CRs accounted for in the Discoveries table, due to stopping/starting/not letting the Service Stop process completely finish.  The CRs were deleted from the DB, and the Service started crawling from the CrawlRequests.txt file.

I have my machine crawling at 500 CrawlRequests/sec., inserting WebPages, HyperLinks, Images, EmailAddresses, Files and all _Discoveries, managing Discoveries, reading from the DB for CRs, and deleting from the DB for CRs.

I am available for hourly contracting if you need direct help.
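As a sanity check on the memory side, you can compare the AN process's actual working set against whatever budget you configured.  A minimal sketch; the process name below is an assumption, so substitute whatever your AN service or console actually runs as:

// Minimal sketch: report the working set of the AN process so it can be compared
// against the configured desired maximum memory.
// "Arachnode.Service" is an assumed process name; substitute your own.
using System;
using System.Diagnostics;

class MemoryCheck
{
    static void Main()
    {
        foreach (var process in Process.GetProcessesByName("Arachnode.Service")) // assumed name
        {
            Console.WriteLine("{0} (PID {1}): {2:N0} MB working set",
                process.ProcessName, process.Id, process.WorkingSet64 / (1024 * 1024));
        }
    }
}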


Top 25 Contributor
23 Posts
JCrawl replied on Wed, Jun 10 2015 5:19 AM

Are you sure you don't have the resetdatabase option set?  That would explain why the requests were lost when the crawler started back up.

 

 

Top 25 Contributor
27 Posts
egecko replied on Wed, Jun 10 2015 7:33 PM

I didn't keep track of how small a memory footprint it could keep, but setting the desired max memory to 1 MB didn't have any noticeable or significant negative effect in terms of its own operation.  It did basically offload the discovery verification to the database, as you mentioned.  Part of the reason for doing this was to essentially get AN to maintain its state between runs.

We've since moved to evaluating its performance with the hyperlinks and hyperlink_discoveries turned off, and while that has significantly lightened the load on the SQL side, it does make AN a bit more fragile: if it has to stop/restart for some reason it ends up losing where it's at in the crawl and then spends time crawling stuff it already did. =\

I'll let my boss know you're available for consulting and I'll try to minimize my questions or limit them to more bug-related things I come across. =D  Thank you very much for your time and help! 

Top 25 Contributor
27 Posts
egecko replied on Wed, Jun 10 2015 7:35 PM

JCrawl - Yep, made sure that was turned off. =D  I'm leaning more towards thinking they were just duplicates of stuff that was already discovered and not necessarily "valid" new URLs to crawl.

Top 10 Contributor
1,905 Posts
Verified by arachnode.net

I didn't keep track of how small a memory footprint it could keep, but setting the desired max memory to 1 MB didn't have any noticeable or significant negative effect in terms of its own operation.  It did basically offload the discovery verification to the database, as you mentioned.  Part of the reason for doing this was to essentially get AN to maintain its state between runs.

If AN's caching isn't granting any benefit, then you are still asking too much of the disk, or your proxies are too slow, or they have been flagged by the sites you are trying to crawl.  AN always writes the Discoveries to disk, but if it can't use cache RAM then it will ALWAYS try to read from the disk as well.  Also, CrawlRequests will always have to be read from and deleted from the CrawlRequests table, whereas otherwise they could just be read from cache RAM.
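To put that tradeoff in plain C# terms, here is a generic cache-aside sketch (an illustration only, not AN's actual code): a discovery check is a dictionary hit when cache RAM is available, and a database/disk round trip when it isn't.

// Generic cache-aside illustration of the tradeoff described above; not AN's code.
using System;
using System.Collections.Generic;

class DiscoveryLookup
{
    private readonly Dictionary<string, bool> _cache =
        new Dictionary<string, bool>(StringComparer.OrdinalIgnoreCase);

    private readonly Func<string, bool> _databaseLookup; // stand-in for the SQL round trip

    public DiscoveryLookup(Func<string, bool> databaseLookup)
    {
        _databaseLookup = databaseLookup;
    }

    public bool HasBeenDiscovered(string absoluteUri)
    {
        bool discovered;

        // Cache hit: no disk or database work at all.
        if (_cache.TryGetValue(absoluteUri, out discovered))
        {
            return discovered;
        }

        // Cache miss (or no cache RAM available): pay for the database/disk read.
        discovered = _databaseLookup(absoluteUri);
        _cache[absoluteUri] = discovered;

        return discovered;
    }
}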

We've since moved to evaluating its performance with the hyperlinks and hyperlink_discoveries turned off, and while that has significantly lightened the load on the SQL side, it does make AN a bit more fragile: if it has to stop/restart for some reason it ends up losing where it's at in the crawl and then spends time crawling stuff it already did. =\

Not sure what 'fragile' means, technically speaking... :D

The Service resets itself, as previously mentioned.  Since there is no chance for user input, the AN Service operates on a cycle: when it is stopped, AN resets where it was, as stopping it must mean something needs to be changed with your crawling setup.  It's a Windows Service, just like SQL Server, and it is meant to stay running.

Once your config is solid there is no need for the Service to stop.  After the Service resets, write some code to give AN something else to do.  Feel free to comment out 'ResetCrawler'; a sketch of that change follows below.  You will need to account for clearing the Discoveries at some point, so don't forget to put this back once your crawling environment is where it needs to be.
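The shape of that change, as a sketch only: every identifier here except 'ResetCrawler' is a stand-in for whatever your own startup code actually calls.

// Sketch only: the Crawler type below is a stand-in, not the real arachnode.net class.
using System;

class Crawler
{
    public void ResetCrawler() { Console.WriteLine("Clearing crawl state between runs..."); }
    public void Crawl()        { Console.WriteLine("Crawling..."); }
}

class Program
{
    static void Main()
    {
        var crawler = new Crawler();

        // While you are still tuning your configuration, comment the reset out so a
        // stop/start does not wipe the queued work:
        // crawler.ResetCrawler();

        // ...and put it back once the environment is stable, because the Discoveries
        // still need to be cleared at some point.
        crawler.Crawl();
    }
}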

Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet


copyright 2004-2017, arachnode.net LLC