arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE, MongoDB, RavenDB, or Hadoop


Console disappear/shutdown - memory issues

Answered (Verified) This post has 1 verified answer | 29 Replies | 3 Followers

Top 10 Contributor
229 Posts
megetron posted on Mon, Aug 17 2009 7:41 AM

Hello,

I am running the AN console on several sites. The memory grows, and so does the VM size the application is using. CPU is at about 40%, so I keep AN running.

A few hours later, when I come back to check the progress, the console is gone; it shut down without any message to view. I can only assume this is a memory leak and that Windows shut the application down automatically.

Has anyone else run into this behaviour, or is it just me?

Thanks.

Answered (Verified) Verified Answer

Top 10 Contributor
1,905 Posts
Verified by arachnode.net

You won't believe this:

        internal void ProcessCrawlRequest(CrawlRequest crawlRequest, bool obeyCrawlRules, bool executeCrawlActions)
        {
            if (crawlRequest.Discovery.Uri.AbsolutePath != "/robots.txt")
            {
                crawlRequest.WebClient.Method = "HEAD";
            }
            else
            {
                //Always GET the robots.txt file, as not all web servers support the HEAD method.
                crawlRequest.WebClient.Method = "GET";
            }
            crawlRequest.WebClient.Method = "GET"; //<-- this is the "marked line" referred to below; it forces a GET for every request.
            try
            {
                crawlRequest.Data = crawlRequest.WebClient.DownloadData(crawlRequest.Discovery.Uri.AbsoluteUri);
            }
            catch (WebException webException)
            {
                if (crawlRequest.WebClient.WebResponse == null || ((HttpWebResponse) crawlRequest.WebClient.WebResponse).StatusCode != HttpStatusCode.MethodNotAllowed)
                {
                    throw new WebException(webException.Message, crawlRequest.WebClient.WebException);
                }
            }

If you remove the marked line and let the WebClient make a HEAD request, .NET leaks memory all over the place.  Great!!! (not!!!)  Now I get to figure out why...

...so, I abandoned hope of figuring out why the .NET WebClient leaks memory when you issue a HEAD request, and now manage the streams myself.
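For anyone following along, here is a minimal sketch of that general approach - issue the HEAD request with HttpWebRequest and dispose of the response and stream explicitly. This is an illustration of the idea only, not the code that will be checked in:

    using System;
    using System.IO;
    using System.Net;

    internal static class ManagedStreamSketch
    {
        // Issue a HEAD request and dispose of the response immediately; only the
        // headers are wanted, so no response stream is held onto.
        internal static long GetContentLength(Uri absoluteUri)
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(absoluteUri);
            request.Method = "HEAD";

            using (WebResponse response = request.GetResponse())
            {
                return response.ContentLength;
            }
        }

        // Download the body with explicitly managed streams instead of
        // WebClient.DownloadData.
        internal static byte[] DownloadData(Uri absoluteUri)
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(absoluteUri);
            request.Method = "GET";

            using (WebResponse response = request.GetResponse())
            using (Stream responseStream = response.GetResponseStream())
            using (MemoryStream buffer = new MemoryStream())
            {
                byte[] chunk = new byte[8192];
                int bytesRead;

                while ((bytesRead = responseStream.Read(chunk, 0, chunk.Length)) > 0)
                {
                    buffer.Write(chunk, 0, bytesRead);
                }

                return buffer.ToArray();
            }
        }
    }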

Hopefully I can get everything checked in this weekend.  Big Smile

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies

Top 10 Contributor
1,905 Posts

You will see AN use memory up to the 'DesiredMaximumMemoryUsageInMegabytes' setting in cfg.Configuration, due to caching.  The default is 1 GB.

If you have a VM with 2 GB of RAM, then it will look like AN is trying to use all of your RAM.

Set 'DesiredMaximumMemoryUsageInMegabytes' to a lower number, lower than the amount of RAM you have available, and let me know what happens.
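If it is easier, you can lower the setting directly against the database. This is only a hedged example - it assumes cfg.Configuration is a simple key/value table, and the [Key]/[Value] column names may differ from the actual schema:

    using System.Data.SqlClient;

    internal static class ConfigurationSketch
    {
        // Hypothetical helper: lowers the cache ceiling by updating cfg.Configuration.
        // The [Key]/[Value] column names are an assumption; adjust them to the real schema.
        internal static void SetDesiredMaximumMemoryUsage(string connectionString, int megabytes)
        {
            using (SqlConnection connection = new SqlConnection(connectionString))
            using (SqlCommand command = new SqlCommand(
                "UPDATE cfg.Configuration SET [Value] = @value WHERE [Key] = @key", connection))
            {
                command.Parameters.AddWithValue("@key", "DesiredMaximumMemoryUsageInMegabytes");
                command.Parameters.AddWithValue("@value", megabytes.ToString());

                connection.Open();
                command.ExecuteNonQuery();
            }
        }
    }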

Also, are you running it via the application or from Visual Studio?

Mike


Top 10 Contributor
229 Posts
megetron replied on Mon, Aug 17 2009 12:37 PM

:)

I am running via Visual Studio in release mode. I am going to run it now with the new setting and let you know.

Top 10 Contributor
1,905 Posts

megetron:

:)

I am running via Visual Studio in release mode. I am going to run it now with the new setting and let you know.

Ouch!  I have never seen this!  Let me know what the repro is!


Top 10 Contributor
229 Posts

OK, after disabling most of the configuration properties to speed things up, and after limiting the parameter you posted to 512, which is very low, I still get the error and the AN console disappears. You can try this yourself on the seretv.net domain.

To hit the runtime library exception faster, disable some of the parameters in the configuration (like we did yesterday) and you will get the error. Be patient; it will come.

Top 10 Contributor
1,905 Posts

Your repro steps aren't specific enough.  Sad

I crawled this site in its entirety yesterday without issue.

Try repairing your installation of VS?


Top 10 Contributor
1,905 Posts

I did some investigating and have found a few small leaks.  :)

Working on them right now.

1.) Fixed a small leak in PriorityQueue.cs where the Last reference wasn't being removed (see the sketch right after this list).

2.) The Engine no longer recycles the Threads/Crawls if you stop and start the Engine.

3.) For CrawlRequest.Parent, the Parent property will no longer extend up generation after generation.  Only Parent references are maintained, not Grandparent references.
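To show the shape of the leak in 1.) (a generic illustration, not the actual PriorityQueue.cs code): if a queue keeps a Last reference for fast appends but never clears it, the last dequeued item stays rooted and is never collected.

    using System.Collections.Generic;

    // Generic illustration of the Last-reference leak - not the arachnode.net PriorityQueue.
    internal class QueueWithLastReference<T>
    {
        private readonly LinkedList<T> _items = new LinkedList<T>();

        // Kept so appends can be made without walking the list.
        public T Last { get; private set; }

        public void Enqueue(T item)
        {
            _items.AddLast(item);
            Last = item;
        }

        public T Dequeue()
        {
            T item = _items.First.Value;
            _items.RemoveFirst();

            // Without this, the final item remains reachable through Last even
            // after the queue is empty - the small leak described above.
            if (_items.Count == 0)
            {
                Last = default(T);
            }

            return item;
        }
    }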

So, these are all great fixes that improve crawling in general.  However, there is something that should be noted about what it means to crawl at a Depth of int.Max, which is how this thread came to be.

Let's say you are crawling a WebSite where each WebPage has 20 HyperLinks to 20 other distinct WebPages, and the site has infinite depth.

If you start a crawl and specify a depth of int.Max, you'll crawl the first page and create 20 additional CrawlRequests.  Those 20 CrawlRequests will have a reference back to the initial CrawlRequest at Depth == 1.  Each of the CrawlRequests at Depth == 2 will have a reference back to the CrawlRequests at

...to be continued.
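Even before the rest of that explanation, the shape of the fix in 3.) can be sketched generically (this is an illustration, not the actual CrawlRequest class): when a child request is created, the grandparent reference is detached, so each CrawlRequest keeps at most its immediate Parent and the chain no longer grows with Depth.

    using System;

    // Generic illustration only - not the arachnode.net CrawlRequest class.
    internal class CrawlRequestSketch
    {
        public Uri AbsoluteUri { get; private set; }
        public int Depth { get; private set; }
        public CrawlRequestSketch Parent { get; private set; }

        public CrawlRequestSketch(Uri absoluteUri, int depth, CrawlRequestSketch parent)
        {
            AbsoluteUri = absoluteUri;
            Depth = depth;
            Parent = parent;

            // Keep the immediate Parent, but detach the Grandparent so a crawl
            // at int.MaxValue Depth does not keep every ancestor alive.
            if (Parent != null)
            {
                Parent.Parent = null;
            }
        }
    }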


Top 10 Contributor
229 Posts
megetron replied on Sun, Aug 23 2009 10:34 AM

OK, I am going to test it without int.Max and check how it behaves.

For the last 2 days I have been trying to crawl a website using infinite depth, and the RAM exceeds the limit I specify in the configuration table. Furthermore, the VM size keeps increasing and I get the "Windows is running out of virtual memory" message. So, I wonder why AN is allowed to exceed its own limits.

 

Top 10 Contributor
1,905 Posts

The MaximumMemory setting is desired, but not guaranteed.  I have a few fixes that I need to check in.

For the time being, try crawling at a Depth of 3-4 and then get more CrawlRequests from the WebPages table.
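As a hedged example of that second step (the dbo.WebPages table and AbsoluteUri column names are assumptions about the exact schema), pulling stored URIs back out to seed the next round of CrawlRequests could look something like this:

    using System.Collections.Generic;
    using System.Data.SqlClient;

    internal static class WebPagesSketch
    {
        // Hedged sketch: read back AbsoluteUris stored by a shallower crawl so they
        // can be submitted as new CrawlRequests. Table/column names are assumptions.
        internal static List<string> GetAbsoluteUris(string connectionString, int maximum)
        {
            List<string> absoluteUris = new List<string>();

            using (SqlConnection connection = new SqlConnection(connectionString))
            using (SqlCommand command = new SqlCommand(
                "SELECT TOP (@maximum) AbsoluteUri FROM dbo.WebPages", connection))
            {
                command.Parameters.AddWithValue("@maximum", maximum);

                connection.Open();
                using (SqlDataReader reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        absoluteUris.Add(reader.GetString(0));
                    }
                }
            }

            return absoluteUris;
        }
    }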


Top 10 Contributor
229 Posts

Update on this post's status: Mike uploaded new fixes for this issue.

I tested this, and I don't get any more Windows notification errors... it seems like both the virtual memory and the RAM are stable now.

BUT now there is a new behaviour (I think I saw this behaviour even before the new build).

Now the console just stops in the middle of the crawl... it doesn't exit with an error, and there is no progress with the crawl. In Task Manager, CPU is at 0 and the RAM is still being used.

In the regular debug tables (Exceptions, DisallowedAbsoluteUris) there is nothing special... only some 404 errors.

I noticed another interesting issue that never happened before... the internet connection is up, but when opening a new browser there is no response from the server for any website. Even MSN Messenger fails to connect.
It has happened more than 3 times while crawling with AN.

I know these are not enough details, but this is what I have so far.

Please let me know if I can assist more.

NOTE: this new issue has nothing to do with depth; it happens at depth 4.

Top 10 Contributor
1,905 Posts

Is your internet connection being throttled?  I have been throttled by COMCAST before.  When the Crawl appears to stop, pause debugging and switch to the Crawl threads.  What is happening?  It could be that your internet provider is stalling packet delivery.

It could be this: In order to fix the 406 errors for .gifs I had to remove the ACCEPT header functionality (it wasn't working properly with today's webservers - go figure) ... perhaps you are downloading large movies and this is why the crawl appears to hang.  I should have time later in the week to finish the AllowedDataTypes functionality.

How is SQL feeling?

Mike


Top 10 Contributor
229 Posts

Hello Mike, I hope you feel better.

I don't know whether I'm being throttled... do I need to check on that with the provider?

What do you mean by switching to the Crawl thread? Please give detailed instructions so I can carry them out.
I do not download files at all... the plugin is off.

 

Cheers!

Top 10 Contributor
1,905 Posts

Thanks!  I do already!

In VS you can switch to a specific thread.

It makes multi-threaded debugging much easier.

Your disk subsystem could be falling behind?


Top 10 Contributor
229 Posts

I checked the Threads window and it seems like there are no threads at all; the window is empty.

In the Exceptions table I see hundreds of the same message: Unable to connect to the remote server

   at System.Net.HttpWebRequest.GetResponse()
   at System.Net.WebClient.GetWebResponse(WebRequest request)
   at Arachnode.SiteCrawler.Components.WebClient.GetWebResponse(WebRequest request) in E:\DEVELOPMENT\yeshira\SiteCrawler\Components\WebClient.cs:line 163

It seems like all of the traffic is stuck on the computer. It never happened before, but maybe that's only because previously I received the Windows error notifications and the console was already gone....

I am all confused here. I cannot complete a crawl of even one site.

QUOTE: Your disk subsystem could be falling behind?

What is a disk subsystem and how can I check this?

Top 10 Contributor
1,905 Posts

It sounds like you are being throttled - the site has decided to limit the number of requests you can make.

Look at the Frequency.cs rule.
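To give an idea of what that kind of rule does (a generic sketch, not the actual Frequency.cs code), the basic mechanism is to enforce a minimum delay between requests to the same host:

    using System;
    using System.Collections.Generic;
    using System.Threading;

    // Generic politeness-delay sketch - not the actual Frequency.cs rule.
    internal class HostFrequencyLimiter
    {
        private readonly TimeSpan _minimumDelay;
        private readonly Dictionary<string, DateTime> _nextAllowedRequestByHost =
            new Dictionary<string, DateTime>();
        private readonly object _lock = new object();

        public HostFrequencyLimiter(TimeSpan minimumDelay)
        {
            _minimumDelay = minimumDelay;
        }

        // Blocks the calling crawl thread until at least _minimumDelay has passed
        // since the previous request to the same host.
        public void WaitIfNecessary(Uri absoluteUri)
        {
            TimeSpan wait;

            lock (_lock)
            {
                DateTime now = DateTime.UtcNow;
                DateTime earliest = now;

                DateTime nextAllowed;
                if (_nextAllowedRequestByHost.TryGetValue(absoluteUri.Host, out nextAllowed) &&
                    nextAllowed > now)
                {
                    earliest = nextAllowed;
                }

                _nextAllowedRequestByHost[absoluteUri.Host] = earliest + _minimumDelay;

                wait = earliest - now;
            }

            if (wait > TimeSpan.Zero)
            {
                Thread.Sleep(wait);
            }
        }
    }

Calling something like WaitIfNecessary(crawlRequest.Discovery.Uri) before each download would keep the requests to a single site spaced out.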

This is a good doc to read: http://www.microsoft.com/whdc/archive/subsys_perf.mspx

