arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE
An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop


Console disappear/shutdown - memory issues

Answered (Verified) This post has 1 verified answer | 29 Replies | 3 Followers

Top 10 Contributor
229 Posts
megetron posted on Mon, Aug 17 2009 7:41 AM

Hello,

I am running the AN console on several sites. The memory grows, and so does the VM size the application is using. CPU is at about ~40%, so I keep AN running.

A few hours later, when I come back to check on the progress, the console is gone; it shut down without any message to view. I can only assume this is a memory leak, and Windows shut the application down automatically.

Has anyone run into this behaviour, or is it just me?

Thanks.


All Replies

Top 10 Contributor
229 Posts

NOTE: I am running AN on Windows Server 2003.

arachnode.net:

It sounds like you are being throttled - the site has decided to limit the number of requests you can make.

Look at the Frequency.cs rule.

Mike, I am not being throttled. I am testing the 1.3 beta version - this doesn't happen with the 1.2 release version.
Just to be 100% positive I can give the Frequency rule a try, but I doubt that's it.

BTW, when crawling with the 1.2 version at a depth of 4, the RAM seems to rise to very high levels too, and so does the VM size, and again I get the Windows warning "Virtual memory too low".

Something went wrong with your last changes; I just can't say exactly what.

Another issue that might interest you: in the new build you create the counters for the console output, and there you divide by some parameter. When I debug the code I can see that the parameter is sometimes ZERO, so you divide by zero.
I don't know whether this has anything to do with this specific post, but I wanted to point it out.
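
Something like the following guard would avoid it (just an illustration - the class, method, and parameter names are made up, not the actual AN console code):

    using System;
    using System.Diagnostics;

    internal static class CounterSketch
    {
        //hypothetical counter code, for illustration only - not the AN console's real counters.
        internal static void ReportRate(long totalCrawlRequestsCompleted, Stopwatch stopwatch)
        {
            double elapsedSeconds = stopwatch.Elapsed.TotalSeconds;

            //guard against a zero divisor: integer division would throw DivideByZeroException,
            //and floating-point division would print Infinity instead of a useful rate.
            double crawlRequestsPerSecond = elapsedSeconds > 0
                ? totalCrawlRequestsCompleted / elapsedSeconds
                : 0;

            Console.WriteLine("CrawlRequests/sec: {0:F2}", crawlRequestsPerSecond);
        }
    }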

 

 

arachnode.net:

Interesting. I will do some research on how to improve it.

BTW, what are the minimal hardware requirements for AN?

 

Top 10 Contributor
1,905 Posts

What specific site is giving you problems?

There aren't any specific HW requirements for AN since you can dial it down to use next to nothing of the host machine.

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
229 Posts

arachnode.net:

What specific site is giving you problems?

All of the websites that take too long to test (sratim.co.il is one of them, depth 4). Could it be an issue with Windows Server 2003? Do you have that OS? Can you run a test on it?

 

Top 10 Contributor
1,905 Posts

I do have this OS.

I am super busy with my work week - but I will be working on AN on Saturday.  We can IM then to figure out what is going on...

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
1,905 Posts

I have tracked down what the issue is regarding memory consumption.

There is a piece of code in CrawlRequestManager.cs that manages File and Image discoveries. If a File or an Image is discovered by another thread but its DiscoveryType hasn't been assigned yet (== DiscoveryType.None in this bug), the Discovery is placed back into the CrawlRequest queue for that Crawl. The intent is that by the next time the Discovery is dequeued for crawling, the other Crawl that also discovered it will have assigned the DiscoveryType, so it can be submitted immediately to the database instead of being re-crawled.

What was missing was a check for 'IsDisallowed' - the code only asked "Is this DiscoveryType == DiscoveryType.None?"  A Discovery can have DiscoveryType == DiscoveryType.None if no DataType is specified in cfg.AllowedDataTypes.  Here's the important part: since each Discovery contains a reference to the CrawlRequest which discovered it, every page that contained a Disallowed Discovery was being kept in RAM.  Adding a check for 'IsDisallowed' fixes this memory consumption problem.
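
Roughly, the shape of the fix looks like this (a simplified sketch with stand-in types, not the actual CrawlRequestManager.cs code):

    using System.Collections.Generic;

    //minimal stand-ins for the AN types mentioned above - names only, for illustration.
    internal enum DiscoveryType { None, File, Image, WebPage }

    internal class CrawlRequest { }

    internal class Discovery
    {
        public DiscoveryType DiscoveryType { get; set; }
        public bool IsDisallowed { get; set; }
        public CrawlRequest CrawlRequest { get; set; }    //the page that discovered this Discovery.
    }

    internal static class RequeueSketch
    {
        //before the fix only the DiscoveryType was checked, so Disallowed Discoveries (and the
        //pages referencing them) were re-queued over and over and stayed rooted in RAM.
        internal static void ProcessDiscovery(Discovery discovery, Queue<CrawlRequest> crawlRequests, ICollection<Discovery> databaseSubmissions)
        {
            if (discovery.DiscoveryType == DiscoveryType.None && !discovery.IsDisallowed)
            {
                //re-queue: another Crawl should assign the DiscoveryType before this is dequeued again.
                crawlRequests.Enqueue(discovery.CrawlRequest);
            }
            else
            {
                //known type (or Disallowed): submit to the database instead of re-crawling,
                //which releases the referencing CrawlRequest/page from RAM.
                databaseSubmissions.Add(discovery);
            }
        }
    }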

Thanks for your patience!  :)


Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
229 Posts

Ok, good!

Today I crawled with the 1.2 version using the INSERTHYPERLINKSTODATABASE and createcrawlrequestsfromhyperlinks database settings.

Top 10 Contributor
1,905 Posts

AN needs more RAM for your crawls.

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
Male
101 Posts
Kevin replied on Mon, Aug 31 2009 11:49 AM

Nice catch on the memory issue, Mike :)

 

Top 10 Contributor
1,905 Posts

:)  This one was all Megetron  :)

I'm on week 2 of a two week sprint at work but am working hard on finishing up the last of the improvements/bug fixes to the caching code.  Then, we should be at Version 1.3.

-Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
1,905 Posts

Just a quick update here... MEMORY USAGE HAS BEEN DRASTICALLY (DRASTICALLY FOR THE BETTER) IMPROVED FOR VERSION 1.3!!!

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
229 Posts

arachnode.net:

Just a quick update here... MEMORY USAGE HAS BEEN DRASTICALLY (DRASTICALLY FOR THE BETTER) IMPROVED FOR VERSION 1.3!!!

 

SO HAPPY TO HEAR THAT :)

waiting for the new version to test.

Top 10 Contributor
1,905 Posts

It's coming - I'm working on a test site so I can verify that I didn't break anything.

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
1,905 Posts
Verified by arachnode.net

You won't believe this:

        internal void ProcessCrawlRequest(CrawlRequest crawlRequest, bool obeyCrawlRules, bool executeCrawlActions)
        {
            if (crawlRequest.Discovery.Uri.AbsolutePath != "/robots.txt")
            {
                crawlRequest.WebClient.Method = "HEAD";
            }
            else
            {
                //always GET the robots.txt file, as not all web servers support the HEAD method.
                crawlRequest.WebClient.Method = "GET";
            }
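            //workaround: force a GET for every request - this is the overriding assignment referred to below.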
            crawlRequest.WebClient.Method = "GET";
            try
            {
                crawlRequest.Data = crawlRequest.WebClient.DownloadData(crawlRequest.Discovery.Uri.AbsoluteUri);
            }
            catch (WebException webException)
            {
                if (crawlRequest.WebClient.WebResponse == null || ((HttpWebResponse) crawlRequest.WebClient.WebResponse).StatusCode != HttpStatusCode.MethodNotAllowed)
                {
                    throw new WebException(webException.Message, crawlRequest.WebClient.WebException);
                }
            }

If you remove that overriding GET assignment (the line marked with the workaround comment above) and let the WebClient make a HEAD request, .NET leaks memory all over the place.  Great!!! (not!!!)  Now I get to figure out why...

...so, I abandoned hope of figuring out why the .NET WebClient leaks memory when you issue a HEAD request, and now manage the streams myself.
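
Roughly, "managing the streams myself" looks like this (a simplified sketch of the general technique, not the code that will be checked in - the class and method names here are placeholders):

    using System;
    using System.IO;
    using System.Net;

    internal static class HeadRequestSketch
    {
        //sketch only: issue a HEAD (or GET) with HttpWebRequest and dispose the response
        //and stream deterministically, instead of relying on the WebClient to clean up.
        internal static byte[] Download(Uri uri, bool useHead)
        {
            HttpWebRequest request = (HttpWebRequest) WebRequest.Create(uri);
            request.Method = useHead ? "HEAD" : "GET";

            using (HttpWebResponse response = (HttpWebResponse) request.GetResponse())
            using (Stream responseStream = response.GetResponseStream())
            using (MemoryStream memoryStream = new MemoryStream())
            {
                //a HEAD response carries no body, so this loop simply copies zero bytes.
                byte[] buffer = new byte[8192];
                int bytesRead;

                while ((bytesRead = responseStream.Read(buffer, 0, buffer.Length)) > 0)
                {
                    memoryStream.Write(buffer, 0, bytesRead);
                }

                return memoryStream.ToArray();
            }
        }
    }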

Hopefully I can get everything checked in this weekend.  :)

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
Male
101 Posts
Kevin replied on Sat, Sep 12 2009 10:46 AM

Bummer!  How did you ever track the memory leak down to the GET request method?!

 

Top 10 Contributor
1,905 Posts

It's actually the HEAD method.  It took quite a while since the ANTS profiler wasn't showing me anything out of the ordinary.  I basically had to call GC.Collect() and return function by function, line by line, until I found it.  Fun!!!  (not)
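
The measurement loop for that exercise is roughly the following (sketch only; the suspect delegate is a placeholder for whatever code path is under suspicion):

    using System;

    internal static class LeakHuntSketch
    {
        //brute-force narrowing: force a full collection, snapshot managed memory, run the suspect
        //code path, then compare. Comment code out piece by piece until the growth disappears.
        internal static void Measure(Action suspectCodePath)
        {
            GC.Collect();
            GC.WaitForPendingFinalizers();
            long before = GC.GetTotalMemory(true);

            suspectCodePath();

            GC.Collect();
            GC.WaitForPendingFinalizers();
            long after = GC.GetTotalMemory(true);

            Console.WriteLine("Delta: {0:N0} bytes", after - before);
        }
    }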

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Page 2 of 2 (30 items)

copyright 2004-2017, arachnode.net LLC