arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE, MongoDB, RavenDB, or Hadoop


The remote server returned an error: (404) Not Found

Not Answered | 21 Replies | 2 Followers

Top 10 Contributor
229 Posts
megetron posted on Fri, May 22 2009 8:32 AM

Hello,

I wonder: is there an option to ignore HTTP errors (like the 404 error) and continue crawling?
I think that when a 404 error occurs you could retry the request, and if the response is still anything other than 200 OK, fall back to the referrer page.
Does this option exist?

The error I am getting from time to time is this, in WebClient.cs:

    throw new WebException(webException.Message, webException);

+  $exception {"The remote server returned an error: (404) Not Found."} System.Exception {System.Net.WebException}
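For context, here is a minimal, self-contained sketch of the behaviour being asked about: catching the 404 inside the fetch and letting the crawl continue instead of rethrowing. This is not arachnode.net's actual WebClient.cs; the TryFetch helper and its logging are illustrative only.

using System;
using System.IO;
using System.Net;

class CrawlSketch
{
    // Hypothetical fetch helper: returns null on a 404 instead of throwing,
    // so the calling crawl loop can log the failure and move on.
    static string TryFetch(string absoluteUri)
    {
        try
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(absoluteUri);

            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            {
                return reader.ReadToEnd();
            }
        }
        catch (WebException webException)
        {
            HttpWebResponse errorResponse = webException.Response as HttpWebResponse;

            if (errorResponse != null && errorResponse.StatusCode == HttpStatusCode.NotFound)
            {
                // Record the 404 (e.g. in an Exceptions table) and keep crawling.
                Console.WriteLine("404 for {0}; skipping.", absoluteUri);
                return null;
            }

            throw; // Anything else still bubbles up for this page.
        }
    }

    static void Main()
    {
        // Example call; any URI will do.
        string content = TryFetch("http://www.movin.co.il/");
        Console.WriteLine(content == null ? "Skipped." : "Fetched " + content.Length + " chars.");
    }
}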

All Replies

Top 10 Contributor
1,905 Posts

This option does not exist and may not be what you really want to implement.  Check the volume/number of 404 errors in the Exceptions table...

If a site doesn't respond then it 1.) Probably isn't going to or 2.) Probably isn't going to.  :)

If you were to retry, you would slow down the crawling process significantly, and the most likely result is that fewer than 1/100th of a percent of the sites that returned a 404 would return a 200 on the next request.

Think about it like this: when browsing, how many sites in the last year have responded after you refreshed? I can't think of any from my end. And if they do respond after a refresh, how many of those are actually worth crawling?

Do you have a good case for trying again after a 404 is encountered?
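If someone did want to retry anyway, a hedged sketch of a bounded retry (not part of arachnode.net; the attempt count and delay are arbitrary illustration values) shows where the slowdown comes from:

using System;
using System.Net;
using System.Threading;

static class RetrySketch
{
    // Retry a request a small, fixed number of times before giving up.
    public static HttpWebResponse GetWithRetry(string absoluteUri, int maxAttempts)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                HttpWebRequest request = (HttpWebRequest)WebRequest.Create(absoluteUri);
                return (HttpWebResponse)request.GetResponse();
            }
            catch (WebException)
            {
                if (attempt >= maxAttempts)
                {
                    throw; // Give up; the caller logs the exception and moves on.
                }

                // Every retry stalls the crawl thread, which is the cost described above.
                Thread.Sleep(TimeSpan.FromSeconds(attempt));
            }
        }
    }

    static void Main()
    {
        using (HttpWebResponse response = GetWithRetry("http://movinsane.com/", 2))
        {
            Console.WriteLine(response.StatusCode);
        }
    }
}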

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
229 Posts

Sorry for the late response.

Yes, there is: when you are crawling only a few sites, it is important to retry repeatedly and wait for a site to come back online if it is down.

Furthermore, I keep getting 404s on sites that work for me in IE, Firefox, etc., so I am not sure, but could it be a problem with the request arachnode.net makes?

I will keep testing it, and let you know if I have any news on that.

Thanks, Mike.

Top 10 Contributor
1,905 Posts

What are some sites that generate 404s?

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
229 Posts

Something is wrong, because when I browse the supposedly offline sites from my own location and IP, they are actually online. I guess the 404 error depends on your location. I am from Israel, so I don't know whether that has something to do with it.

I will collect a list of sites that appear to be offline but are actually online, and you can test them from your location.

I hope to hear your answer after your vacation. Enjoy it.

Top 10 Contributor
229 Posts

http://movinsane.com/
http://movin.co.il/

Here are another two sites where I am having the same problem. The second site crawls well, and from time to time there is a 404 error. I don't think the site is offline every few seconds; maybe the specific page was not found, but why is the exception displayed and why does the crawling stop? It would be better to continue the crawling and record it in the log and the Exceptions table.

Top 10 Contributor
229 Posts

http://movinsane.com/
http://www.movin.co.il/

Both will give 404 errors; the difference between them is that the first one gives 404 errors consistently.

Why is the exception displayed in VS2008? Why doesn't it save the exception data to the Exceptions table and continue crawling?

Maybe there are sites with lots of broken links, and we still wish to crawl them.

How do I remove these errors?


Top 10 Contributor
1,905 Posts

Both of those sites come up now.

I'm wondering if those sites throw a 404 error if your user-agent string isn't one of the well-known browsers, or isn't GoogleBot or Slurp!, etc...
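One quick way to test that theory is to request the same page with a browser-like User-Agent and see whether the 404 goes away; a minimal sketch (the UA string is just an example):

using System;
using System.Net;

class UserAgentTest
{
    static void Main()
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.movin.co.il/");

        // Mimic a well-known browser; an empty or unusual UA is what some sites reject.
        request.UserAgent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)";

        try
        {
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            {
                Console.WriteLine("Status: {0}", response.StatusCode);
            }
        }
        catch (WebException webException)
        {
            Console.WriteLine("Failed: {0}", webException.Message);
        }
    }
}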

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
1,905 Posts

Check out this post: http://social.msdn.microsoft.com/Forums/en-US/vsdebug/thread/0e4394d5-9f72-46ca-aac2-13beac80629c

This can be solved by setting the proper "Just my code" option.

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
229 Posts

arachnode.net:

Both of those sites come up now.

I'm wondering if those sites throw a 404 error if your user-agent string isn't one of the well-known browsers, or isn't GoogleBot or Slurp!, etc...

Could it come from the robots.txt file? Should I disable it and test?
How can I tell whether this is a user-agent issue? Can I force the code to identify itself as a well-known browser just to make sure this is the issue?

Thanks for the "Just my code" trick.

Top 10 Contributor
229 Posts

Some of the pages extracted by arachnode.net look like this: movie-877-%EF%BF%BD . This is only part of the absolute URI; it is cut off in the middle for some reason, the absolute URI comes out wrong, and that is the reason for the 404 message.

Now, the question: is this the URL that appears on the web site, so the mistake belongs to the website, or does arachnode.net fail for some unknown reason when extracting URLs with an encoding other than English?

Note: most of the pages crawl well, but some of them fail.
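As an aside, %EF%BF%BD is the percent-encoded UTF-8 replacement character (U+FFFD), which usually appears when non-ASCII bytes are decoded with the wrong encoding before the absolute URI is built. A small sketch of how that can happen (the Hebrew sample text and the windows-1255 code page are just an illustration):

using System;
using System.Text;

class EncodingSketch
{
    static void Main()
    {
        // A link target containing non-English characters, stored as bytes in a
        // single-byte code page (windows-1255 is common on Hebrew pages).
        byte[] pageBytes = Encoding.GetEncoding("windows-1255").GetBytes("movie-877-שלום");

        // Decoding those bytes as UTF-8 produces U+FFFD replacement characters...
        string wronglyDecoded = Encoding.UTF8.GetString(pageBytes);

        // ...which percent-encode to %EF%BF%BD in the resulting absolute URI.
        Console.WriteLine(Uri.EscapeUriString(wronglyDecoded));
    }
}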

Top 10 Contributor
229 Posts

I am looking at the DisallowedAbsoluteUris table and I find this line:

16 898 7 http://movin.co.il/movie-861-Rest_Stop:_Don The remote server returned an error: (404) Not Found.

 

Whereas the real URL is: http://www.movin.co.il/movie-861-Rest_Stop:_Don'T_Look_Back.html

Just a guess... maybe when extracting URIs it failed to handle characters such as ":_"?

It seems like there are some problems with extraction, or, again, it could just be a problem with the specific site.

Please let me know what you think.
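To illustrate the kind of bug being guessed at here (this is not arachnode.net's actual extraction code), a naive href regex that treats an apostrophe as the end of the attribute truncates the URI exactly this way:

using System;
using System.Text.RegularExpressions;

class ExtractionSketch
{
    static void Main()
    {
        string html =
            "<a href=\"http://www.movin.co.il/movie-861-Rest_Stop:_Don'T_Look_Back.html\">x</a>";

        // Treating both quote styles as terminators stops the capture at "...Don".
        Match naive = Regex.Match(html, "href=[\"']([^\"']*)");
        Console.WriteLine(naive.Groups[1].Value);
        // Prints: http://www.movin.co.il/movie-861-Rest_Stop:_Don

        // Matching only the delimiter that actually opened the attribute keeps the full URI.
        Match better = Regex.Match(html, "href=(\"([^\"]*)\"|'([^']*)')");
        Console.WriteLine(better.Groups[2].Value + better.Groups[3].Value);
    }
}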

Top 10 Contributor
1,905 Posts

404 errors don't come from robots.txt files.

Check out 'UserAgent' in the 'Configuration' database table.
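A hedged sketch of checking that setting from code; the connection string and the Configuration table's column names (assumed here to be Key/Value) may differ in your install:

using System;
using System.Data.SqlClient;

class ConfigurationCheck
{
    static void Main()
    {
        // Adjust the connection string for your arachnode.net database.
        string connectionString = "Server=.;Database=arachnode.net;Integrated Security=true";

        using (SqlConnection connection = new SqlConnection(connectionString))
        {
            connection.Open();

            // Column names are an assumption; check the actual Configuration table schema.
            SqlCommand command = new SqlCommand(
                "SELECT [Value] FROM dbo.Configuration WHERE [Key] = 'UserAgent'", connection);

            Console.WriteLine("Current UserAgent: {0}", command.ExecuteScalar());
        }
    }
}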

You are welcome.

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Page 1 of 2 (22 items)