arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL 2005/2008/CE

Exceptions query: do my results look typical?

DataMan posted on Fri, Apr 9 2010 7:51 AM

When I run the following query

-- Collapse all DNS failures into one row by grouping on the fixed message prefix.
SELECT LEFT(Message, 39) AS Expr1, COUNT(ID) AS Expr2
FROM Exceptions
WHERE Message LIKE 'The remote name could not be resolved:%'
GROUP BY LEFT(Message, 39)
UNION
-- Count every other exception by its full message.
SELECT Message AS Expr1, COUNT(ID) AS Expr2
FROM Exceptions
WHERE Message NOT LIKE 'The remote name could not be resolved:%'
GROUP BY Message
ORDER BY Expr2 DESC

I get the following results, which I've truncated to just the highest counts:

31604 The remote server returned an error: (404) Not Found.
8855 Unable to connect to the remote server
8390 The remote name could not be resolved:
1263 The remote server returned an error: (401) Unauthorized.
766 The remote server returned an error: (403) Forbidden.
727 The remote server returned an error: (500) Internal Server Error.
545 The operation has timed out
501 Invalid URI: The hostname could not be parsed.
494 'charset=iso-8859-1' is not a supported encoding name. Parameter name: name
390 Too many automatic redirections were attempted.
369 The value of the date string in the header is invalid.
292 The remote server returned an error: (400) Bad Request.
248 The underlying connection was closed: An unexpected error occurred on a send.
215 The request was aborted: The request was canceled.
159 The remote server returned an error: (503) Server Unavailable.
136 'ISO 8859-1' is not a supported encoding name. Parameter name: name
104 'ansi_x3.110-1983' is not a supported encoding name. Parameter name: name

Do the top four exception counts look typical compared with anyone else's results? Should I be concerned about a count of 31604 for the 404 error?

My web page count is 296587 at the moment.
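
For a rough sense of scale, the top counts can be set against the web page count. The sketch below is plain Python used only for the arithmetic; `per_hundred_pages` is a hypothetical helper, and it assumes the web page count is a fair stand-in for successful requests, which the Exceptions table alone does not confirm.

```python
# Counts copied from the results above; the labels are shortened for display.
exception_counts = {
    "(404) Not Found": 31604,
    "Unable to connect to the remote server": 8855,
    "The remote name could not be resolved": 8390,
    "(401) Unauthorized": 1263,
}
web_pages = 296587  # web page count reported in the post


def per_hundred_pages(count: int, pages: int) -> float:
    """Exceptions per 100 stored web pages (a rough ratio, not a true error rate)."""
    return 100.0 * count / pages


for label, count in exception_counts.items():
    print(f"{label}: {per_hundred_pages(count, web_pages):.2f} per 100 pages")
```

On these numbers the 404s work out to roughly one for every nine to ten stored pages.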

Verified Answer

Looks normal to me.

You can always spot check a few.

Things to keep in mind:

1.) Sites may only allow you to retrieve X Discoveries in any given period.

2.) This value varies widely from site to site, although you may not hit their thresholds due to crawling speeds and attempted politeness (round-robin requests, delays).

3.) Your ISP may restrict the number of connections you can make in any given period.

4.) There are a ton of broken links on the internet.  :D
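
To make the politeness point concrete, here is a minimal sketch of round-robin ordering plus a per-host delay. It is illustrative Python, not arachnode.net's actual scheduler; `round_robin_by_host`, `seconds_to_wait`, and the `min_delay` parameter are hypothetical names.

```python
from collections import deque
from urllib.parse import urlsplit


def round_robin_by_host(urls):
    """Reorder URLs so that consecutive requests rotate across hosts."""
    queues = {}  # host -> deque of that host's URLs, in arrival order
    for url in urls:
        host = urlsplit(url).hostname or ""
        queues.setdefault(host, deque()).append(url)

    rotation = deque(queues)  # hosts in first-seen order
    ordered = []
    while rotation:
        host = rotation.popleft()
        ordered.append(queues[host].popleft())
        if queues[host]:
            rotation.append(host)  # host still has URLs, so it rejoins the back
    return ordered


def seconds_to_wait(last_request_at, now, min_delay):
    """How long to pause before hitting the same host again (never negative)."""
    return max(0.0, min_delay - (now - last_request_at))
```

Spreading requests this way means a single host only sees back-to-back requests once every other host's queue is empty, and the delay check covers that remaining case.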

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet


All Replies


Also, this exception typically means that your ISP is throttling you:

8855 Unable to connect to the remote server.

31604 The remote server returned an error: (404) Not Found.  (as discussed)
8855 Unable to connect to the remote server (ISP throttling)
8390 The remote name could not be resolved:  (DNS errors - www.asdfsdgawsrvwesdvsdssdsfsdf.com, etc.)
1263 The remote server returned an error: (401) Unauthorized.
766 The remote server returned an error: (403) Forbidden.
727 The remote server returned an error: (500) Internal Server Error.
545 The operation has timed out (default 60-second timeout to connect)
501 Invalid URI: The hostname could not be parsed. (chrome://)
494 'charset=iso-8859-1' is not a supported encoding name. Parameter name: name (custom meta tags, HttpRequestHeaders implemented improperly by site owners)
390 Too many automatic redirections were attempted.  (Recently improved cookie handling should take care of this one; check SVN / WebClient.cs.)
369 The value of the date string in the header is invalid.  (custom webserver headers - bugs on the part of the site owners)
292 The remote server returned an error: (400) Bad Request.
248 The underlying connection was closed: An unexpected error occurred on a send.  (server reboots, bounces, load balancing)
215 The request was aborted: The request was canceled.  (server reboots, bounces, load balancing)
159 The remote server returned an error: (503) Server Unavailable.
136 'ISO 8859-1' is not a supported encoding name. Parameter name: name  (custom webserver headers - bugs on the part of the site owners)
104 'ansi_x3.110-1983' is not a supported encoding name. Parameter name: name  (custom webserver headers - bugs on the part of the site owners)

The last two could likely be enhanced.  Can you give me an example?
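
One possible direction for the encoding-name exceptions is to clean up the declared charset before looking it up and to fall back to a default when the lookup still fails. The sketch below is Python with `codecs.lookup` standing in for the .NET encoding lookup; it is not arachnode.net code, and `resolve_charset` and its `default` parameter are hypothetical.

```python
import codecs


def resolve_charset(declared: str, default: str = "utf-8") -> str:
    """Return a codec name Python recognises for a declared charset, else the default."""
    name = declared.strip().strip("\"'")
    # Some sites send the whole "charset=..." parameter where only the name belongs.
    if name.lower().startswith("charset="):
        name = name[len("charset="):].strip().strip("\"'")
    # "ISO 8859-1" written with a space is treated as "ISO-8859-1".
    name = name.replace(" ", "-")
    try:
        return codecs.lookup(name).name
    except LookupError:
        # Unknown to this runtime: fall back instead of raising.
        return default


for declared in ("charset=iso-8859-1", "ISO 8859-1", "ansi_x3.110-1983"):
    print(declared, "->", resolve_charset(declared))
```

Whether a silent fallback is acceptable depends on the crawl; logging the original declared value alongside the fallback keeps the bad headers visible.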



copyright 2004-2013, arachnode.net LLC