arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL 2005/2008/CE

404 is returned because trailing slash is not used

canuckbbp posted on Thu, Nov 10 2011 3:05 PM

When I crawl this site:

http://www.jenkinskling.com

the following response is returned:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>R a z o r B a l l</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="refresh" content="0;url=http://www.jenkinskling.com/jenkinskling/">
<script type="text/javascript" language="JavaScript">document.location.href="http://www.jenkinskling.com/jenkinskling/";</script>
</head>
<body> </body>
</html>

Arachnode sets up the next crawl request for http://www.jenkinskling.com/jenkinskling instead of http://www.jenkinskling.com/jenkinskling/, and the request without the trailing slash results in a 404.

I started looking around in the WebClient.cs and DataManager.cs files, but I don't really want to change these. Any ideas?

Verified Answer

Verified by arachnode.net

This is one place where I felt I needed to make a concession for duplicate WebPages. It also serves as a bugfix for one condition in the .NET Uri parsing, for which I have an open Connect bug that was never addressed. For example, http://www.jenkinskling.com/jenkinskling/ and http://www.jenkinskling.com/jenkinskling are the same page on 99% of WebServers.

If you want to change this, look in DiscoveryManager.cs and find:

    if (Uri.TryCreate(match.Groups["HyperLink"].Value.TrimEnd('/'), UriKind.RelativeOrAbsolute, out hyperLinkDiscovery))

Remove the TrimEnd('/') portion.
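
To see what the TrimEnd('/') call does to a discovered link, here is a minimal standalone sketch. It is not arachnode.net code: the class TrailingSlashDemo and the helper ParseHyperLink are hypothetical names made up for illustration, and only Uri.TryCreate and String.TrimEnd come from the .NET framework. It parses the same link with and without trimming and shows that the two results are different Uris.

```csharp
using System;

public static class TrailingSlashDemo
{
    // Hypothetical helper (not part of arachnode.net): parses a discovered
    // hyperlink, optionally stripping trailing slashes first, which mirrors
    // the TrimEnd('/') call quoted from DiscoveryManager.cs.
    public static Uri ParseHyperLink(string value, bool trimTrailingSlash)
    {
        string candidate = trimTrailingSlash ? value.TrimEnd('/') : value;

        Uri result;
        return Uri.TryCreate(candidate, UriKind.RelativeOrAbsolute, out result)
            ? result
            : null;
    }

    public static void Main()
    {
        const string discovered = "http://www.jenkinskling.com/jenkinskling/";

        Uri trimmed = ParseHyperLink(discovered, true);
        Uri untouched = ParseHyperLink(discovered, false);

        // Trimming drops the slash, so the crawler would request a different path.
        Console.WriteLine(trimmed.AbsoluteUri);   // http://www.jenkinskling.com/jenkinskling
        Console.WriteLine(untouched.AbsoluteUri); // http://www.jenkinskling.com/jenkinskling/

        // .NET treats the two as distinct Uris because the paths differ.
        Console.WriteLine(trimmed.Equals(untouched)); // False
    }
}
```

With trimming in place the two forms collapse into one Discovery, which is the duplicate-page concession described above. Without it, both forms can be discovered and crawled separately, so removing the call trades fewer 404s on sites like this one for possible duplicate WebPages elsewhere.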

As an alternative, you can create a plugin with the code from DiscoveryManager.cs and add to the Discoveries that way.

Thanks!
Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies


I can take a look at this later today.  It should be an easy fix.





copyright 2004-2013, arachnode.net LLC