arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE
An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop

Problem with uri with / at the end

pp.ps posted on Wed, Nov 17 2010 8:40 AM | Locked
This post has been deleted.

All Replies


The .NET Uri class and some (but not all) browsers seem to disagree about how to combine a relative Uri with a base directory that ends in a trailing slash. I have a ticket open with MS Connect on this...
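To make the trailing-slash sensitivity concrete, here is a minimal sketch of standard (RFC 3986 style) relative resolution. It uses Python's `urllib.parse.urljoin` purely as a stand-in, not the .NET Uri class, and it does not reproduce the specific .NET/browser discrepancy mentioned above. The relative link `comments` is a hypothetical example.

```python
from urllib.parse import urljoin

# Hypothetical relative link found on a crawled page.
relative = "comments"

# Base ends in "/": the last segment is treated as a directory and kept.
with_slash = urljoin("http://www.tmz.com/page/12/", relative)

# Base has no trailing "/": the last segment ("12") is replaced.
without_slash = urljoin("http://www.tmz.com/page/12", relative)

print(with_slash)     # http://www.tmz.com/page/12/comments
print(without_slash)  # http://www.tmz.com/page/comments
```

The same relative link resolves to two different absolute addresses depending only on the trailing slash of the base, which is why any disagreement over how the base is interpreted changes what a crawler discovers.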

More to come...

Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet


OK. Some AbsoluteUri normalization has to be put in place.

In a browser, http://www.tmz.com/page/12/ and http://www.tmz.com/page/12 lead to the same page, but to AN (arachnode.net) they look like two separate and distinct AbsoluteUris, which means the page could be crawled twice.

I am still checking on one other piece of this puzzle.
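One possible shape for that normalization, as a rough sketch only: this is not AN's actual implementation, it is written in Python rather than C# for brevity, and the function name `normalize_absolute_uri` and the `seen` set are hypothetical. It assumes that a trailing slash on a non-root path never changes which page is served, which holds for the example above but is not guaranteed for every site.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_absolute_uri(uri: str) -> str:
    """Build a de-duplication key for an absolute URI (illustrative heuristic)."""
    parts = urlsplit(uri)
    path = parts.path
    # Strip trailing slashes from non-root paths: "/page/12/" -> "/page/12".
    if len(path) > 1 and path.endswith("/"):
        path = path.rstrip("/")
    # Treat an empty path as the root so "http://host" and "http://host/" match.
    if not path:
        path = "/"
    # Drop the fragment; it is not sent to the server.
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))


# Track what has already been queued, keyed by the normalized form.
seen = set()
to_crawl = []
for uri in ["http://www.tmz.com/page/12/", "http://www.tmz.com/page/12"]:
    key = normalize_absolute_uri(uri)
    if key in seen:
        continue  # same page under a different spelling; skip it
    seen.add(key)
    to_crawl.append(uri)

print(to_crawl)  # ['http://www.tmz.com/page/12/']
```

The normalized string is used only as a lookup key. The original AbsoluteUri is kept for the request and as the base for resolving relative links, since removing the trailing slash would change how those links resolve.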



copyright 2004-2017, arachnode.net LLC