arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE
An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop


Bug Report Update : Redirect functionality...


arachnode.net Posted: Thu, May 5 2011 7:28 PM

I recently received a bug report by email describing a crawl of the following AbsoluteUri:

http://home.nzcity.co.nz/go.aspx?u=http://www.nzdating.com/

The crawler follows the redirect to http://www.NZDating.com and finds links like http://www.nzdating.com/members/wo.aspx

It then appears to try to index

http://home.nzcity.co.nz/members/wo.aspx (the path /members/wo.aspx belongs to nzdating.com, not to home.nzcity.co.nz)

which should actually be

http://www.nzdating.com/members/wo.aspx
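
To illustrate how this kind of mix-up can arise, here is a minimal sketch using the standard System.Uri class. It assumes the page contains a root-relative link (/members/wo.aspx) and that the link is resolved against the originally requested AbsoluteUri instead of the AbsoluteUri the response actually came from. The class and method names are hypothetical and are not part of arachnode.net.

```csharp
using System;

// Hypothetical helper, for illustration only (not arachnode.net code).
public static class RedirectLinkResolution
{
    // Resolves a link found on a page against the given base Uri.
    // A root-relative href such as "/members/wo.aspx" replaces the
    // path and query of the base and keeps its scheme and host.
    public static Uri Resolve(Uri baseUri, string href)
    {
        return new Uri(baseUri, href);
    }
}
```

Resolving "/members/wo.aspx" against http://home.nzcity.co.nz/go.aspx?u=http://www.nzdating.com/ yields http://home.nzcity.co.nz/members/wo.aspx, while resolving it against http://www.nzdating.com/ yields http://www.nzdating.com/members/wo.aspx. The base Uri has to be the one the response was served from.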

I made a change to DataManager.cs and DiscoveryManager.cs that checks for redirects. If crawlRequest.WebClient.HttpWebResponse.StatusCode is 300, 301, 302 or 307, crawlRequest.Discovery is automatically updated to crawlRequest.WebClient.HttpWebResponse.ResponseUri.

If your application needs to track the submitted AbsoluteUri, check the crawlRequest.WebClient.HttpWebRequest.RequestUri property.
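
The check described above could be sketched roughly as follows. This is not the actual code from DataManager.cs or DiscoveryManager.cs; the class and method names are hypothetical, and the sketch works on plain Uri and HttpStatusCode values instead of a crawlRequest so that it stands alone.

```csharp
using System;
using System.Net;

// Hypothetical sketch of the redirect check (not arachnode.net code).
public static class RedirectHandling
{
    // True for the status codes named in the post: 300, 301, 302 and 307.
    public static bool IsRedirect(HttpStatusCode statusCode)
    {
        int code = (int)statusCode;
        return code == 300 || code == 301 || code == 302 || code == 307;
    }

    // Returns the Uri that discovered links should be attributed to:
    // the response Uri after a redirect, otherwise the request Uri.
    public static Uri EffectiveUri(Uri requestUri, Uri responseUri, HttpStatusCode statusCode)
    {
        if (IsRedirect(statusCode) && responseUri != null)
        {
            return responseUri;
        }
        return requestUri;
    }
}
```

With a 302 status, a request Uri of http://home.nzcity.co.nz/go.aspx?u=http://www.nzdating.com/ and a response Uri of http://www.nzdating.com/, EffectiveUri returns the response Uri; with a 200 status it returns the request Uri unchanged.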



For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet


copyright 2004-2017, arachnode.net LLC