arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE, MongoDB, RavenDB, or Hadoop

Completely Open Source @ GitHub


rewrite URL


Top 10 Contributor
229 Posts
megetron posted on Sat, Aug 8 2009 4:58 AM

Hello,

How does arachnode.net behave when crawling websites that rewrite URLs?

The same page can possess several names under URL rewriting, and so the crawl of the site takes longer and produces multiple records in the database: webpages, images, files, and so on.

And another thing I have in mind:

What do we do when there is one URL for the page and a second URL for printing? For example, "mydomain.com/page1.aspx" and "mydomain.com/page1.aspx?print=true". How do we deal with these pages and remove them from the crawl using the AbsoluteUri rules?
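For illustration, this is the kind of filter I mean; the IsAllowed hook below is a hypothetical sketch, not arachnode.net's actual API:

    using System;

    public static class PrintVariantFilter
    {
        // Hypothetical predicate: disallow any AbsoluteUri carrying a
        // "print=true" query-string parameter, so that only the canonical
        // page (mydomain.com/page1.aspx) is crawled.
        public static bool IsAllowed(Uri absoluteUri)
        {
            // Uri.Query is "" when there is no query string; otherwise it
            // includes the leading '?'.
            return absoluteUri.Query.IndexOf("print=true", StringComparison.OrdinalIgnoreCase) < 0;
        }
    }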

All Replies

Top 10 Contributor
1,905 Posts

All AbsoluteUris must pass through one of the constructors of Discovery.cs.
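As a rough illustration of why that centralization matters, a constructor can normalize and filter every AbsoluteUri in one place; the types below are a sketch, not the actual Discovery.cs:

    using System;

    public class Discovery
    {
        public Uri AbsoluteUri { get; private set; }

        // Sketch only: because every AbsoluteUri flows through a constructor
        // like this one, normalization (and filtering) happens in exactly one
        // place for the entire crawler.
        public Discovery(string absoluteUri)
        {
            Uri uri = new Uri(absoluteUri, UriKind.Absolute);

            // Strip named anchors so page.html and page.html#top become the
            // same Discovery.
            UriBuilder builder = new UriBuilder(uri) { Fragment = string.Empty };

            AbsoluteUri = builder.Uri;
        }
    }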

...be right back...

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
229 Posts

So what you are saying is that I will have to manually detect the fake AbsoluteUris?

Does arachnode.net hold the original URIs from before the URL-rewrite action?

Top 10 Contributor
1,905 Posts

megetron:

So what you are saying is that I will have to manually detect the fake AbsoluteUris?

Does arachnode.net hold the original URIs from before the URL-rewrite action?

No. Check out AbsoluteUri.cs. This CrawlRule takes care of QueryStrings and NamedAnchors. You just want to filter them, not rewrite or destroy them. If they are disallowed, they will be stored in the DisallowedAbsoluteUris table when 'InsertDisallowedAbsoluteUris' is set to 'true'.
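Roughly, the rule behaves like this sketch (assumed names and shapes; the real AbsoluteUri.cs may differ):

    using System;

    public class AbsoluteUriCrawlRule
    {
        // Sketch of the filtering described above: an AbsoluteUri is
        // disallowed when it carries a query string or a named anchor.
        public bool IsDisallowed(Uri absoluteUri)
        {
            bool hasQueryString = !string.IsNullOrEmpty(absoluteUri.Query);
            bool hasNamedAnchor = !string.IsNullOrEmpty(absoluteUri.Fragment);

            return hasQueryString || hasNamedAnchor;
        }
    }

    // e.g. IsDisallowed(new Uri("http://mydomain.com/page1.aspx?print=true")) == true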

Submit a URL with a query string and it will be placed in the DisallowedAbsoluteUris table under the default configuration.

-Mike


Top 10 Contributor
229 Posts

I am not sure we got to the bottom of this :]

Lots of web developers today use components that help them rewrite the URLs of their own websites. For example:
http://www.simple-talk.com/dotnet/asp.net/a-complete-url-rewriting-solution-for-asp.net-2.0/

Let's say I have a page called http://mydomain.com/page.html?id=9, and with these controls I can rewrite the URL to a new name: http://mydomain.com/9_nice_title.html, which is a very catchy name for the page.

The problem is that this page is accessible at both of the URLs above: two URIs for one page. There are sites that serve even three URIs for the same page.

In arachnode.net I can see that it crawls each URI once. That means it will crawl the same page more than once, because there are different names for the same page, "thanks" to that control.

What I am asking is: how do we prevent this repeated crawling of the same page?

In the DisallowedAbsoluteUris table I cannot find either of these two URIs, which is obvious, because both URIs are legal.
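For what it's worth, one generic way to catch this (not something arachnode.net is documented to do here) is to hash the fetched content itself, so that two URIs serving byte-identical pages are only processed once; a minimal sketch:

    using System;
    using System.Collections.Generic;
    using System.Security.Cryptography;
    using System.Text;

    public class ContentDeduplicator
    {
        private readonly HashSet<string> _seenHashes = new HashSet<string>();

        // Returns true the first time a page body is seen; false for any
        // later URI serving identical content, e.g. /page.html?id=9 and
        // /9_nice_title.html.
        public bool IsNewContent(string pageBody)
        {
            using (SHA1 sha1 = SHA1.Create())
            {
                byte[] hash = sha1.ComputeHash(Encoding.UTF8.GetBytes(pageBody));

                return _seenHashes.Add(Convert.ToBase64String(hash));
            }
        }
    }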

 


copyright 2004-2017, arachnode.net LLC