arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE
An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop

Format absolute uri before insert to discovery

This post has 1 verified answer | 7 Replies | 2 Followers

Top 10 Contributor
30 Posts
pp.ps posted on Wed, Mar 10 2010 6:56 AM

How can I format an absolute URI before it is inserted as a discovery? For example, a lot of URLs carry a session ID variable that changes on every request, so each one is inserted as a hyperlink as if it were a new URL.

Verified Answer

Top 10 Contributor
1,905 Posts
Verified by arachnode.net

Sure there is.

Look at AbsoluteUri.cs... the overload that deals with Discoveries.  Follow this model and disallow discoveries based on the presence of your repeated session variables, or look to modify the CacheKey.
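
As a rough illustration of the "disallow based on the presence of session variables" idea, here is a minimal standalone sketch. It does not use any arachnode.net types: the class name `SessionVariableCheck`, the method `ContainsSessionVariable` and the list of parameter names are all assumptions made for this example, and wiring the check into AbsoluteUri.cs is left to you.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of arachnode.net.
public static class SessionVariableCheck
{
    // Assumed session parameter names; adjust for the sites you crawl.
    private static readonly HashSet<string> SessionParameterNames =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "sessionid", "sid", "phpsessid"
        };

    // Returns true when the query string carries one of the session parameters.
    public static bool ContainsSessionVariable(string absoluteUri)
    {
        Uri uri;
        if (!Uri.TryCreate(absoluteUri, UriKind.Absolute, out uri))
        {
            return false; // not an absolute URI, nothing to inspect
        }

        // Uri.Query includes the leading '?', or is empty when there is no query.
        string query = uri.Query.TrimStart('?');

        foreach (string pair in query.Split(new[] { '&' }, StringSplitOptions.RemoveEmptyEntries))
        {
            // Everything before the first '=' is the parameter name.
            string name = Uri.UnescapeDataString(pair.Split('=')[0]);
            if (SessionParameterNames.Contains(name))
            {
                return true;
            }
        }

        return false;
    }
}
```

With these assumed names, `SessionVariableCheck.ContainsSessionVariable("http://example.com/page.aspx?id=5&sid=abc123")` returns `true`, so a discovery like that one could be disallowed.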

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies

Top 10 Contributor
1,905 Posts

You could modify Discovery.cs or create a CrawlRule/CrawlAction to perform the formatting you desire.
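
For the formatting itself, something along these lines might be a starting point. It is only a sketch under assumptions: `SessionVariableFormatter`, `RemoveSessionVariables` and the parameter names are made up for this example and are not arachnode.net APIs, and where you call it from (Discovery.cs, a CrawlRule or a CrawlAction) depends on your setup.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of arachnode.net.
public static class SessionVariableFormatter
{
    // Assumed session parameter names; adjust for the sites you crawl.
    private static readonly HashSet<string> SessionParameterNames =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "sessionid", "sid", "phpsessid"
        };

    // Returns the URI with session parameters removed from the query string.
    // Input that is not an absolute URI is returned unchanged.
    public static string RemoveSessionVariables(string absoluteUri)
    {
        Uri uri;
        if (!Uri.TryCreate(absoluteUri, UriKind.Absolute, out uri))
        {
            return absoluteUri;
        }

        var kept = new List<string>();
        string query = uri.Query.TrimStart('?');

        foreach (string pair in query.Split(new[] { '&' }, StringSplitOptions.RemoveEmptyEntries))
        {
            string name = Uri.UnescapeDataString(pair.Split('=')[0]);
            if (!SessionParameterNames.Contains(name))
            {
                kept.Add(pair); // keep the original, still-escaped pair
            }
        }

        // UriBuilder adds the leading '?' itself when the query is non-empty.
        var builder = new UriBuilder(uri);
        builder.Query = string.Join("&", kept.ToArray());
        return builder.Uri.AbsoluteUri;
    }
}
```

With these assumed names, `RemoveSessionVariables("http://example.com/page.aspx?id=5&sid=abc123")` returns `http://example.com/page.aspx?id=5`, so two links that differ only in their session ID format to the same absolute URI.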

Top 10 Contributor
1,905 Posts

I'm not sure what you are asking.

Are you asking me what I think about your idea or are you asking me to write the plugin for you?

Top 10 Contributor
1,905 Posts

Yes.  Feel free to create a plugin.

copyright 2004-2017, arachnode.net LLC