arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop

Manipulate URI before checking Cache and DB

JCrawl posted on Sun, Jul 5 2015 4:21 PM

Hello

Is there a place where I can manipulate an incoming URI before the crawler checks whether it is in the cache?

For example:

http://someuri/date=today

http://someuri/date=FiveMinuteLater

I do not care that the page is five minutes older; I can get the same data by simply using:

http://someuri/

So instead of processing http://someuri/date=today or any other variation, I would like to trim the URI before the crawler checks whether it is in the cache / DB.

Let me know if this is possible.

All Replies

Look at the AbsoluteUri.cs CrawlRule.  You can filter on QueryStrings.
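
As a rough illustration of the trimming itself (not arachnode.net code, and not the contents of AbsoluteUri.cs), here is a minimal Python sketch of stripping the query string so that every variant maps to one cache key. It assumes the date part is a real query string (http://someuri/?date=today); `trim_for_cache_lookup` is a hypothetical name, and the same logic would have to be ported to C# wherever you hook into the crawl.

```python
from urllib.parse import urlsplit, urlunsplit


def trim_for_cache_lookup(absolute_uri: str) -> str:
    """Hypothetical helper: return the URI without its query string or fragment."""
    parts = urlsplit(absolute_uri)
    # Keep scheme, host and path; drop the query string and fragment.
    # An empty path is normalized to "/" so both spellings share one key.
    return urlunsplit((parts.scheme, parts.netloc, parts.path or "/", "", ""))


# Usage: both variants collapse to the same key, so only one is processed.
seen = set()
for uri in ["http://someuri/?date=today", "http://someuri/?date=FiveMinuteLater"]:
    key = trim_for_cache_lookup(uri)
    if key not in seen:
        seen.add(key)
```

Dropping the whole query string is the bluntest option; if other parameters matter, you would remove only the date parameter instead.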

Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet


copyright 2004-2017, arachnode.net LLC