arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE or MongoDB/RavenDB/Hadoop

Completely Open Source @ GitHub


filter by URL


megetron posted on Sun, Aug 2 2009 11:35 PM

Hi all,

What is the quickest way to filter by URL? Does arachnode.net use RegEx expressions?

For example, let's say I wish to crawl a site called www.mydomain.com, but only specific pages matching this syntax: www.mydomain.com/category.aspx?id=500

and www.mydomain.com/special.aspx?cat=47

All pages whose URLs follow this structure should be crawled, and the rest should not.

Is there a way to do this through the database, or will I have to filter in code? If so, where should I do it? Would a plugin be a good place?

Thanks.

All Replies


Check out the CrawlRule AbsoluteUri.cs.  Find all references to it and read the code.  This rule shows how arachnode.net can parse and filter AbsoluteUris before and after crawling, and how to completely ignore those that don't conform to your rules.
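To illustrate the kind of allow-list check being asked for, here is a minimal, self-contained sketch. It is not the actual CrawlRule API — the class and method names are hypothetical — but it shows the shape of a rule that allows only the two URL structures from the question and disallows everything else:

```csharp
using System;

// Hypothetical stand-in for an arachnode.net CrawlRule: allow only
// www.mydomain.com URLs matching the two patterns from the question
// (category.aspx?id=... and special.aspx?cat=...); disallow the rest.
public static class UriAllowList
{
    public static bool IsAllowed(string absoluteUri)
    {
        Uri uri;
        if (!Uri.TryCreate(absoluteUri, UriKind.Absolute, out uri))
        {
            return false; // unparseable URIs are never crawled
        }

        if (!uri.Host.Equals("www.mydomain.com", StringComparison.OrdinalIgnoreCase))
        {
            return false;
        }

        // Compare the already-parsed parts rather than re-parsing the string.
        string path = uri.AbsolutePath;
        string query = uri.Query; // includes the leading '?'

        if (path.Equals("/category.aspx", StringComparison.OrdinalIgnoreCase))
        {
            return query.StartsWith("?id=", StringComparison.OrdinalIgnoreCase);
        }
        if (path.Equals("/special.aspx", StringComparison.OrdinalIgnoreCase))
        {
            return query.StartsWith("?cat=", StringComparison.OrdinalIgnoreCase);
        }

        return false;
    }
}
```

So `UriAllowList.IsAllowed("http://www.mydomain.com/category.aspx?id=500")` is true, while any other host or page is rejected.  In arachnode.net itself the equivalent logic would live in a CrawlRule so disallowed AbsoluteUris are skipped before they are requested.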

AN does use regular expressions, but a better approach is to examine the AbsoluteUri's component parts, since the Uri will already be parsed.  Using a RegEx to re-parse the AbsoluteUri won't be the fastest option.  Check out AbsoluteUri.cs, line 293.
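To make the comparison concrete, here is a sketch of the two approaches side by side (the class, method names, and patterns are illustrative, not arachnode.net code): the RegEx version must re-scan the full string on every call, while the Uri-parts version is just a couple of string comparisons on components the crawler has already split out.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FilterComparison
{
    // RegEx approach: re-parses the whole AbsoluteUri string on every call.
    private static readonly Regex AllowedPattern = new Regex(
        @"^https?://www\.mydomain\.com/(category\.aspx\?id=\d+|special\.aspx\?cat=\d+)$",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static bool MatchesByRegex(string absoluteUri)
    {
        return AllowedPattern.IsMatch(absoluteUri);
    }

    // Uri-parts approach: the crawler already holds a parsed Uri, so the
    // check reduces to comparisons against pre-split components.
    public static bool MatchesByUriParts(Uri uri)
    {
        return uri.Host.Equals("www.mydomain.com", StringComparison.OrdinalIgnoreCase)
            && (uri.AbsolutePath.Equals("/category.aspx", StringComparison.OrdinalIgnoreCase)
                || uri.AbsolutePath.Equals("/special.aspx", StringComparison.OrdinalIgnoreCase));
    }
}
```

Both return the same verdict for the URLs in the question; the difference is that the second form does no re-parsing, which matters when a rule runs against every discovered AbsoluteUri.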

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet


copyright 2004-2017, arachnode.net LLC