
Disallowed Directories


Anat Posted: Sun, Dec 13 2009 7:54 AM

Hi Mike,

Following our conversation, I think it might be a good idea to add a feature that limits the crawl to a specific directory.

For example, if I have a crawl request with "http://edition.cnn.com/US/" as the URI, I want only results from that directory down (i.e. http://edition.cnn.com/2009/US/12/11/.....), but not from anywhere else on the site, like "http://edition.cnn.com/ASIA/".

Hope it's clear enough. :)

Thanks,

Anat.

megetron replied:

In the meantime, you can create a simple rule that verifies the link is in this directory: a simple string search using the String object.
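
For illustration, a minimal sketch of that string test; the class and method names here are hypothetical, and only the IndexOf comparison is the point:

    using System;

    class DirectoryCheck
    {
        // True when the URI contains the required path segment (e.g. "/US/").
        // IndexOf with a StringComparison gives a case-insensitive match.
        static bool IsInDirectory(string absoluteUri, string directorySegment)
        {
            return absoluteUri.IndexOf(directorySegment, StringComparison.OrdinalIgnoreCase) >= 0;
        }

        static void Main()
        {
            Console.WriteLine(IsInDirectory("http://edition.cnn.com/2009/US/12/11/story.html", "/US/")); // True
            Console.WriteLine(IsInDirectory("http://edition.cnn.com/ASIA/story.html", "/US/"));          // False
        }
    }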

Mike replied:

Hey megetron -

Can you be more specific, please?

Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Kevin replied on Wed, Jan 6 2010 11:16 AM

I think you are proposing a crawl rule that just does a string match against the URI being walked to make sure it is in the /us/ folder, yes?

Mike replied:

Yes, this could be accomplished with a CrawlRule.

I think Anat is asking to add another Disallowed* table, like DisallowedDomains, etc. Cool feature, I think. It's easy enough to set 'IsDisallowed' in a CrawlRule, though.
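
As a rough sketch of what such a rule might look like (the class shape and the IsDisallowed method are assumptions about the arachnode.net API, not taken from a release; adapt them to the signatures in your version):

    using System;

    // Assumed shape of a directory-restricting crawl rule; treat it as
    // pseudocode for whatever CrawlRule base class your version exposes.
    public class RestrictToDirectoryRule
    {
        private readonly string _requiredSegment;

        public RestrictToDirectoryRule(string requiredSegment)
        {
            _requiredSegment = requiredSegment; // e.g. "/US/"
        }

        // Return true to mark the URI as disallowed, mirroring the
        // 'IsDisallowed' flag mentioned above.
        public bool IsDisallowed(string absoluteUri)
        {
            return absoluteUri.IndexOf(_requiredSegment, StringComparison.OrdinalIgnoreCase) < 0;
        }
    }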


megetron replied:

Kevin:

I think you are proposing a crawl rule that just does a string match against the URI being walked to make sure it is in the /us/ folder, yes?

Yes. In any case, the feature is a cool one.

Mike replied:

I think I can squeeze this in when I get back from vacation.

Rather easy to accomplish with a CrawlRule, though.


Anat replied on Mon, Feb 8 2010 7:56 AM

Hey Mike,

Any news on the Disallowed Directories feature mentioned above?

Thanks.

Top 10 Contributor
Posts 1,905

Yes, you can easily accomplish this with a CrawlRule.

See here for links on how to create a CrawlRule: http://arachnode.net/search/SearchResults.aspx?q=CrawlRules

