arachnode.net
An open source .NET web crawler written in C# using SQL 2005/2008

Disallowed Directories

This post has 8 Replies | 4 Followers

Top 25 Contributor
Female
Posts 9
Anat Posted: 13 Dec 2009 7:54 AM

Hi Mike,

Following our conversation, I think it would be a good idea to add a feature that limits the crawl to a specific directory.

For example: if I have a crawl request with "http://edition.cnn.com/US/" as the URI, I want only results from this directory down (e.g. http://edition.cnn.com/2009/US/12/11/.....), not from elsewhere on the site, such as "http://edition.cnn.com/ASIA/".

Hope it's clear enough :)

Thanks,

Anat.

Top 10 Contributor
Posts 219

In the meantime, you can create a simple rule that verifies the link is in this directory: a simple search using the String object.
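
To make the string check concrete, here is a minimal sketch in plain Python rather than against the arachnode.net API. The function name is_in_directory and its parameters are hypothetical. It assumes the directory should match as a whole path segment anywhere in the path, because Anat's example article URL has /US/ after /2009/, so a pure prefix test against the crawl request URI would reject it. It also assumes a case-insensitive comparison, which may not suit every site.

```python
from urllib.parse import urlsplit


def is_in_directory(uri: str, host: str, directory: str) -> bool:
    """Return True if uri is on host and its path has directory as a whole segment.

    Hypothetical helper for illustration; not part of arachnode.net.
    """
    parts = urlsplit(uri)
    if (parts.hostname or "").lower() != host.lower():
        return False
    # Compare whole path segments so "US" does not match "/USA/" or "/BUSINESS/".
    segments = [s.lower() for s in parts.path.split("/") if s]
    return directory.lower() in segments
```

A bare substring search for "US" would also accept paths such as /USA/ or /BUSINESS/, which is why the sketch compares whole segments.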

Top 10 Contributor
Posts 1,244

Hey megetron -

Can you be more specific, please?

Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

An open source .NET web crawler written in C# using SQL 2005/2008.

Twitter: http://twitter.com/arachnode_net

arachnode.net provides custom crawling and contracting resources.  Please ask.

C# crawler, C# web crawler, C# site crawler

Top 10 Contributor
Male
Posts 101

I think you are proposing a crawl rule that just does a string match against the URI being walked to make sure it is in the /us/ folder, yes?

Top 10 Contributor
Posts 1,244

Yes, this could be accomplished with a CrawlRule. 

I think Anat is asking to add another Disallowed*, like DisallowedDomains, etc.  Cool feature, I think.  It is easy enough to set 'IsDisallowed' in a CrawlRule, though.
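
As a rough illustration of the "set 'IsDisallowed' in a rule" idea, the sketch below uses plain Python stand-ins. Discovery, DirectoryRule, is_disallowed and disallowed_reason are hypothetical names invented for this example; the real CrawlRule base class and its members in arachnode.net may look quite different, so treat this only as the shape of the logic.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class Discovery:
    """Hypothetical stand-in for a URI found during a crawl."""
    uri: str
    is_disallowed: bool = False
    disallowed_reason: str = ""


class DirectoryRule:
    """Hypothetical rule: disallow anything outside one directory of one host."""

    def __init__(self, host: str, directory: str) -> None:
        self.host = host.lower()
        self.directory = directory.lower()

    def apply(self, discovery: Discovery) -> Discovery:
        parts = urlsplit(discovery.uri)
        on_host = (parts.hostname or "").lower() == self.host
        # Whole-segment, case-insensitive match (an assumption of this sketch).
        segments = [s.lower() for s in parts.path.split("/") if s]
        if not (on_host and self.directory in segments):
            discovery.is_disallowed = True
            discovery.disallowed_reason = "Outside directory: " + self.directory
        return discovery
```

The rule only flags a discovery and records a reason; dropping or logging flagged URIs is left to whatever runs the rules.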

Top 10 Contributor
Posts 219

Kevin:

I think you are proposing a crawl rule that just does a string match against the URI being walked to make sure it is in the /us/ folder, yes?

Yes. In any case, the feature is a cool one.

Top 10 Contributor
Posts 1,244

I think I can squeeze this in when I get back from vacation.

It is rather easy to accomplish with a CrawlRule (CR), though.

Top 25 Contributor
Female
Posts 9

Hey Mike,

Any news on the Disallowed Directories feature mentioned above?

Thanks.

Top 10 Contributor
Posts 1,244

Yes - you can easily accomplish this with a CrawlRule.

See here for links on how to create a CrawlRule: http://arachnode.net/search/SearchResults.aspx?q=CrawlRules

 

copyright 2004-2010, arachnode.net LLC
