arachnode.net v2.0
An open source .NET web crawler written in C# using SQL 2005/2008

Disallowed Directories


Top 50 Contributor
Female
Posts 4
Anat Posted: 12-13-2009 7:54 AM

Hi Mike,

Following our conversation, I think it would be a good idea to add a feature that limits the crawl to a specific directory.

For example, if I have a crawl request with "http://edition.cnn.com/US/" as the URI, I want results only from this directory down (e.g. http://edition.cnn.com/2009/US/12/11/.....), not from elsewhere on the site, such as "http://edition.cnn.com/ASIA/".

Hope it's clear enough :)

Thanks,

Anat.

Top 10 Contributor
Posts 155

In the meantime, you can create a simple rule that verifies the link is in this directory: a simple search using the String object.

Top 10 Contributor
Male
Posts 920

Hey megetron -

Can you be more specific, please?

Mike

An open source .NET web crawler written in C# using SQL 2008.

Join the arachnode.net group on Facebook: http://www.facebook.com/groups.php?ref=sb#/group.php?gid=166721755872

Twitter: http://twitter.com/arachnode_net

arachnode.net provides custom crawling and contracting resources. Please ask.

http://bit.ly/TOFX4

Top 10 Contributor
Male
Posts 101

I think you are proposing a crawl rule that just does a string match against the URI being walked to make sure it is in the /us/ folder, yes?

Top 10 Contributor
Male
Posts 920

Yes, this could be accomplished with a CrawlRule. 

I think Anat is asking to add another Disallowed*, like DisallowedDomains, etc. Cool feature, I think. Easy enough to set 'IsDisallowed' in a CrawlRule, though.
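
For anyone reading later, the string check discussed above could be sketched roughly as follows. This is plain C# using only System.Uri. DirectoryFilter, IsUnderDirectory, and HasPathSegment are hypothetical names, not part of arachnode.net, and wiring the result into a CrawlRule's IsDisallowed is left out because that depends on the arachnode.net version in use. The second method is there because a URL such as http://edition.cnn.com/2009/US/12/11/story.html (story.html is a made-up file name) does not start with /US/, so a plain prefix test would reject it.

```csharp
using System;

// Hypothetical helper for illustration only; not part of arachnode.net.
public static class DirectoryFilter
{
    // True when candidate is on the same host as root and its path
    // begins with root's directory path (for example "/US/").
    public static bool IsUnderDirectory(Uri root, Uri candidate)
    {
        if (!string.Equals(root.Host, candidate.Host, StringComparison.OrdinalIgnoreCase))
        {
            return false;
        }

        // Make sure the root path ends with "/" so "/US" does not match "/USA/".
        string rootPath = root.AbsolutePath;
        if (!rootPath.EndsWith("/", StringComparison.Ordinal))
        {
            rootPath += "/";
        }

        return candidate.AbsolutePath.StartsWith(rootPath, StringComparison.OrdinalIgnoreCase);
    }

    // Looser variant: true when any path segment equals directoryName,
    // which also accepts paths such as "/2009/US/12/11/...".
    public static bool HasPathSegment(Uri candidate, string directoryName)
    {
        foreach (string segment in candidate.Segments)
        {
            if (string.Equals(segment.Trim('/'), directoryName, StringComparison.OrdinalIgnoreCase))
            {
                return true;
            }
        }

        return false;
    }
}
```

With the root http://edition.cnn.com/US/, IsUnderDirectory accepts only paths that begin with /US/, while HasPathSegment(uri, "US") also accepts the dated /2009/US/ style of path. Both reject http://edition.cnn.com/ASIA/. Which one fits depends on how the site lays out its URLs.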


Top 10 Contributor
Posts 155

Kevin wrote:

"I think you are proposing a crawl rule that just does a string match against the URI being walked to make sure it is in the /us/ folder, yes?"

Yes. In any case, it is a cool feature.

Top 10 Contributor
Male
Posts 920

I think I can squeeze this in when I get back from vacation.

Rather easy to accomplish with a CrawlRule (CR), though.


Top 50 Contributor
Female
Posts 4

Hey Mike,

Any news on the Disallowed Directories feature mentioned above?

Thanks.

Top 10 Contributor
Male
Posts 920

Yes - you can easily accomplish this with a CrawlRule.

See here for links on how to create a CrawlRule: http://arachnode.net/search/SearchResults.aspx?q=CrawlRules

 



copyright 2009, arachnode.net LLC
