Hi Mike,
Following our conversation, I think it would be a good idea to add a feature that limits the crawl to a specific directory.
For example, if I have a crawl request with "http://edition.cnn.com/US/" as the URI, I want only results from this directory down (e.g. http://edition.cnn.com/2009/US/12/11/.....) and not from elsewhere on the site, such as "http://edition.cnn.com/ASIA/".
Hope it's clear enough.
Thanks,
Anat.
In the meantime, you can create a simple rule that verifies the link is in this directory. A simple search using the String object will do.
Hey megetron -
Can you be more specific, please?
Mike
An open source .NET web crawler written in C# using SQL 2008.
Join the arachnode.net group on Facebook: http://www.facebook.com/groups.php?ref=sb#/group.php?gid=166721755872
Twitter: http://twitter.com/arachnode_net
arachnode.net provides custom crawling and contracting resources. Please ask.
http://bit.ly/TOFX4
I think you are proposing a crawl rule that just does a string match against the URI being walked to make sure it is in the /us/ folder, yes?
Yes, this could be accomplished with a CrawlRule.
I think ANAT is asking to add another Disallowed*, like DisallowedDomains, etc. Cool feature, I think. Easy enough to set 'IsDisallowed' in a CrawlRule though.
Kevin: I think you are proposing a crawl rule that just does a string match against the URI being walked to make sure it is in the /us/ folder, yes?
Yes. The feature is a cool one in any case.
I think I can squeeze this in when I get back from vacation.
Rather easy to accomplish with a CR though.
Hey Mike,
Any news on the Disallowed Directories feature mentioned above?
Thanks.
Yes - you can easily accomplish this with a CrawlRule.
See here for links on how to create a CrawlRule: http://arachnode.net/search/SearchResults.aspx?q=CrawlRules
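For anyone landing on this thread, the string check discussed above might look roughly like the sketch below. It is only the matching logic, not an actual CrawlRule: DirectoryFilter and IsInDirectory are hypothetical names and are not part of arachnode.net. Note that the example URI from the first post has /US/ deeper in the path (/2009/US/12/11/), so the sketch searches for the directory anywhere in the path on the same host instead of requiring it as a prefix. The case-insensitive comparison is also an assumption.

```csharp
using System;

// Hypothetical helper for illustration only; not part of arachnode.net.
public static class DirectoryFilter
{
    // Returns true when candidateUri is on the given host and its path
    // contains the given directory segment (e.g. "/US/").
    public static bool IsInDirectory(string candidateUri, string host, string directory)
    {
        Uri uri;
        if (!Uri.TryCreate(candidateUri, UriKind.Absolute, out uri))
        {
            // Not a well-formed absolute URI, so treat it as outside the directory.
            return false;
        }

        // Stay on the same site.
        if (!string.Equals(uri.Host, host, StringComparison.OrdinalIgnoreCase))
        {
            return false;
        }

        // Simple substring search on the path, ignoring case (an assumption).
        return uri.AbsolutePath.IndexOf(directory, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}

public static class Program
{
    public static void Main()
    {
        // Prints "True": the path /2009/US/12/11/ contains /US/.
        Console.WriteLine(DirectoryFilter.IsInDirectory(
            "http://edition.cnn.com/2009/US/12/11/", "edition.cnn.com", "/US/"));

        // Prints "False": the path /ASIA/ does not contain /US/.
        Console.WriteLine(DirectoryFilter.IsInDirectory(
            "http://edition.cnn.com/ASIA/", "edition.cnn.com", "/US/"));
    }
}
```

Inside a CrawlRule, a false result from a check like this is where you would set 'IsDisallowed', as mentioned above. The exact wiring depends on the CrawlRule API, so see the links in the previous post for that.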