First off, thanks for this solution, and nice work. I've read through the documentation and forums and would like some advice on the best way to crawl a commercial website like www.homedepot.com to gather product info such as SKUs and landing pages. Here's what I've done so far, but it doesn't seem to be working: the crawl stops prematurely.
1. I created a new entry in the CrawlRequests table specifying the URI (www.homedepot.com), level 4, and crawling restricted to the starting domain.
I only want to gather product info, not images or anything else.
Any advice would be most useful.
Thanks!
Thanks! (We are very, very close on a new build as well... :))
This sounds correct.
1.) When you say the crawl is ending prematurely, what do you mean?
2.) Are there any exceptions in the Exceptions table?
3.) For restricting what you are crawling, have you found the Configuration table?
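To make the intended scope concrete (level 4, starting domain only, no images), here is a minimal Python sketch of that kind of filter. It is not arachnode.net code and does not reflect how the Configuration table works. The function names, the extension list, the reading of "level" as link depth, and the choice to treat "www." and the bare domain as the same host are all assumptions for illustration.

```python
from urllib.parse import urlparse

# Hypothetical list of file extensions to skip (assumption, not from arachnode.net).
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".bmp", ".ico", ".svg", ".webp")


def normalize_host(host):
    """Lower-case the host and drop a leading 'www.' so both forms compare equal."""
    host = (host or "").lower()
    return host[4:] if host.startswith("www.") else host


def should_crawl(url, start_url, depth, max_depth=4):
    """Return True if url is in scope: http(s), within max_depth,
    on the same host as start_url, and not an image file.

    Both URLs must include a scheme (e.g. 'http://'), otherwise
    urlparse cannot identify the host.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    if depth > max_depth:
        return False
    start_host = normalize_host(urlparse(start_url).hostname)
    if normalize_host(parsed.hostname) != start_host:
        return False
    if parsed.path.lower().endswith(IMAGE_EXTENSIONS):
        return False
    return True
```

For example, with a start URL of `http://www.homedepot.com`, a product page at depth 2 on the same host passes the filter, while an image URL, a link to another domain, or anything found at depth 5 is rejected.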
- Mike
An open source .NET web crawler written in C# using SQL 2005/2008.