arachnode.net
An open source .NET web crawler written in C# using SQL 2005/2008

Getting Started


mcnisiv posted on 31 May 2009 9:32 PM

First off, thanks for this solution, and nice work.  I've read through the documentation and the forums, and I'd like some advice on the best way to crawl a commercial website such as www.homedepot.com to gather product information such as SKUs and landing pages.  Here's what I've done so far; it doesn't seem to be working, and the crawl stops prematurely.

1. I created a new entry in the CrawlRequests table specifying the URI (www.homedepot.com), level 4, and crawling restricted to the starting domain.

I only want to gather product information, not images or anything else.
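
To make the intended restrictions concrete, here is a minimal sketch in Python of the filter I have in mind. It is not arachnode.net code and does not reflect its schema or API: the function name, the constants, the list of skipped extensions, and the example product path are all hypothetical, and "level 4" is assumed to mean a maximum link depth of 4.

```python
from urllib.parse import urlparse

# Hypothetical values for illustration only; not arachnode.net settings.
START_URI = "http://www.homedepot.com/"
MAX_DEPTH = 4  # assumption: "level 4" means at most 4 links from the start page
SKIPPED_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".css", ".js")


def should_crawl(uri: str, depth: int, start_uri: str = START_URI) -> bool:
    """Return True if uri is within the depth limit, on the starting host,
    and does not look like an image or other static asset."""
    if depth > MAX_DEPTH:
        return False
    target = urlparse(uri)
    start = urlparse(start_uri)
    if target.scheme not in ("http", "https"):
        return False
    # Stay on the starting host only (no subdomains or external sites).
    if (target.hostname or "").lower() != (start.hostname or "").lower():
        return False
    # Skip images and other non-page resources by file extension.
    return not target.path.lower().endswith(SKIPPED_EXTENSIONS)


# Example with a made-up product path:
print(should_crawl("http://www.homedepot.com/p/some-product", 2))
print(should_crawl("http://www.homedepot.com/logo.png", 1))
```

Filtering by extension is only a rough heuristic; checking the response content type would be more reliable, but it requires a request for each link.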

 

Any advice would be most useful.

 

Thanks!

All Replies


Thanks!  (We are very, very close on a new build as well... :))

This sounds correct.

1.) When you say the crawl is ending prematurely, what do you mean?

2.) Are there any exceptions in the Exceptions table?

3.) For restricting what you are crawling, have you found the Configuration table?

- Mike

 

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Twitter: http://twitter.com/arachnode_net

arachnode.net provides custom crawling and contracting resources.  Please ask.


copyright 2004-2010, arachnode.net LLC
