arachnode.net
An open source .NET web crawler written in C# using SQL 2005/2008

List of Questions

Jay Stevens posted on 15 May 2009 6:52 PM

Ok.  Almost completely up and running with VS2008 and SQL2008.  Here are a few questions that I can't seem to track down myself.

 

  1. How do I stop the system from crawling arachnode.net along with whatever I specify?  I love you guys but I can only crawl the site so many times.
  2. What's the best way to crawl a specific list of sites?  I think this is accomplished by adding them to the IsDisallowedAbsoluteURI list and then modifying the Negate attribute?
  3. For those crawled sites, I don't want to follow off-links to other domains.  I thought that setting restricttoHost = true would fix that.  It doesn't seem to, as one of the sites I was crawling had a reference to Wikipedia, and the next thing I knew I was crawling Wikipedia.
  4. How do I get the system to extract and store the source/text from the pages into Webpages_Metadata (which is needed for the TermExtraction.dtsx)?  I've set the configuration in the DB, but that doesn't seem to work.
  5. How do I modify the Crawl so it doesn't automatically reject a webpage just because it has a query string in it?  I think this is in the CrawlActions under AbsoluteURI?

I'm getting closer.  Any help is appreciated.

J

 

All Replies


Hey J -

It's late for me - all good questions... let me get back to you over the weekend!

But, good news - Version 1.2 is close... I'm hoping that within a week we'll have it ready!!!

Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

An open source .NET web crawler written in C# using SQL 2005/2008.

Twitter: http://twitter.com/arachnode_net

arachnode.net provides custom crawling and contracting resources.  Please ask.

C# crawler, C# web crawler, C# site crawler


1.) Check out the stored procedure dbo.arachnode_usp_arachnode.net_RESET_DATABASE and remove the line shown below (I have removed this from the SP that will be part of Version 1.2):

EXEC dbo.arachnode_omsp_CrawlRequests_INSERT @Datetime, 'http://arachnode.net/', 4, 0, 1, null

2.) That would work.  An even easier way would be to submit CrawlRequests with a high depth, say '100', and restrict crawling to CrawlRequests and WebPages.

3.) I think this actually may be a bug that I fixed for the upcoming Version 1.2 release.  :(  (Sorry...)  If you stop a crawl it doesn't check when saving the requests back to the DB.

4.) Check out the 'Integration' project in the solution.  This is where the terms get split.  The switch you found in the DB is for extracting Text and XML from the WebPages (inserted into WebPages_MetaData) and for enabling creation of an HtmlAgilityPack HtmlDocument.  (The HtmlDocument functionality is very memory intensive.)

5.) Yes - disallowQueryStrings.

<?xml version="1.0" encoding="utf-8" ?>
<crawlRules>
<rule assemblyName="Arachnode.SiteCrawler" typeName="Arachnode.SiteCrawler.Rules.AbsoluteUri" isEnabled="true" order="1" ruleType="3" outputIsDisallowedReason="true" disallowNamedAnchors="true" disallowQueryStrings="false" ...
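
For anyone trying to picture what answers 3 and 5 amount to, here is a rough, language-neutral sketch (written in Python for brevity) of the kind of check an absolute-URI rule might perform. It is not arachnode.net's code: the function name, its parameters, and the reason strings are all hypothetical, and only the flag names loosely mirror the restrict-to-host setting and the disallowQueryStrings / disallowNamedAnchors attributes mentioned above.

```python
from urllib.parse import urlsplit


def disallowed_reason(uri, seed_host,
                      restrict_to_host=True,
                      disallow_query_strings=False,
                      disallow_named_anchors=True):
    """Return a short reason if the URI should be skipped, else None.

    Hypothetical helper for illustration only; the real crawler's rule
    classes may work quite differently.
    """
    parts = urlsplit(uri)
    host = (parts.hostname or "").lower()

    # Stay on the host the crawl started from (no hops to other domains).
    if restrict_to_host and host != seed_host.lower():
        return "off-host"

    # Reject URIs that carry a query string, e.g. /forums/?page=2
    if disallow_query_strings and parts.query:
        return "query string"

    # Reject URIs that carry a named anchor, e.g. /page#top
    if disallow_named_anchors and parts.fragment:
        return "named anchor"

    return None
```

With disallow_query_strings left at False, a URI such as http://example.com/forums/?page=2 passes the check, while a link to a different host is rejected whenever restrict_to_host is on.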



copyright 2004-2010, arachnode.net LLC
