arachnode.net
An open source .NET web crawler written in C# using SQL 2005/2008

List of Questions

Jay Stevens posted on 15 May 2009 6:52 PM

Ok.  Almost completely up and running with VS2008 and SQL2008.  Here are a few questions that I can't seem to track down myself.

 

  1. How do I stop the system from crawling arachnode.net along with whatever I specify?  I love you guys but I can only crawl the site so many times.
  2. What's the best way to crawl a specific list of sites?  I think this is accomplished by adding them to the IsDisallowedAbsoluteURI list and then modifying the Negate attribute?
  3. For those crawled sites, I don't want to follow off-links to other domains.  I thought that setting restricttoHost = true would fix that, but it doesn't seem to: one of the sites I was crawling had a reference to Wikipedia, and the next thing I knew I was crawling Wikipedia.
  4. How do I get the system to extract and store the source/text from the pages into Webpages_Metadata (which is needed for TermExtraction.dtsx)?  I've set the configuration in the DB, but that doesn't seem to work.
  5. How do I modify the Crawl so it doesn't automatically reject a webpage just because it has a query string in it?  I think this is in the CrawlActions under AbsoluteURI?

I'm getting closer.  Any help is appreciated.

J

 

All Replies


Hey J -

It's late for me - all good questions... let me get back to you over the weekend!

But, good news - Version 1.2 is close... I'm hoping that within a week we'll have it ready!!!

Mike


Join the arachnode.net group on Facebook: http://www.facebook.com/groups.php?ref=sb#/group.php?gid=166721755872

Twitter: http://twitter.com/arachnode_net

arachnode.net provides custom crawling and contracting resources.  Please ask.

http://bit.ly/TOFX4

C# crawler, C# web crawler, C# site crawler


1.) Check out the stored procedure and remove the line shown below: (I have removed this from the SP that will be part of Version 1.2)

dbo.arachnode_usp_arachnode.net_RESET_DATABASE - EXEC dbo.arachnode_omsp_CrawlRequests_INSERT @Datetime, 'http://arachnode.net/', 4, 0, 1, null

2.) That would work.  An even easier way would be to submit CrawlRequests with a high depth, say '100' and restrict crawling to CrawlRequests and WebPages.

3.) I think this actually may be a bug that I fixed for the upcoming Version 1.2 release.  :(  (Sorry...)  If you stop a crawl it doesn't check when saving the requests back to the DB.

4.) Check out the 'Integration' project in the solution.  This is where the terms get split.  The switch you found in the DB is for extracting Text and XML from the WebPages (inserted into WebPages_MetaData) and for enabling creation of an HtmlAgilityPack HtmlDocument.  (The HtmlDocument functionality is very memory intensive.)

5.) Yes - disallowQueryStrings.

<?xml version="1.0" encoding="utf-8" ?>
<crawlRules>
  <rule assemblyName="Arachnode.SiteCrawler" typeName="Arachnode.SiteCrawler.Rules.AbsoluteUri" isEnabled="true" order="1" ruleType="3" outputIsDisallowedReason="true" disallowNamedAnchors="true" disallowQueryStrings="false" ...
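
For anyone following along, the two checks touched on in answers 3 and 5 (staying on the seed host, and rejecting URIs with query strings) can be sketched roughly as below. This is a generic illustration in Python, not arachnode.net's code: the function names and the two boolean switches are hypothetical stand-ins for whatever restricttoHost and disallowQueryStrings actually control, and the real rule may differ in its details.

```python
from urllib.parse import urlsplit


def has_query_string(absolute_uri: str) -> bool:
    """Return True when the URI carries a non-empty query string."""
    return urlsplit(absolute_uri).query != ""


def is_same_host(seed_uri: str, discovered_uri: str) -> bool:
    """Return True when both URIs share a host (urlsplit lowercases hostnames)."""
    seed_host = urlsplit(seed_uri).hostname
    found_host = urlsplit(discovered_uri).hostname
    return seed_host is not None and seed_host == found_host


def is_disallowed(seed_uri: str, discovered_uri: str,
                  disallow_query_strings: bool, restrict_to_host: bool) -> bool:
    """Hypothetical rule: apply each check only if its switch is on."""
    if disallow_query_strings and has_query_string(discovered_uri):
        return True
    if restrict_to_host and not is_same_host(seed_uri, discovered_uri):
        return True
    return False
```

With the query-string switch off and the host switch on, a link such as http://example.com/a?b=1 found on http://example.com/ would be allowed, while a link to a Wikipedia page would be rejected.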



copyright 2004-2010, arachnode.net LLC
