arachnode.net

List of Questions

Jay Stevens posted on Fri, May 15 2009 6:52 PM

Ok.  I'm almost completely up and running with VS2008 and SQL2008.  Here are a few questions whose answers I can't seem to track down myself.

  1. How do I stop the system from crawling arachnode.net along with whatever I specify?  I love you guys, but I can only crawl the site so many times.
  2. What's the best way to crawl a specific list of sites?  I think this is accomplished by adding them to the IsDisallowedAbsoluteURI list and then modifying the Negate attribute?
  3. For those crawled sites, I don't want to follow links off to other domains.  I thought that setting restricttoHost = true would fix that, but it doesn't seem to: one of the sites I was crawling had a reference to Wikipedia, and the next thing I knew I was crawling Wikipedia.
  4. How do I get the system to extract and store the source/text from the pages into WebPages_MetaData (which is needed for the TermExtraction.dtsx package)?  I've set the configuration in the DB, but that doesn't seem to work.
  5. How do I modify the Crawl so it doesn't automatically reject a webpage just because it has a query string in it?  I think this is in the CrawlActions under AbsoluteURI?

I'm getting closer.  Any help is appreciated.

J

All Replies

Hey J -

It's late for me - all good questions... let me get back to you over the weekend!

But, good news - Version 1.2 is close... I'm hoping that within a week we'll have it ready!!!

Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

1.) Check out the stored procedure dbo.arachnode_usp_arachnode.net_RESET_DATABASE and remove the line shown below.  (I have removed this from the SP that will be part of Version 1.2.)

EXEC dbo.arachnode_omsp_CrawlRequests_INSERT @Datetime, 'http://arachnode.net/', 4, 0, 1, null

2.) That would work.  An even easier way would be to submit CrawlRequests with a high depth, say '100', and restrict crawling to CrawlRequests and WebPages, as in the sketch below.
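
For example, submitting a seed from code might look roughly like this.  To be clear, this is only a sketch: the Crawler, CrawlRequest and Discovery constructor parameters shown are assumptions based on the 1.x samples, so check the Console project in your build for the exact signatures.

using Arachnode.SiteCrawler;

// Sketch only - the constructor signatures below are assumed from the
// 1.x samples, not authoritative; verify against your build.
Crawler crawler = new Crawler();

// A high depth ('100') effectively lets the crawler exhaust each seed site.
crawler.Crawl(new CrawlRequest(new Discovery("http://example.com/"), 100));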

3.) I think this may actually be a bug that I fixed for the upcoming Version 1.2 release.  :(  (Sorry...)  If you stop a crawl, the host restriction isn't checked when the outstanding requests are saved back to the DB.

4.) Check out the 'Integration' project in the solution.  This is where the terms get split.  The switch you found in the DB is for extracting Text and XML from the WebPages (inserted into WebPages_MetaData) and for enabling creation of an HtmlAgilityPack HtmlDocument.  (The HtmlDocument functionality is very memory intensive.)
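
If you're curious what that Text extraction roughly looks like, here is a minimal HtmlAgilityPack sketch; webPageSource is a stand-in for page source your crawl has already downloaded, and the WebPages_MetaData insert is omitted - the Integration project remains the authoritative version.

using HtmlAgilityPack;

// Minimal sketch: pull the visible text out of downloaded page source.
// Building the full HtmlDocument is what makes this path memory intensive.
HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(webPageSource);

string text = htmlDocument.DocumentNode.InnerText;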

5.) Yes - disallowQueryStrings.

<?xml version="1.0" encoding="utf-8" ?>
<crawlRules>
  <rule assemblyName="Arachnode.SiteCrawler" typeName="Arachnode.SiteCrawler.Rules.AbsoluteUri" isEnabled="true" order="1" ruleType="3" outputIsDisallowedReason="true" disallowNamedAnchors="true" disallowQueryStrings="false" ...
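
With disallowQueryStrings set to "false", the AbsoluteUri rule should no longer reject a WebPage just because its AbsoluteUri contains a query string (e.g. page.aspx?id=1).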

