arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop

Completely Open Source @ GitHub


Crawl Restriction and dbo.Domain table

rlink12 posted on Wed, Nov 13 2013 9:31 PM

Hi there,

I have two questions:

1. I am trying to figure out why the crawler is not populating the dbo.domain and dbo.domain_discoveries tables. I suspect it is one of the application settings, but I have not been able to figure out which one.

2. How can a crawl request be restricted to a specific URL and all the pages found below it? For example, for www.mysite.com/about.aspx I only want to crawl content found at that URI or below it. I believe it is a combination of the URI classification types, but I have been unable to figure out the correct combination.

Thanks,

R

Verified Answer

Verified by arachnode.net

1.) The setting is ApplicationSettings.ClassifyAbsoluteUris. There is a performance tweak in the 3.5 code; if you are interested, or need to get it going before 3.5 is released, contact me on Skype.

2.) See http://arachnode.net/forums/p/1920/42397.aspx#42397. Use the 'restrictCrawlTo' parameter in the CR (CrawlRequest) constructor; you probably want 'Host | DirectoryLevel | DirectoryLevelDown'.
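
To illustrate how a combined flag value such as 'Host | DirectoryLevel | DirectoryLevelDown' could restrict a crawl, here is a minimal, self-contained sketch. It is not arachnode.net's implementation: the enum below is a hypothetical stand-in that only borrows the member names from the answer, and the meaning given to each flag (same host, same directory, deeper directory) is an assumption made for illustration. The class and method names are also hypothetical.

```csharp
using System;

// Hypothetical stand-in for the crawler's URI classification flags.
// Member names follow the answer above; the real enum may differ.
[Flags]
public enum UriClassificationType
{
    None = 0,
    Host = 1,
    DirectoryLevel = 2,
    DirectoryLevelDown = 4
}

public static class CrawlRestrictionSketch
{
    // Returns the directory part of the path, e.g. "/about/index.aspx" -> "/about/".
    private static string DirectoryOf(Uri uri)
    {
        string path = uri.AbsolutePath;
        return path.Substring(0, path.LastIndexOf('/') + 1);
    }

    // Assumed semantics: Host = same host as the parent,
    // DirectoryLevel = same directory, DirectoryLevelDown = a deeper directory.
    public static bool IsAllowed(Uri parent, Uri discovered, UriClassificationType restrictCrawlTo)
    {
        if (restrictCrawlTo.HasFlag(UriClassificationType.Host) &&
            !string.Equals(parent.Host, discovered.Host, StringComparison.OrdinalIgnoreCase))
        {
            return false;
        }

        var directoryFlags = UriClassificationType.DirectoryLevel | UriClassificationType.DirectoryLevelDown;
        if ((restrictCrawlTo & directoryFlags) == UriClassificationType.None)
        {
            return true; // no directory restriction requested
        }

        string parentDir = DirectoryOf(parent);
        string discoveredDir = DirectoryOf(discovered);

        bool sameLevel = string.Equals(discoveredDir, parentDir, StringComparison.Ordinal);
        bool below = !sameLevel && discoveredDir.StartsWith(parentDir, StringComparison.Ordinal);

        return (restrictCrawlTo.HasFlag(UriClassificationType.DirectoryLevel) && sameLevel)
            || (restrictCrawlTo.HasFlag(UriClassificationType.DirectoryLevelDown) && below);
    }
}
```

Under these assumed semantics, with a parent of http://www.mysite.com/about/index.aspx and all three flags set, a link to http://www.mysite.com/about/team/bios.aspx is allowed, while http://www.mysite.com/contact.aspx and any URI on another host are rejected. Note that a start URI such as www.mysite.com/about.aspx sits in the root directory, so in this sketch the directory flags would not narrow the crawl below the host.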

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet


copyright 2004-2017, arachnode.net LLC