arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop

CrawlRule or CrawlAction?


JCrawl posted on Wed, Jul 1 2015 3:56 PM

Not sure if I need to create a crawl rule or a crawl action.

I am trying to extend with a plugin. I am looking for pages with specific content, and if it exists I would like to save that specific content to the database, but not store the actual content of the site. I originally did this as a crawl rule; however, my performance has dropped really badly. I am using the HtmlAgilityPack to check whether the criteria exist in the page.

Any suggestions on how I can figure out why I have slowed down so much would be appreciated.

Thank you

All Replies


CrawlRules should be used for filtering, and are called once per Discovery, for every Discovery. For example, if you find 100 HyperLinks on a page, your CrawlRule will be called 100 times to check whether each Discovery 'IsDisallowed' or 'IsStorable'.

CrawlActions are used to 'Do Something', and are called only once, according to the following designations (which you already know, just adding an image for others):

[Image: CrawlAction designations]
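
To make the difference in call counts concrete, here is a minimal sketch. It does not use the arachnode.net API: `Page`, `CallCountSketch`, `IsDisallowed`, `PerformAction`, and `Crawl` are hypothetical stand-ins, and the order in which the hooks run is simplified. It only counts how often a rule-style hook and an action-style hook would fire for one page with 100 HyperLinks.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for illustration only; these are NOT the arachnode.net types.
public sealed class Page
{
    public string Html { get; }
    public IReadOnlyList<string> HyperLinks { get; }

    public Page(string html, IReadOnlyList<string> hyperLinks)
    {
        Html = html;
        HyperLinks = hyperLinks;
    }
}

public static class CallCountSketch
{
    public static int RuleCalls;
    public static int ActionCalls;

    // Rule-style hook: a cheap yes/no filter, invoked once for every discovered link.
    public static bool IsDisallowed(string absoluteUri)
    {
        RuleCalls++;
        return absoluteUri.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase);
    }

    // Action-style hook: the expensive work (parse, match, save) runs once per page.
    public static bool PerformAction(Page page)
    {
        ActionCalls++;
        // Placeholder for the real check, e.g. an HtmlAgilityPack query.
        return page.Html.Contains("specific content");
    }

    // Simplified crawl of a single page: filter each link, then act on the page once.
    public static void Crawl(Page page)
    {
        foreach (string link in page.HyperLinks)
        {
            IsDisallowed(link);
        }
        PerformAction(page);
    }

    public static void Main()
    {
        var links = new List<string>();
        for (int i = 0; i < 100; i++)
        {
            links.Add("http://example.com/page" + i);
        }

        Crawl(new Page("<html><body>specific content</body></html>", links));

        Console.WriteLine("Rule calls: " + RuleCalls);     // Rule calls: 100
        Console.WriteLine("Action calls: " + ActionCalls); // Action calls: 1
    }
}
```

If the HtmlAgilityPack parse currently sits in the rule, it may be running once per Discovery rather than once per page, which could account for the slowdown, assuming the parse is the expensive part. Keeping the rule to a cheap filter and moving the parse-and-save step into a CrawlAction would be the first thing to try.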

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet


copyright 2004-2017, arachnode.net LLC