arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE and MongoDB/RavenDB/Hadoop

Completely Open Source @ GitHub


Creating Plugins

I. Overview of existing plugins:

1.) CrawlActions:

  • Anonymizer.cs : Provides AbsoluteUri modification suitable for submitting to an anonymous proxy.
  • BayesianClassifier.cs : Provides a self-classifying Bayesian classifier.
  • CustomManageLuceneDotNetIndexes.cs : Illustrates how to extend an existing Plugin.
  • DiscoveryChain.cs : Records the exact crawling route the Crawler took to discover the Discovery.
  • ManageLuceneDotNetIndexes.cs : Create Lucene.NET indexes from downloaded Discoveries.
  • Templater.cs : Extracts the most meaningful sections of text from a WebPage.

2.) CrawlRules:

  • AbsoluteUri.cs : Provides AbsoluteUri filtering for CrawlRequests and Discoveries.
  • ContentLength.cs : Filters Discoveries based on content length, or file size.
  • DataType.cs : Filters Discoveries based on DataType, or file type.
  • Frequency.cs : Governs how often the Crawler requests data from a Domain.
  • ResponseHeaders.cs : Filters Discoveries based on their returned response headers.
  • ResponseUri.cs : Filters Discoveries based on redirection rules.
  • RobotsDotText.cs : Filters Discoveries based on robots.txt.
  • Source.cs : Filters Discoveries based on content.
  • StatusCode.cs : Filters Discoveries based on status code (404, 503, etc.).
  • Storable.cs : Provides a template for implementing IsStorable filtering.

3.) CrawlAction, CrawlRule and EngineAction declaration and configurations are stored in cfg.CrawlActions, cfg.CrawlRules and cfg.EngineActions, respectively. Configuration for each is stored in the Settings column, in a KeyValuePair, with each Key and Value separated by an equals sign, and each KeyValuePair separated by a vertical bar.
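As a concrete illustration of the Settings format described above, a Settings column value holding two configuration entries would look like the following (the key names here are hypothetical examples, not actual arachnode.net settings):

```
MaximumDepth=2|OutputDirectory=C:\Output
```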

4.) The ActionManager.cs dynamically loads each type defined in cfg.CrawlActions and cfg.EngineActions and executes each CrawlAction at the configured stage in crawling.

5.) The RuleManager.cs dynamically loads each type defined in cfg.CrawlRules and executes each CrawlRule at the configured stage in crawling.
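The dynamic loading performed by ActionManager.cs and RuleManager.cs can be sketched with standard .NET reflection. The AssemblyName and TypeName column names come from the configuration tables described above; the surrounding code is an illustrative assumption, not the actual arachnode.net source:

```csharp
using System;
using System.Reflection;

// Illustrative sketch: instantiate a plugin from the AssemblyName and
// TypeName values stored in cfg.CrawlActions / cfg.CrawlRules.
public static class PluginLoader
{
    public static object LoadPlugin(string assemblyName, string typeName)
    {
        // Load the assembly by name (e.g. "Plugins")...
        Assembly assembly = Assembly.Load(assemblyName);

        // ...and create an instance of the configured type
        // (e.g. "Plugins.NewCrawlAction").
        Type type = assembly.GetType(typeName, throwOnError: true);

        return Activator.CreateInstance(type);
    }
}
```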

6.) The outlined section in RuleManager.cs illustrates the creation of the KeyValuePair configuration settings from the database.

7.) The outlined section in AbsoluteUri.cs illustrates the consumption and assignment of the KeyValuePair configuration settings.
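The KeyValuePair format can be consumed with ordinary string splitting. This is a sketch of the idea, assuming the `Key=Value|Key=Value` format described above; it is not the outlined arachnode.net code itself:

```csharp
using System;
using System.Collections.Generic;

public static class SettingsParser
{
    // Parse a Settings value such as "Key1=Value1|Key2=Value2"
    // into a dictionary of configuration entries.
    public static Dictionary<string, string> Parse(string settings)
    {
        var result = new Dictionary<string, string>();

        // Each KeyValuePair is separated by a vertical bar...
        foreach (string pair in settings.Split('|'))
        {
            // ...and each Key and Value by an equals sign.
            string[] keyValue = pair.Split(new[] { '=' }, 2);
            result[keyValue[0]] = keyValue[1];
        }

        return result;
    }
}
```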

8.) AbsoluteUri.cs illustrates a CrawlRule that is called at both the PreRequest and PostRequest stages. Both CrawlRequests and Discoveries are processed by CrawlRules.

9.) In contrast, CrawlActions only operate on CrawlRequests.

10.) CrawlActions, CrawlRules and EngineActions extend ABehavior, which defines a common set of rules for instantiation and execution.

11.) CrawlRequests and Discoveries extend ADisallowed, which is used to restrict, or disallow Discoveries from being crawled.

12.) CrawlRequests and Discoveries also extend AStorable, through ADisallowed.  AStorable instructs the FileManager.cs, ImageManager.cs and WebPageManager.cs to insert the Discovery into the database.

II. Creating new plugins:

1.) Create a new CrawlAction in project 'Plugins'. The new CrawlAction must extend ACrawlAction. This example extracts the Host from the CrawlRequest's Discovery and creates a text file at c:\NewCrawlAction.
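A minimal sketch of the 'NewCrawlAction' described above might look like the following. The ACrawlAction base class and the CrawlRequest/Discovery member names are taken from the text; the exact signatures in the arachnode.net source may differ:

```csharp
using System.IO;

// Sketch of NewCrawlAction: extract the Host from the CrawlRequest's
// Discovery and write it to a text file under c:\NewCrawlAction.
// Base class and member signatures are assumptions for illustration.
public class NewCrawlAction : ACrawlAction
{
    public override void PerformAction(CrawlRequest crawlRequest)
    {
        // Extract the Host from the CrawlRequest's Discovery...
        string host = crawlRequest.Discovery.Uri.Host;

        // ...and append it to the output file.
        Directory.CreateDirectory(@"c:\NewCrawlAction");
        File.AppendAllText(@"c:\NewCrawlAction\Hosts.txt", host + "\n");
    }
}
```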

2.) Create a new CrawlRule in project 'Plugins'. The new CrawlRule must extend ACrawlRule. This example filters CrawlRequests and Discoveries based on two rules: a.) if the CrawlRequest or Discovery's Host does not contain the letter 'a', then it will be disallowed, and b.) if the CrawlRequest or Discovery's Host does not contain the letter 'z', then it will not be storable.
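The two rules above can be sketched as follows. The ACrawlRule base class and the IsDisallowed/IsStorable member names are assumptions based on the text, not the actual arachnode.net signatures:

```csharp
// Sketch of NewCrawlRule: disallow Hosts lacking an 'a', and mark
// Hosts lacking a 'z' as not storable.
public class NewCrawlRule : ACrawlRule
{
    public override bool IsDisallowed(Discovery discovery)
    {
        string host = discovery.Uri.Host;

        // a.) No 'a' in the Host -> the Discovery is disallowed.
        if (!host.Contains("a"))
        {
            discovery.IsDisallowed = true;
        }

        // b.) No 'z' in the Host -> the Discovery is not storable.
        if (!host.Contains("z"))
        {
            discovery.IsStorable = false;
        }

        return discovery.IsDisallowed;
    }
}
```

Under these rules, 'http://amazon.com' passes both checks: its Host contains 'a' (not disallowed) and 'z' (storable), matching the result described in step 9 below.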

3.) Enter values for CrawlActionTypeID, AssemblyName, TypeName, IsEnabled, Order and Settings in cfg.CrawlActions for 'NewCrawlAction'.

4.) Enter values for CrawlRuleTypeID, AssemblyName, TypeName, IsEnabled, Order, OutputIsDisallowedReason and Settings in cfg.CrawlRules for 'NewCrawlRule'.

5.) The values for CrawlActionType and CrawlRuleType are found in cfg.CrawlActionTypes and cfg.CrawlRuleTypes. PreRequest = Before the AbsoluteUri is contacted. PreGet = After the HttpHeaders have been downloaded, but before the content is downloaded. PostRequest = After the content has been downloaded.

6.) Check 'Program.cs' in the 'Console' project to ensure that 'NewCrawlAction' and 'NewCrawlRule' are enabled, and all other plugins are disabled. This configuration is helpful in determining that our customizations are performing as expected.

7.) Begin crawling and verify that our plugins are loaded and enabled.

8.) Examine the output of 'NewCrawlAction'.

9.) Examine the result of 'NewCrawlRule'. 'http://amazon.com' was not disallowed due to the presence of 'a', and was stored due to the presence of 'z'.


copyright 2004-2017, arachnode.net LLC