arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE
An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop

How to use the arachnode.net Plugins with AN.Next.

This example uses the most commonly used plugin, ManageLuceneDotNetIndexes.cs.

Many plugins require that a Dictionary<string, string> be passed to AssignSettings(...), and this applies to ManageLuceneDotNetIndexes.cs.

These settings correspond to the 'Settings' column in cfg.CrawlActions.
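
The hand-off of settings can be sketched roughly as follows. This is a minimal illustration only: `IndexingPluginStandIn` is a hypothetical stand-in rather than the real ManageLuceneDotNetIndexes class, and the keys `IndexDirectory` and `AutoCommit` are invented placeholders. The real keys are whatever the 'Settings' column of cfg.CrawlActions holds for the plugin, and the real AssignSettings(...) signature should be checked in the Plugins project.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a plugin; NOT the real ManageLuceneDotNetIndexes class.
public class IndexingPluginStandIn
{
    private readonly Dictionary<string, string> _settings = new Dictionary<string, string>();

    // Mirrors the idea of AssignSettings(...): the plugin copies the values it is given.
    public void AssignSettings(Dictionary<string, string> settings)
    {
        foreach (KeyValuePair<string, string> setting in settings)
        {
            _settings[setting.Key] = setting.Value;
        }
    }

    // Returns the stored value, or null when the key was never assigned.
    public string GetSetting(string key)
    {
        string value;
        return _settings.TryGetValue(key, out value) ? value : null;
    }
}

public static class SettingsExample
{
    // Builds by hand the dictionary that would otherwise be read from cfg.CrawlActions.
    public static Dictionary<string, string> BuildSettings()
    {
        return new Dictionary<string, string>
        {
            { "IndexDirectory", @"C:\Example\LuceneIndex" }, // hypothetical key and path
            { "AutoCommit", "true" }                          // hypothetical key
        };
    }
}
```

In AN.Next the dictionary would be built once and passed to the plugin instance before the crawl starts.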

ManageLuceneDotNetIndexes.cs is a class just like any other, so we can create an instance of it in AN.Next.

To do so, we'll need to add references to the 'Plugins' project as well as to the 'SiteCrawler' project.

Add the member variables from the attached Program.cs to AN.Next's Program.cs. We store a reference to _crawlerPlugins so we can use the ArachnodeDAO assigned to the particular crawl thread that fires the WebPage/File/Image completed event.

Also notice the call to DataTypeManager.Instance().RefreshDataTypes(). The Lucene plugin expects the crawlRequest.DataType property to be populated. The DataTypes come from cfg.AllowedDataTypes.

In the crawler_CrawlerStatus event handler, we've added code to populate the settings that would otherwise come from ActionManager.cs in the SiteCrawler project. Again, see cfg.CrawlActions for the values.

A few plugins have code in the Stop() method that should be executed. The indexing plugin is one of them, so make sure Stop() is called on it.
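
One way to make sure Stop() always runs is a try/finally around the crawl, sketched below. `StoppablePluginStandIn` is a hypothetical stand-in, and the idea that Stop() flushes buffered documents is an assumption made for illustration. The source only says that Stop() contains code that should be executed.

```csharp
using System;

// Hypothetical stand-in for a plugin whose Stop() has work to do.
public class StoppablePluginStandIn
{
    public int PendingDocuments { get; private set; }
    public bool Stopped { get; private set; }

    // Pretends to buffer one document for indexing.
    public void Index(string absoluteUri)
    {
        PendingDocuments++;
    }

    // Assumption: Stop() flushes whatever is still buffered.
    public void Stop()
    {
        PendingDocuments = 0;
        Stopped = true;
    }
}

public static class StopExample
{
    // Runs the supplied crawl delegate and calls Stop() even if the crawl throws.
    public static StoppablePluginStandIn Run(Action<StoppablePluginStandIn> crawl)
    {
        var plugin = new StoppablePluginStandIn();
        try
        {
            crawl(plugin);
        }
        finally
        {
            plugin.Stop();
        }
        return plugin;
    }
}
```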

Now, the plugin is ready to be used.

First, we need to translate between AN.Next's CrawlRequest (CR) and arachnode.net's CrawlRequest.
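
The translation amounts to copying fields from one request type to the other. The sketch below uses two hypothetical stand-in classes, and every property name in it (`Uri`, `Depth`, `AbsoluteUri`, `CurrentDepth`) is an assumption. The real mapping is in the attached Program.cs.

```csharp
using System;

// Hypothetical stand-in for AN.Next's crawl request; property names are assumptions.
public class NextCrawlRequestStandIn
{
    public Uri Uri { get; set; }
    public int Depth { get; set; }
}

// Hypothetical stand-in for arachnode.net's crawl request; property names are assumptions.
public class ClassicCrawlRequestStandIn
{
    public string AbsoluteUri { get; set; }
    public int CurrentDepth { get; set; }
}

public static class CrawlRequestTranslator
{
    // Copies the fields the plugin needs from one request type to the other.
    public static ClassicCrawlRequestStandIn Translate(NextCrawlRequestStandIn source)
    {
        if (source == null) throw new ArgumentNullException("source");
        if (source.Uri == null) throw new ArgumentException("The request has no Uri.", "source");

        return new ClassicCrawlRequestStandIn
        {
            AbsoluteUri = source.Uri.AbsoluteUri,
            CurrentDepth = source.Depth
        };
    }
}
```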

Then, in the event handlers that signify completion of a WebPage/File/Image, we'll call the indexing plugin. Notice that the WebPageManager is called first. The Lucene indexing expects to find the file on disk, so we call the WebPageManager to make sure it is there.
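
The ordering matters: the file goes to disk first, and only then does indexing run. The sketch below shows that ordering with plain System.IO calls. `EnsureOnDisk` and `IndexFromDisk` are hypothetical helpers standing in for the WebPageManager call and the plugin call, not the real arachnode.net methods.

```csharp
using System;
using System.IO;
using System.Text;

public static class CompletedHandlerExample
{
    // Stand-in for the WebPageManager step: writes the page source to disk if it is missing.
    public static string EnsureOnDisk(string directory, string fileName, string html)
    {
        Directory.CreateDirectory(directory);
        string path = Path.Combine(directory, fileName);
        if (!File.Exists(path))
        {
            File.WriteAllText(path, html, Encoding.UTF8);
        }
        return path;
    }

    // Stand-in for the indexing call: it refuses to run when the file is not on disk.
    // Returns the number of characters read, as a placeholder for real indexing work.
    public static int IndexFromDisk(string path)
    {
        if (!File.Exists(path))
        {
            throw new FileNotFoundException("Indexing expects the file on disk.", path);
        }
        return File.ReadAllText(path, Encoding.UTF8).Length;
    }
}
```

A completed-event handler would call the first helper and pass the returned path to the second.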

To get the solution to build, you will have to make internal and private properties in the SiteCrawler project public.

A small modification was made to ManageLuceneDotNetIndexes.cs.  Both the modified AN.Next Program.cs and ManageLuceneDotNetIndexes.cs files are attached.


Posted Sat, Dec 29 2012 11:38 AM by arachnode.net
Attachment: ANNextIndexing.zip

copyright 2004-2017, arachnode.net LLC