arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE or MongoDB/RavenDB/Hadoop

Operations: selecting and managing a collection of crawl requests

jamesy posted on Wed, Jul 6 2011 10:12 PM

Hi Mike and all,

I'm evaluating AN for a client and my goal is to crawl a collection of URIs on a regular schedule.  Essentially I will run AN as a service and schedule crawls for various sites by specifying those site URIs to AN.  My understanding is that the table CrawlRequests is working memory for AN, so it looks like I would need to create my own table and copy that list of "seed" URIs to CrawlRequests.  Is this the correct strategy?

I would insert into the CrawlRequests table, which requires an [AbsoluteUri2], whereas the CrawlRequest constructor only requires AbsoluteUri1.  How is AbsoluteUri2 used?

My collection of URIs will be dynamic: I will add a site whenever it looks valuable enough to include, so I need a way to discover relevant outbound links during a crawl and store them somewhere for evaluation.  I'm only crawling the site itself, not its outbound links.  Is there a way to save those outbound links in something like CrawlRequests without actually crawling them?  I would then manually filter them and add some to the collection.

Is there a table where I can specify URIs that I never want to crawl, such as Facebook, Twitter, eBay, etc.?  Could that table be DisallowedDomains?

Thanks for any help,

James

All Replies

Hello James!

As you know, AN already exists as a service.

Typically (and preferably), the dbo.CrawlRequests table is used by AN itself and shouldn't be modified directly; use the ArachnodeDAO methods instead.

There is also an AbsoluteUri0.

0 = The Originator.  (The AbsoluteUri that initially started the DiscoveryChain.)

1 = The Parent.

2 = The Child.
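
To make the three roles concrete, here is a small, library-neutral sketch in Python. It is not AN code: the `Discovery` class, its field names, and the `follow` helper are hypothetical, and it assumes for illustration that a seed uses the same URI in all three roles. It only shows how the originator stays fixed while the parent and child shift as links are followed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Discovery:
    """Hypothetical model of the three URI roles; not an AN type."""
    absolute_uri0: str  # 0 = originator: the URI that started the chain
    absolute_uri1: str  # 1 = parent: the page the link was found on
    absolute_uri2: str  # 2 = child: the link that was found


def follow(current: Discovery, link: str) -> Discovery:
    """Follow a link: the old child becomes the parent, the originator is unchanged."""
    return Discovery(current.absolute_uri0, current.absolute_uri2, link)


# Assumption: a seed occupies all three roles at once.
seed = Discovery("http://example.com/", "http://example.com/", "http://example.com/")
step1 = follow(seed, "http://example.com/a")
step2 = follow(step1, "http://example.com/a/b")
```

After two steps, `step2` still carries the seed as its originator, `http://example.com/a` as its parent, and `http://example.com/a/b` as its child.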

Look at CrawlRequests.txt for the service: http://arachnode.net/search/SearchResults.aspx?q=CrawlRequests.txt

A good way to filter outbound links would be to create a plugin: http://arachnode.net/Content/CreatingPlugins.aspx

Use the AbsoluteUri.cs plugin as an example (which also contains the functionality for cfg.DisallowedDomains), and notice how both methods pass through into the protected IsDisallowed.  You will want to follow this same pattern for your plugin.

When creating the CrawlRequests you want to specify RestrictCrawlTo to your Domain/Host and RestrictDiscoveriesTo to None, which will allow Discoveries from other Domains/Hosts to enter the plugins...

http://arachnode.net/search/SearchResults.aspx?q=RestrictCrawlTo

http://arachnode.net/search/SearchResults.aspx?q=RestrictDiscoveriesTo

From your plugin, check the properties (domain/host) of the Uri.  If it is external, set the IsDisallowed property = true, which will prevent AN from creating a CrawlRequest, and call ArachnodeDAO.InsertHyperLink to store your external HyperLinks (you must call it manually, as InsertHyperLinks/InsertHyperLinkDiscoveries is turned off by default).  Alternatively, you could turn both switches on and set the IsStorable property = false for Discoveries that are allowed according to your rules.  Don't set this property for CrawlRequests (just the second method shown) or your Files/Images/WebPages won't be stored.

http://arachnode.net/search/SearchResults.aspx?q=IsStorable
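
The decision the plugin has to make can be sketched independently of AN. The Python below is only an illustration of the rule, not the plugin API: `DISALLOWED_DOMAINS`, `host_of`, `in_domain`, `classify`, and the three result labels are all hypothetical names. In a real plugin, "review" would roughly correspond to setting IsDisallowed = true and storing the HyperLink, "ignore" to setting IsDisallowed = true without storing anything, and "crawl" to leaving the Discovery alone.

```python
from urllib.parse import urlsplit

# Hypothetical stand-in for domains you never want, e.g. rows in cfg.DisallowedDomains.
DISALLOWED_DOMAINS = {"facebook.com", "twitter.com", "ebay.com"}


def host_of(uri: str) -> str:
    """Lower-cased host of a URI with a leading 'www.' removed ('' if there is none)."""
    host = (urlsplit(uri).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def in_domain(host: str, domain: str) -> bool:
    """True if host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def classify(seed_uri: str, discovered_uri: str) -> str:
    """Return 'crawl', 'ignore', or 'review' for a link discovered while crawling seed_uri."""
    host = host_of(discovered_uri)
    if not host:
        return "ignore"  # e.g. mailto: links have no host
    if in_domain(host, host_of(seed_uri)):
        return "crawl"   # internal link: let the crawl continue
    if any(in_domain(host, d) for d in DISALLOWED_DOMAINS):
        return "ignore"  # external and unwanted: drop it
    return "review"      # external and unknown: store it, but do not crawl it


seed = "http://www.example.com/"
links = [
    "http://example.com/about",
    "https://www.facebook.com/somepage",
    "http://blog.other.org/post",
]
# External links kept for manual evaluation later.
review_queue = [link for link in links if classify(seed, link) == "review"]
```

Here the host comparison treats subdomains as part of the seed's site; whether that matches your RestrictCrawlTo setting (Domain versus Host) is a choice you would need to make for your own crawl.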

Make sense?  :)

(Also: the best way to see what is happening in the AbsoluteUri.cs plugin is to set AN to a single thread (MaximumNumberOfCrawlThreads), ensure the AbsoluteUri.cs CrawlRule is enabled (it is in the DB under cfg.CrawlRules, and may be disabled in code in Console\Program.cs... look for _crawler.CrawlRules), set a breakpoint at each of the methods, and examine the call stack when they are triggered.  Finally, you may use the Immediate window to simulate IsDisallowed and IsStorable, and all should be clear.)

-Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet


copyright 2004-2017, arachnode.net LLC