arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE, MongoDB, RavenDB, or Hadoop

Completely Open Source @ GitHub

Is it possible to create many instances of the crawler running at the same time?

Answered (Verified) This post has 1 verified answer | 5 Replies | 2 Followers

Top 150 Contributor
2 Posts
lanho posted on Tue, Apr 27 2010 2:30 AM

AN works very well. However, the purpose of my project is to create many instances of the crawler, where each one crawls a specific type of information on a specific website. That means each crawler has its own rules and actions, and the instances need to run at the same time.

Is it possible to achieve this? As I understand it, all instances of the crawler will process all the rules and actions that are enabled, save their discoveries to the same database when the cache is full, and then read them back for processing. This means that one instance of the crawler can end up processing the discoveries of another.

Is there a solution that achieves this? I hope my idea is understandable, and I appreciate any ideas that help me with this issue.

Thanks,
Lan

Verified Answer

Top 10 Contributor
1,905 Posts
Verified by arachnode.net

The easiest way to accomplish this is to install separate instances of SQL Server.

The other way is to create a rule that filters in a switch/case statement:

switch (UserDefinedFunctions.ExtractHost(crawlRequest.Discovery.Uri.AbsoluteUri).Value)
{
    case "host1.com":
        // perform filtering here...
        break;
}

Then, you will write an EngineAction that fills the Cache according to the same methodology the Engine uses currently. (Having the source makes showing you this much, much easier...) Your EngineAction will use the same switch/case statement as your CrawlRule/CrawlAction to determine which server instance AN is running on, and will call a copy of the main CrawlRequest retrieval database procedure, modified to take a parameter so that it retrieves only the CrawlRequests for the server instance you are running. You will also need to turn off ApplicationSettings.CreateCrawlRequestsFromDatabaseCrawlRequests, as the EngineAction will handle populating the Engine's CrawlRequests.
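
As a rough sketch, assuming the A-prefixed plugin pattern AN uses elsewhere (AEngineAction, the PerformAction signature, the InstanceName constant, and the stored-procedure name are all illustrative placeholders, not the verified API):

using System.Data;
using System.Data.SqlClient;

// Hedged sketch: an EngineAction that refills the cache with only the
// CrawlRequests assigned to this server instance. AEngineAction and the
// PerformAction signature are assumed from AN's A-prefixed plugin
// pattern; check the source for the actual contract.
public class PerInstanceEngineAction : AEngineAction
{
    // Hypothetical: identifies which crawler instance this is.
    private const string InstanceName = "instance1";

    public override void PerformAction(Crawler crawler)
    {
        using (SqlConnection connection = new SqlConnection("<your connection string>"))
        // A copy of the stock CrawlRequest retrieval procedure, modified
        // to accept an @InstanceName parameter (hypothetical name).
        using (SqlCommand command = new SqlCommand("dbo.an_SelectCrawlRequestsForInstance", connection))
        {
            command.CommandType = CommandType.StoredProcedure;
            command.Parameters.AddWithValue("@InstanceName", InstanceName);

            connection.Open();

            using (SqlDataReader reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    // Build a CrawlRequest from each row and hand it to
                    // the Engine, mirroring the stock cache-fill path.
                }
            }
        }
    }
}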

So, yes, this is possible.  Simple as that!

-Mike

 

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies

Top 150 Contributor
2 Posts
lanho replied on Mon, May 3 2010 3:33 AM

Thanks for your answer, Mike. We already have a license and the source code. In fact, I don't know in advance which sites will be crawled, so I cannot hard-code a switch/case statement. I thought about adding a property to the ABehavior class (DomainApplied, for example). Later on, if I want to crawl a new site, I create a new DLL (Rule and Action) for that site. DomainApplied is set via the Settings column in the database. In IsDisallowed/PerformAction, I get the domain of the CrawlRequest/Discovery and compare it to this property; if they are not equal, I return.
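
A minimal sketch of that check (ACrawlRule, the IsDisallowed signature, and the DomainApplied property are assumptions based on the discussion above, not the verified AN API):

using System;

// Hedged sketch: a per-site rule that short-circuits for any domain it
// was not written for. DomainApplied would be populated from the
// Settings column in the database, as described above.
public class ReviewSiteRule : ACrawlRule
{
    // Hypothetical property: the domain this rule applies to.
    public string DomainApplied { get; set; }

    public override bool IsDisallowed(CrawlRequest crawlRequest)
    {
        string host = crawlRequest.Discovery.Uri.Host;

        // If this rule wasn't written for the current domain, let the
        // request through untouched.
        if (!host.EndsWith(DomainApplied, StringComparison.OrdinalIgnoreCase))
        {
            return false;
        }

        // Site-specific filtering logic goes here...
        return false;
    }
}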

Concerning the issue of many crawler instances running at the same time: writing an EngineAction that fills the cache works when the crawlers' domains do not overlap. But suppose I have two crawlers, one crawling reviews and one crawling news (or any other type of information), both on amazon.com; then filling the cache based on domain is no longer correct. Moreover, I want to save all of these crawlers so that I can run or modify them later. I am thinking of adding a Crawler table, and adding a CrawlerID column to the CrawlRequest table that references it.

What do you think about this approach? Does it make sense?

Thanks,
-Lan

Top 10 Contributor
1,905 Posts

I think you have a good handle on how AN works, from what you have written.

What likely makes the most sense is for me to finish the multi-server caching. This isn't difficult for me to do: just adding a socket call after checking the current server's RAM, then placing a socket call to all of the other crawlers, and then checking the database.
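
In outline, that lookup order would look something like this (every name here is an illustrative placeholder, not AN's actual multi-server API):

using System.Collections.Generic;

// Hedged sketch of the lookup order described above: the current
// server's RAM first, then a socket call to each of the other crawlers,
// then the database. All names are placeholders.
public class MultiServerDiscoveryCache
{
    private readonly HashSet<string> _localCache = new HashSet<string>();
    private readonly List<ICrawlerPeer> _peers = new List<ICrawlerPeer>();

    public bool HasBeenDiscovered(string absoluteUri)
    {
        // 1. Check the current server's in-memory cache.
        if (_localCache.Contains(absoluteUri))
        {
            return true;
        }

        // 2. Place a socket call to every other crawler instance.
        foreach (ICrawlerPeer peer in _peers)
        {
            if (peer.ContainsDiscovery(absoluteUri))
            {
                return true;
            }
        }

        // 3. Finally, check the database.
        return DiscoveryExistsInDatabase(absoluteUri);
    }

    private bool DiscoveryExistsInDatabase(string absoluteUri)
    {
        // Query the Discoveries table here (omitted).
        return false;
    }
}

// Hypothetical contract wrapping the socket call to a peer crawler.
public interface ICrawlerPeer
{
    bool ContainsDiscovery(string absoluteUri);
}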

If you need this functionality right away I can be contracted to implement it for you.

- Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
1,905 Posts

I am making a change to the core which will allow you to adjust the CrawlActions, CrawlRules and EngineActions before Crawling, per Crawl instance, just like you can with ApplicationSettings and WebSettings.

http://arachnode.net/blogs/arachnode_net/archive/2010/05/06/controlling-configuration-from-code.aspx

Stay tuned...

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
1,905 Posts

See this post: http://arachnode.net/forums/p/1188/12433.aspx#12433

You can now selectively turn on/off and configure AN plugins before crawling.
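
Conceptually, usage looks something like the sketch below (the collection, property, and constructor names are assumptions based on this description; see the linked post for the actual API):

// Hedged sketch: selectively enabling plugins from code before a crawl,
// per the description above. Type, property, and constructor names are
// assumptions, not the verified AN API.
Crawler crawler = new Crawler(CrawlMode.DepthFirstByPriority);

// Enable only the rule this instance should run.
foreach (ACrawlRule crawlRule in crawler.CrawlRules)
{
    crawlRule.IsEnabled = crawlRule is ReviewSiteRule;
}

// CrawlActions and EngineActions can be toggled the same way before
// crawling begins.
foreach (AEngineAction engineAction in crawler.Engine.EngineActions)
{
    engineAction.IsEnabled = false;
}

crawler.Crawl();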

-Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

copyright 2004-2017, arachnode.net LLC