arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL 2005/2008/CE

How to handle a webservice with articles

Answered (Verified) This post has 1 verified answer | 10 Replies | 2 Followers

Peter posted on Wed, Mar 21 2012 6:45 AM

Hi all,

We would like to use Arachnode to crawl a webservice that contains articles. This means I do not have links; I have one service that I would like to call myself, and then add each Article object for spidering. How should I start with something like this? Is Arachnode a good choice for me?

Is the crawler only useful for websites, or can it also be used with a service in the way I describe?

All Replies

Top 10 Contributor
1,714 Posts

Yes, you can use AN for this.

You would call your webservice initially to give AN something to crawl. If you then need to call your webservice again, you would do that in a plugin and add the results of the webservice to the queue.

Does this answer your question?

Mike 

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Peter replied on Thu, Mar 22 2012 12:46 AM

Hi,

Yes, it's a start :) Is there some kind of document on how the execution of the crawler works? I didn't find any documents on the execution plan of the crawler, and the code does not contain any comments on the classes, so I have no idea how to start this.

Is there by any chance a sample project in which this is explained? Or could you give me some direction on what type of plugin I would have to write and where to find the queue?

Peter replied on Thu, Mar 22 2012 1:11 AM

Another question: if we buy a licence, would we get the code that is currently scrambled in the example code?

Top 10 Contributor
1,714 Posts
Verified by Peter

Wonderful!

The installation instructions will answer a good chunk of your questions on how the crawler works: http://arachnode.net/Content/InstallationInstructions.aspx

The released (SVN) code contains a ton of comments on the code, and Program.cs is especially helpful in describing what occurs when.

How to create a plugin: http://arachnode.net/Content/CreatingPlugins.aspx

To add CrawlRequests to be crawled either use a reference to the Crawler in the instantiating application (_crawler.Crawl(...);) or in a Plugin via crawlRequest.Crawler.Crawl(...);  http://arachnode.net/forums/t/766.aspx

Just about any question you could ask has been answered in the forums.  The search isn't great, but it isn't bad either.

I am always glad to answer questions and provide personalized documentation in the form of forum posts, email and/or videos describing how to do what you'd like to do.

Crawler execution plan, in short: CrawlRequests are submitted to the Crawler; if a CrawlRequest is new, it is added to the Cache.  When the Crawl threads need new things to crawl, they ask the Engine for a batch of CrawlRequests.  The CrawlRequests are shuffled by domain, priority and the time the domain was last contacted.  Plugins run before contacting a site, after sending the initial "get" request and after the content has been downloaded.  Asynchronous discovery processors save the data while the paired Crawl thread downloads additional Discoveries.  The process repeats until there is nothing left to crawl.
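
To make the execution plan above a little more concrete, here is a rough, self-contained sketch of the "submit, de-duplicate, hand out batches" part. It is not arachnode.net code: SketchRequest and SketchEngine are hypothetical stand-ins, and the exact ordering rule (least recently contacted domain first, then higher priority first) is an assumption made for illustration. Plugins and the asynchronous discovery processors are left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for illustration only; these are NOT arachnode.net's classes.
public sealed class SketchRequest
{
    public SketchRequest(string absoluteUri, double priority)
    {
        AbsoluteUri = absoluteUri;
        Priority = priority;
    }

    public string AbsoluteUri { get; private set; }
    public double Priority { get; private set; }
    public string Domain { get { return new Uri(AbsoluteUri).Host; } }
}

public sealed class SketchEngine
{
    private readonly HashSet<string> _seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
    private readonly List<SketchRequest> _cache = new List<SketchRequest>();
    private readonly Dictionary<string, DateTime> _lastContacted = new Dictionary<string, DateTime>();

    // Only requests that have not been seen before are added to the cache.
    public bool Submit(SketchRequest request)
    {
        if (!_seen.Add(request.AbsoluteUri)) return false;
        _cache.Add(request);
        return true;
    }

    // Hands a crawl thread its next batch. Assumed ordering: domains contacted
    // least recently come first, then higher priority comes first.
    public List<SketchRequest> NextBatch(int size, DateTime now)
    {
        List<SketchRequest> batch = _cache
            .OrderBy(r => LastContacted(r.Domain))
            .ThenByDescending(r => r.Priority)
            .Take(size)
            .ToList();

        foreach (SketchRequest request in batch)
        {
            _cache.Remove(request);
            _lastContacted[request.Domain] = now;
        }
        return batch;
    }

    // Domains that were never contacted sort ahead of everything else.
    private DateTime LastContacted(string domain)
    {
        DateTime when;
        return _lastContacted.TryGetValue(domain, out when) ? when : DateTime.MinValue;
    }
}
```

A crawl thread in this sketch would simply loop on NextBatch until it comes back empty, which corresponds to "the process repeats until there is nothing left to crawl".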

Thanks,
Mike


Top 10 Contributor
1,714 Posts

Yes.  There are 6 or so additional helper projects included in the released source as well as an additional crawler.


Peter replied on Mon, Mar 26 2012 5:02 AM

Very complete answers, thank you very much. My investigation continues :)

I'll keep you posted.

Top 10 Contributor
1,714 Posts

Of course!  I'm here to help.


Peter replied on Tue, Apr 10 2012 1:05 AM

Hi,

We're trying to understand how to get data from our webservice and pass it on to Arachnode. The only example of how to pass data into the crawler I can find is:

_crawler.Crawl(new CrawlRequest(new Discovery("http://tmz.com"), 2, UriClassificationType.None, UriClassificationType.None, 1, RenderType.None, RenderType.None));

This is a good way to send in data that sits behind a website address, but how do I send in data directly? I do not have a link; I have a SOAP service. I'm calling the service myself, and after that I would like to send that data into the crawler.

Regards

Top 10 Contributor
1,714 Posts

Give me a moment and I'll code up a quick example.


Top 10 Contributor
1,714 Posts

To give AN something to crawl initially from a WebService you'll need to create a plugin, or customize an existing one like I did for this example.

Here's how to create a new plugin: http://arachnode.net/Content/CreatingPlugins.aspx

I added a method to ManageLuceneDotNetIndexes, as this plugin is already wired up and ready to go in the default configuration from SVN.

Obviously, you will need logic that tells the plugin when the first query is to be made.

As you are using a SOAP service, you can simply add a WebReference in the same project where your plugin exists.

I used AN's demo search in this example.  As I understand your problem, your webservice returns data that contains information on what to crawl next, right?  With each WebPage that is queued to crawl, the Plugin will fire and you'll have the opportunity to insert new CrawlRequests.  To give AN something to crawl initially, call the Plugin's method from Program.cs, or from wherever else you instantiate AN.
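
As a rough sketch of the seeding step described above (not the actual plugin code), the following assumes the webservice can return absolute URIs for its articles. IArticleService, GetArticleUris, WebServiceSeeder and StubArticleService are all hypothetical names invented for this illustration; your SOAP proxy will look different. The submit delegate stands in for the _crawler.Crawl(new CrawlRequest(new Discovery(uri), ...)) call quoted earlier in this thread, so the sketch can run without AN or a live service.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shape of the SOAP proxy generated by a Web Reference; your service will differ.
public interface IArticleService
{
    IEnumerable<string> GetArticleUris();
}

public static class WebServiceSeeder
{
    // Calls the service once and hands every well-formed absolute URI to 'submit'.
    // In a real setup 'submit' would wrap the crawler call quoted earlier in the thread;
    // it is a plain delegate here so that the sketch runs on its own.
    public static int Seed(IArticleService service, Action<string> submit)
    {
        int submitted = 0;
        foreach (string candidate in service.GetArticleUris())
        {
            Uri parsed;
            if (!Uri.TryCreate(candidate, UriKind.Absolute, out parsed))
            {
                continue; // skip anything the service returns that is not an absolute URI
            }

            submit(parsed.AbsoluteUri);
            submitted++;
        }
        return submitted;
    }
}

// Stub used only to exercise the sketch without a live service.
public sealed class StubArticleService : IArticleService
{
    public IEnumerable<string> GetArticleUris()
    {
        return new[]
        {
            "http://example.com/articles/1",
            "not a uri",
            "http://example.com/articles/2"
        };
    }
}
```

The same Seed call could be made once from Program.cs for the initial batch and again from inside a plugin for follow-up batches; only the delegate passed as submit would change.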

Does this make sense?

Thanks,
Mike 



copyright 2004-2013, arachnode.net LLC