We would like to use Arachnode to crawl a webservice that contains articles. This means I do not have links; I have one service which I would like to call myself, and then add each Article object for spidering. How should I start with something like this? Is Arachnode a good choice for me?
Is the crawler only useful for websites, or could it also be used with a service the way I have in mind?
The installation instructions will answer a good chunk of your questions on how the crawler works: http://arachnode.net/Content/InstallationInstructions.aspx
The released (SVN) code contains a ton of comments, and Program.cs is especially helpful in describing what occurs when.
How to create a plugin: http://arachnode.net/Content/CreatingPlugins.aspx
To add CrawlRequests to be crawled, either use a reference to the Crawler in the instantiating application (_crawler.Crawl(...);), or, in a Plugin, call crawlRequest.Crawler.Crawl(...); see http://arachnode.net/forums/t/766.aspx
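For example, here's a minimal sketch of both approaches (the CrawlRequest constructor parameters are illustrative -- check the overloads in the SVN source for your version, and note the using directives for AN's namespaces are omitted):

// From the instantiating application -- _crawler is your Crawler instance.
_crawler.Crawl(new CrawlRequest(new Discovery("http://example.com"), 2,
    UriClassificationType.None, UriClassificationType.None, 1,
    RenderType.None, RenderType.None));

// From inside a plugin -- the CrawlRequest being processed carries a
// reference to its Crawler, so new requests can be queued mid-crawl.
crawlRequest.Crawler.Crawl(new CrawlRequest(new Discovery("http://example.com/page2"), 1,
    UriClassificationType.None, UriClassificationType.None, 1,
    RenderType.None, RenderType.None));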
Just about any question you could ask has been answered in the forums. The search isn't great, but it isn't bad either.
I am always glad to answer questions and provide personalized documentation in the form of forum posts, email and/or videos describing how to do what you'd like to do.
Crawler execution plan, in short:
1. CrawlRequests are submitted to the Crawler; if a CrawlRequest is new, it is added to the Cache.
2. When the Crawl threads need new things to crawl, they ask the Engine for a batch of CrawlRequests.
3. The CrawlRequests are shuffled by domain, priority and the time the domain was last contacted.
4. Plugins run before contacting a site, after sending the initial "get" request, and after the content has been downloaded.
5. Asynchronous discovery processors save the data while the paired Crawl thread downloads additional Discoveries.
6. The process repeats until there is nothing left to crawl.
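In pseudocode, one Crawl thread's loop looks roughly like this (a conceptual sketch only -- the names below are illustrative, not AN's actual types or methods):

// Conceptual sketch of the execution plan above; illustrative names only.
while (true)
{
    // The Crawl thread asks the Engine for a batch of CrawlRequests,
    // shuffled by domain, priority and last-contact time.
    List<CrawlRequest> batch = engine.GetNextBatch();

    if (batch.Count == 0)
    {
        break; // nothing left to crawl
    }

    foreach (CrawlRequest crawlRequest in batch)
    {
        RunPreRequestPlugins(crawlRequest);   // before contacting the site
        SendInitialGet(crawlRequest);         // the initial "get" request
        RunPostRequestPlugins(crawlRequest);  // after the initial "get"
        DownloadContent(crawlRequest);
        RunPostDownloadPlugins(crawlRequest); // after content is downloaded

        // An asynchronous discovery processor saves the data while this
        // thread moves on to download additional Discoveries.
        discoveryProcessor.Enqueue(crawlRequest);
    }
}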
Yes, you can use AN for this.
You would call your webservice initially to provide AN with something to crawl; then, if you needed to call your webservice again, you would do so in a plugin and add the results to the queue.
Does this answer your question?
Yes, it's a start :) Is there some kind of document on how the execution of the crawler works? I didn't find any documents on the crawler's execution plan, and the code does not contain any comments on the classes, so I have no idea how to start this.
Is there by any chance a sample project in which this is explained? Or could you give me some direction on what type of plugin I would have to write and where to find the queue?
Another question: if we were to buy a license, would we get the code that is now scrambled in the example code?
Yes. There are 6 or so additional helper projects included in the released source as well as an additional crawler.
Very complete answers, thank you very much, my investigation continues :)
I'll keep you posted
Of course! I'm here to help.
We're trying to understand how to get data from our webservice and pass it on to Arachnode. The only example of how to pass data into the crawler I can find is:
_crawler.Crawl(new CrawlRequest(new Discovery("http://tmz.com"), 2, UriClassificationType.None, UriClassificationType.None, 1, RenderType.None, RenderType.None));
This is a good way to send in data that sits behind a website address, but how do I send in data directly? I do not have a link; I have a SOAP service. I'm calling the service myself; after that, I would like to send the data into the crawler.
Give me a moment and I'll code up a quick example.
To give AN something to crawl initially from a WebService, you'll need to create a plugin, or customize an existing one as I did for this example.
Here's how to create a new plugin: http://arachnode.net/Content/CreatingPlugins.aspx
I added a method to ManageLuceneDotNetIndexes, as this plugin is wired up and ready to go in the default configuration from SVN.
Obviously, there will need to be logic that instructs the plugin when the first query should be made.
As you are using a SOAP service, you can simply add a WebReference in the same project where your plugin exists.
I used AN's demo search in this example. As I understand your problem, your webservice returns data that contains information on what to crawl next, right? The Plugin will fire for each WebPage that is queued to crawl, and you'll have the opportunity to insert new CrawlRequests. To give AN something to crawl initially, call the Plugin's method from Program.cs, or from wherever else you instantiate AN.
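Roughly, the method I added looks like this (a sketch only: ArticleServiceClient and GetNextArticleUrls are hypothetical stand-ins for the proxy your WebReference generates, and the plugin base class and override signature should be verified against the SVN source):

// Sketch of a CrawlAction-style plugin that queries a SOAP service and
// queues whatever it returns for crawling.
public class QueueFromWebServiceAction : ACrawlAction
{
    public override void PerformAction(CrawlRequest crawlRequest)
    {
        // Hypothetical proxy generated from your WebReference.
        ArticleServiceClient articleService = new ArticleServiceClient();

        // Ask the service what to crawl next, based on the page just crawled.
        foreach (string articleUrl in articleService.GetNextArticleUrls(
            crawlRequest.Discovery.Uri.AbsoluteUri))
        {
            // Queue each result, reusing the constructor from the example above.
            crawlRequest.Crawler.Crawl(new CrawlRequest(new Discovery(articleUrl), 1,
                UriClassificationType.None, UriClassificationType.None, 1,
                RenderType.None, RenderType.None));
        }
    }
}

For the very first request, call the same service from Program.cs and submit the results with _crawler.Crawl(...) before the crawl starts.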
Does this make sense?