arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE or MongoDB/RavenDB/Hadoop

Dynamic Content

posted on Sun, Sep 19 2010 10:32 AM

Hi Mike,

I am looking to extract data from some publicly available websites. By web data, I mean all the data available in a portal. For example, I want to extract all the digital camera makes, models, features, and prices from Amazon.com, including the dynamic data that is generated when a search is submitted. Can I use arachnode.net for this purpose?

 

All Replies

Top 10 Contributor
1,905 Posts

Absolutely.  I crawl several major sites with this exact purpose.

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

replied on Mon, Sep 20 2010 12:07 AM

Hi Mike,

We are using the trial version of arachnode.net to see whether it meets the requirements described above. With the trial version we can crawl and download the static HTML pages of a particular website, but it fails to gather dynamic data (typically the search results on a product page of an e-commerce website).

Is the algorithm for discovering dynamic pages part of the arachnode.net code (licensed version)?

If yes, do you have a write-up or demo to guide us in using arachnode.net to extract dynamic data?

We are in the final stages of evaluating your software for purchase.

Thanks,

Yathish

Top 10 Contributor
1,905 Posts

Great!

I haven't made an official write-up, since you are only the second person to seriously inquire about dynamic rendering.

Here's how it works:

Enable the Renderers in the Crawler constructor (the second parameter).

Then, when you create a CrawlRequest, enable Dynamic rendering for the CrawlRequest and its children.

Note that you probably won't be able to hover over values from CrawlRequest.HtmlDocument while debugging, because of the work I have done to allow multiple threads to use the IE rendering engine independently.  (You're not really supposed to be able to do this...)  You can, however, assign variables from the HtmlDocument property, and this will work.  (string innerHtml = crawlRequest.HtmlDocument.Body.InnerHtml, etc.)

Top 150 Contributor
2 Posts

When we set the enableRenderers parameter to true and set the RenderType to Dynamic, the crawler keeps running (I stopped it manually after 20 minutes), but the Files, Images and WebPages folders remain empty. Could you please review the code snippets pasted below and let us know what we are doing wrong?

Figure 1: (code snippet image not preserved)

Figure 2: (code snippet image not preserved)

Top 10 Contributor
1,905 Posts

Check your ApplicationSettings class, or cfg.Configuration.  You likely switched build flavors, and your Files, etc. are in the \Demo folder.

Top 150 Contributor
2 Posts

We have not made any changes to either the ApplicationSettings class or the cfg.Configuration table. Would it be possible for you to assist us over Skype in setting up arachnode.net for crawling dynamic pages? If yes, please let us know your available time and Skype ID by sending an email to [email protected]. This will help us decide quickly on purchasing your product.

Top 10 Contributor
1,905 Posts

I will make a video tonight using the demo code showing you how to crawl Dynamically.  I do not have Skype.

Please provide me with an AbsoluteUri that contains dynamic content.  (e.g. http://amazon.com/etc.)

Mike

replied on Wed, Sep 22 2010 10:20 AM

Hi Mike,

Thank you for the reply.

You can show us a demo on this website: http://www.carsales.com.au/

Thanks,

Yathish

 

 

Top 10 Contributor
1,905 Posts

I wouldn't use Dynamic (AJAX and form submission) for this webpage.  The Dynamic modes in AN are for (1) submitting form variables to a website and (2) rendering AJAX content.

Instead, I would learn how the site's query strings are built and submit direct requests for the data.

Like this: http://www.carsales.com.au/all-cars/results.aspx?PriceTo=442&Ntt=red&tsrc=allcarhome&keywords=red&N=1216+1246+1247+1252+1282+4294967249+461+442&PriceFrom=461&Ntk=CarAll&Dx=mode+matchany&Nne=15&Ntx=mode+matchallpartial&D=red

You can change a portion of the query string to the correct parameters, and vary your searches this way.  For example, this request differs from the one above in a single value of the N parameter:

http://www.carsales.com.au/all-cars/results.aspx?PriceTo=442&Ntt=red&tsrc=allcarhome&keywords=red&N=1216+1246+1247+1252+1282+1090+461+442&PriceFrom=461&Ntk=CarAll&Dx=mode+matchany&Nne=15&Ntx=mode+matchallpartial&D=red

So, if the query strings are available, don't use Dynamic mode, unless, of course, you can't figure out the scheme.  This one doesn't look that difficult.
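
The query-string approach described above can be sketched in a few lines. This is only an illustration, written in Python for brevity and not part of arachnode.net; the helper name `with_params` is hypothetical. It takes the first carsales.com.au URL from this post, replaces selected parameters, and rebuilds the URL, which could then be submitted as an ordinary crawl request. What each parameter means to the site is not documented here, so the values are simply the ones that appear in the two URLs above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# The first search URL quoted in the post above.
BASE_URL = (
    "http://www.carsales.com.au/all-cars/results.aspx"
    "?PriceTo=442&Ntt=red&tsrc=allcarhome&keywords=red"
    "&N=1216+1246+1247+1252+1282+4294967249+461+442"
    "&PriceFrom=461&Ntk=CarAll&Dx=mode+matchany&Nne=15"
    "&Ntx=mode+matchallpartial&D=red"
)


def with_params(url, **overrides):
    """Return url with the given query parameters replaced or added.

    Hypothetical helper. It assumes each parameter name occurs once,
    which holds for the URLs in this thread.
    """
    parts = urlsplit(url)
    # parse_qsl decodes "+" to a space; urlencode turns it back into "+".
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    params.update(overrides)
    return urlunsplit(parts._replace(query=urlencode(params)))


# Swap one value inside N; this yields the second URL quoted in the post.
second_url = with_params(BASE_URL, N="1216 1246 1247 1252 1282 1090 461 442")
print(second_url)
```

Generating one URL per combination of parameter values gives a list of direct requests to crawl, with no need for the IE rendering engine.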

Top 10 Contributor
1,905 Posts

It's been an exhausting day.  I do recommend that you not use the IE DOM for this site, but if you have another site that may be more appropriate, I can take a look and give you my recommendations, likely Saturday morning.

copyright 2004-2017, arachnode.net LLC