AJAX/Dynamic Content

The question of "How do I interact with Dynamic/AJAX content?" comes up often enough that it warrants a dedicated blog post.

The Renderers execute out-of-process, either as instances of AxWebBrowser or as the headless (no drawing of graphical content) variant built on core MSHTML.  The Renderers DO NOT use the .NET WebBrowser control, which STILL holds on to private bytes and offers no control over downloaded content: no Proxy support, no filtering of downloaded content, and, in effect, downloading the page and all of its content twice while alerting the site to your presence.  AN's Renderer handling also accounts for the SPECIFIC threading settings which enable the headless version of the Renderers (the DEFAULT) to run and debug.  If you haven't coded COM extensively, you probably don't want to code this yourself.

  • Set the 'Renderer' Project as the start-up Project.
  • Set private bool _debugSingleAbsoluteUri = true;

  • Start with Debugging.
  • Click 'Test'.

Notice there are two tabs on the hosting control.

1.) Source - this is the Rendered HTML source.
2.) Display - this tab is populated with content if private bool _useAxWebBrowser = true; 

Notice the Facebook logo is missing from the 'Display' tab.  This is intentional - we want to Render ONLY the HTML as downloaded and restrict/block all other ContentTypes, passing the Rendered HTML back to arachnode.net to process as if we had downloaded the HTML via an HttpWebRequest or from the standard .NET WebClient.

The default (and preferred) method for Rendering is the Render\HtmlRenderer.cs class.  This class encapsulates MSHTML and handles all of the COM interop.

The options for the AxWebBrowser (intended for DEBUG purposes) and for Render\HtmlRenderer.cs (the default for Dynamic/AJAX Rendering) are set in the Renderer project.  You shouldn't need to change them.

  • Switch back to 'Console' as the startup Project.
  • Enable the Renderers in the Crawler Constructor.
  • Set private bool _debugSingleAbsoluteUri = false; (and be sure to Revert any changes you may have made...)

The site we have selected to Crawl is http://designboom.com

The Renderers run out of process, which is extremely important - imagine running an 8-hour crawl only to have it crash with no way to resume.  If a Renderer crashes, another is instantiated and crawling continues uninterrupted.  (Yes, even Chrome crashes...)

Messages are sent via an IPC MessageQueue.
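
AN's cross-process wiring is already written for you, but as orientation, here is a minimal sketch of the general pattern - a .NET Remoting IpcChannel publishing a MarshalByRefObject which the crawling process calls through a transparent proxy.  The class and endpoint names are illustrative, not AN's internals.

    using System;
    using System.Runtime.Remoting;
    using System.Runtime.Remoting.Channels;
    using System.Runtime.Remoting.Channels.Ipc;

    // Illustrative endpoint: the Renderer process publishes this object; the
    // crawling process only ever receives a transparent proxy to it.
    public class RendererMessageQueue : MarshalByRefObject
    {
        public string Render(string absoluteUri)
        {
            // A real Renderer hands the AbsoluteUri to MSHTML and returns the
            // Rendered HTML once the DOM has completed.
            return "<html>...</html>";
        }
    }

    public static class RendererHost
    {
        public static void Main()
        {
            // Register the IPC channel and publish the singleton endpoint.
            ChannelServices.RegisterChannel(new IpcChannel("renderer"), false);
            RemotingConfiguration.RegisterWellKnownServiceType(
                typeof(RendererMessageQueue), "queue", WellKnownObjectMode.Singleton);

            Console.WriteLine("Renderer listening on ipc://renderer/queue");
            Console.ReadLine();
        }
    }

    // In the crawling process:
    // var queue = (RendererMessageQueue)Activator.GetObject(
    //     typeof(RendererMessageQueue), "ipc://renderer/queue");
    // string renderedHtml = queue.Render("http://designboom.com/");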

  • Start the Console.

You'll see a Renderer present itself.  The Forms host is a standard Windows Form and can be hidden for production purposes; DEBUGGING/LOGGING may be disabled as well.

If we enable the AxWebBrowser we can see the DEBUG output of what our Renderer brought down.  Notice there are no styles/images - arachnode.net downloads those once the Rendered HTML is handed back to the main crawling process.

DesignBoom is a popular type of site where new content is presented as the user scrolls down vertically.  How do we crawl this?

  • Switch back to the Renderers project as the startup project.

  • Check the settings.

As the Renderers communicate with the crawling process via a Proxy, we'll have to run our scrolling behavior from the Renderer processes.

I added a quick Async method to allow the host control to request new content.  Notice that the debugger visualizer has downloaded images - the Renderers do not, by default; the main crawling process does.  One critical element was omitted for the benefit of licensed users.
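
Since one critical element is intentionally omitted, what follows is only a generic sketch of the scrolling idea - not AN's actual implementation.  It assumes a Microsoft.mshtml (PIA) reference and that the awaits resume on the Renderer's STA thread via the WinForms SynchronizationContext.

    using System.Threading.Tasks;
    using mshtml;

    // Generic sketch: scroll to the bottom of the document until its height
    // stops growing, pausing so the site's AJAX requests can complete.
    private async Task ScrollToEndAsync(IHTMLDocument2 document, int delayInMilliseconds = 2000)
    {
        IHTMLWindow2 window = document.parentWindow;

        int lastScrollHeight = 0;

        while (true)
        {
            IHTMLElement2 body = (IHTMLElement2)document.body;

            if (body.scrollHeight == lastScrollHeight)
            {
                break;    // no new content was appended - we've reached the end
            }

            lastScrollHeight = body.scrollHeight;

            window.scroll(0, lastScrollHeight);    // fires the site's scroll handlers

            // The WinForms SynchronizationContext returns us to the STA thread.
            await Task.Delay(delayInMilliseconds);
        }
    }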

The next scroll will show different content...

[MORE TOMORROW - FINISH ME]

What about the case where you simply want to have a page rendered in order to process dynamic links?

Let's consider this case: http://www.sparinvest.de/fund%20range/all%20funds.aspx

If we request this page via an HttpWebRequest we find plenty of links, but no .pdf documents.
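
You can verify this with nothing beyond the BCL - an illustrative static fetch confirming the raw HTML contains no .pdf links:

    using System;
    using System.IO;
    using System.Net;

    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(
        "http://www.sparinvest.de/fund%20range/all%20funds.aspx");

    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (StreamReader reader = new StreamReader(response.GetResponseStream()))
    {
        string html = reader.ReadToEnd();

        // The raw HTML contains plenty of anchors, but the .pdf links are
        // injected dynamically and never appear in the static source.
        Console.WriteLine(html.IndexOf(".pdf", StringComparison.OrdinalIgnoreCase) >= 0
            ? ".pdf links present in the static HTML."
            : "No .pdf links - the documents are injected dynamically.");
    }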

  • Enable the Renderers.
  • Re-run the Crawl.

An interesting/unique observation about this particular example is that an iframe is placed in the DOM, and this iframe is then modified by JavaScript to set its location.

Let's take a look at this location.  I am coding in the private static void Engine_CrawlRequestCompleted(CrawlRequest<ArachnodeDAO> crawlRequest) callback in Console\Program.cs just to keep things simple.

As the objects we use to communicate the rendered HTML are Remoting Proxies, they cannot be used directly - only fields may be examined.  This is a side effect/throwback to the limitations of COM/.NET Remoting/MarshalByRef.

However, we can customize what AN does inside of the Renderers with RendererActions...

After the DOM has completely Rendered (excluding images/ActiveX/frames) we have a chance to examine the DOM and return properties which would otherwise not be available cross-process due to the aforementioned COM limitations.
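
The RendererAction base types vary between AN versions, so treat the following as a sketch of the idea only: inside the Renderer process, walk the completed DOM for iframes and return their 'src' attributes as plain strings which can safely cross the process boundary.

    using System.Text;
    using mshtml;

    // Sketch of a RendererAction body (the real base class lives in the
    // Renderer project): collect each iframe 'src' so it crosses the process
    // boundary as a simple string rather than as a COM/Remoting proxy.
    public string ExtractIFrameSources(IHTMLDocument2 document)
    {
        StringBuilder propertiesValues = new StringBuilder();

        IHTMLDocument3 document3 = (IHTMLDocument3)document;

        foreach (IHTMLElement element in document3.getElementsByTagName("iframe"))
        {
            object src = element.getAttribute("src", 0);

            if (src != null)
            {
                propertiesValues.AppendLine(src.ToString());
            }
        }

        return propertiesValues.ToString();
    }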

If we examine Console\Program.cs we can see that the crawlRequest.RendererMessage.PropertiesValues property has returned the 'src' of an iframe.  Remember, AN does not Render iframes by default - this is desired and the optimal behavior.

  • Resubmit this AbsoluteUri back to the Crawler, as sketched below.  (the best place to do so would be in a Plugin: https://arachnode.net/Content/CreatingPlugins.aspx)
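
Here is a sketch of that callback, combining the examination and the resubmission.  The CrawlRequest and Discovery constructor arguments shown are assumptions and differ between AN versions - check your Console\Program.cs for the exact overloads.

    // Sketch only - constructor arguments are assumptions; consult your version.
    private static void Engine_CrawlRequestCompleted(CrawlRequest<ArachnodeDAO> crawlRequest)
    {
        if (crawlRequest.RendererMessage == null ||
            string.IsNullOrEmpty(crawlRequest.RendererMessage.PropertiesValues))
        {
            return;
        }

        // The iframe 'src' our RendererAction returned from the Renderer process.
        string iFrameSrc = crawlRequest.RendererMessage.PropertiesValues.Trim();

        Console.WriteLine("Discovered iframe location: " + iFrameSrc);

        // Resubmit the discovered AbsoluteUri - ideally from a Plugin (CrawlAction).
        _crawler.Crawl(new CrawlRequest<ArachnodeDAO>(new Discovery<ArachnodeDAO>(iFrameSrc),
            1, UriClassificationType.Host, UriClassificationType.Host, 1));
    }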

Now, we have the .pdf files which were previously obfuscated by an iframe/JavaScript.

  • Check on disk - notice our cache of documents.

Remember, AN takes care of all of the cross-process Marshal wiring, Proxy support, and Cookie support.

Another example: https://server.capgroup.com/capgroup/action/getContent/GIG/Europe/Ind-investors-DE-de/Landing/Introduction/DE

Always look for a non-Rendering solution:

  • Crawling is faster / no JavaScript compilation/execution.
  • UI Elements / Style Elements change much more frequently than back-end code.

We aim to collect the .pdfs from this site, but we are presented with a Terms of Service.  Do we need to click through this disclaimer to access the HTML we are interested in?  No.  The source is exactly the same before we click and after.

When we view the page after clicking 'Accept' we are presented with a list of .pdfs we'd like to collect.

If a human were to perform each interaction to collect each .pdf, they would select each Fund (Fond) and each Currency (Währung) and then click on each .pdf.  While this is an acceptable solution, Rendering each page is not necessary - in MOST cases there is a static way to interact with the WebPage (with or without basic HTML Rendering).  Clicking buttons is seemingly easier, but layouts/CSS change much more frequently than database/backend code.

Select one of the options, say 'Capital Group Emerging Markets Debt Fund'.  Notice that when we do, the AbsoluteUri does not change - JavaScript executes a partial page change.  While this page is served over HTTPS, we can easily determine the callback location using Fiddler.

Let's take a look at the source.  Notice we have distinct integer values to work with.

From this list, per our business requirements, we have a list of AbsoluteUris which will be called from the UI when a user (or process) selects a value from the 'Funds' ComboBox.

Now that we know the option codes (Option 601, for example), the callback (https://server.capgroup.com/capgroup/fundaction/microsite?method=loadLiteratureDocs&fundcode=601&currency=USD&audience=Ind-investors-DE-de&type=KIID), our .pdf type and location (Ind-investors-DE-de & KIID - see the Fiddler screenshot for the other possible types: AR, SAR, etc.), and the desired currency, we have a templated list of links to submit for crawling.  If we click on a .pdf (watching Fiddler), we can discover which AbsoluteUri is responsible for serving the .pdf files.

Now, we know the document handler AbsoluteUri!

Putting all of this together, we can skip Rendering entirely: read the option values from the Option Tag to discover this site's internal fund coding scheme, notice that these values are submitted to a templated AbsoluteUri, and note that the location 'DE-de' uses standard language codes.  From this we can derive a location which returns the specific .pdf file names: https://server.capgroup.com/capgroup/fundaction/microsite?method=loadLiteratureDocs&fundcode=601&currency=USD&audience=Ind-investors-DE-de&type=KIID

Now that we have the exact file names, we submit them to the file-handler AbsoluteUri we discovered with Fiddler and obtain our .pdfs.

(https://server.capgroup.com/capgroup/action/openpdffile/KIID_CGEBLU_A_EUR_German(DE)_LU0174781335.pdf?method=openfile&filename=KIID_CGEBLU_A_EUR_German(DE)_LU0174781335.pdf)
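
Putting the whole non-Rendering pipeline into one illustrative sketch (the fund codes, currencies, and document types are just the ones observed above; in practice you would parse them from the Option Tags and the Fiddler capture):

    using System;
    using System.Collections.Generic;

    // Values observed above - in practice, parse the fund codes from the
    // <option> values and the document types from the Fiddler capture.
    int[] fundCodes = { 601 };
    string[] currencies = { "USD", "EUR" };
    string[] documentTypes = { "KIID", "AR", "SAR" };

    const string listTemplate =
        "https://server.capgroup.com/capgroup/fundaction/microsite" +
        "?method=loadLiteratureDocs&fundcode={0}&currency={1}" +
        "&audience=Ind-investors-DE-de&type={2}";

    const string fileTemplate =
        "https://server.capgroup.com/capgroup/action/openpdffile/{0}" +
        "?method=openfile&filename={0}";

    // 1.) Template the literature-list callbacks; each response enumerates the
    //     exact .pdf file names for that fund/currency/type combination.
    List<string> absoluteUris = new List<string>();

    foreach (int fundCode in fundCodes)
        foreach (string currency in currencies)
            foreach (string documentType in documentTypes)
                absoluteUris.Add(string.Format(listTemplate, fundCode, currency, documentType));

    // 2.) For each file name parsed from those responses, template the
    //     document-handler AbsoluteUri and submit it for crawling.
    string fileName = "KIID_CGEBLU_A_EUR_German(DE)_LU0174781335.pdf";    // example from above

    Console.WriteLine(string.Format(fileTemplate, fileName));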

If the .pdf links were presented as HTML, rather than as callbacks to a JavaScript redirect function, we surely would have used the Rendering capabilities as we did in the previous example.

 


Posted Wed, Apr 1 2015 9:16 AM by arachnode.net