How to crawl specific text from authenticated webpage


Dinesh posted on Fri, Jun 7 2013 12:47 PM

Hi All,

I want to crawl specific text from an authenticated web page. For instance, I need to capture specific text from a Google Plus page. How should I pass the credentials and capture the required text?

Thanks

 

Verified Answer

arachnode.net replied:

Use the _crawler.CredentialCache.

http://arachnode.net/search/SearchResults.aspx?q=authenticated

http://arachnode.net/search/SearchResults.aspx?q=CredentialCache
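As a rough sketch (assuming _crawler.CredentialCache is a standard System.Net.CredentialCache; the host, scheme and credentials below are placeholders, not specific to any site):

// Sketch only: register credentials for a host that challenges with HTTP authentication.
// "Basic" could also be "Digest" or "NTLM", depending on what the server requests.
_crawler.CredentialCache.Add(
    new Uri("https://www.example.com"),                          // placeholder host
    "Basic",                                                     // authentication scheme
    new System.Net.NetworkCredential("userName", "password"));   // placeholder credentials

For sites that log you in with a web form and cookies (as Google Plus does), the cookie approach discussed further down the thread applies instead.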

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies


Dinesh replied on Sun, Jun 9 2013 10:52 AM

Hi Mike,

We followed the steps below to capture specific content from a web page (i.e. https://plus.google.com/+amazon/posts):

1. Followed this link:

http://arachnode.net/blogs/arachnode_net/archive/2011/06/02/how-to-crawl-facebook-com-twitter-com-and-linkedin-com.aspx

2. Modified Program.cs as below:

//custom cookie processing...
CookieCollection cookieCollection = CookieManager.BuildCookieCollection("NID=67=Y55w4a6iypmHI_SYNqNSLKELDY53iHtst-2qv3s9yRS40_-kna7k-8LjEeB1yPMZ-NKoZJ3bUJ6xn9LfkCutEK8Yj040MkIvOA6-L_s_Xzz30cngXbUaSI-pdQuRNv4ZiTj-xgoS_6PzjWW-IqXwLsYMmjB_FLoUqw0dy2n2sS1ASIjhEgA8w6WYn9eI6Hf4y4LcCexl8ps8BwmDEurmQs3jqSxfvz_btCLVMepXGYjj3iMV; PREF=ID=e1d426a649ec3ee9:U=2c47c3b15a6ef019:FF=0:LD=en:CR=2:TM=1367823931:LM=1367951063:GM=1:S=TGP-ZOSD_yu01pJV; SID=DQAAAMsAAACGF_l_qmGGLU77h-hFGUU2n9rF0BxBuUprvXC2SV5jBDFeJCbwlwQ1dMjT1_DnYI4HJ6HvkUpTmmLJh8cwY4SyZOZVBx67mEOiVp7__eSBAPu07vQIG18MgQ_THKBxzVTwJY4Rb3NMaWvSfYmpa3rkJCx14QdsOZXuPAZ4QsqNUPYbCSN9CqxDUUgaBlA5aaCZE2C6SReAH6tRS96X7sqVFnQb7zsjcVonGrzhQD20t0fDXyaPc4YEhjdRHyR5Bzd1tg8ULR7wh5_TjZ-CPDFP; HSID=A4EWWgEnXo0OOGEJN; APISID=fdIFW4vR48s1ohwe/AqWXxMXSy6sWsyLiE");

CookieContainer cookieContainer = new CookieContainer();

CookieManager.AddCookieCollectionToCookieContainer("https://plus.google.com", cookieContainer, cookieCollection);

_crawler = new Crawler(CrawlMode.DepthFirstByPriority, cookieContainer, null, false);

#endregion

But we are not able to capture the required content from the authenticated web page.

3. You mentioned using _crawler.CredentialCache. How should we use this? Is there any example?

4. If we crawl content from more than one authenticated site (Facebook, Twitter and Google Plus), do we need to write a separate cookie collection for each? Can you help us as soon as possible?


arachnode.net replied:

Sounds like you are doing everything right.

You can add other cookies to the cookie collection.  AN will automatically send the correct cookie.
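For example, something along these lines (a sketch only - the cookie strings are placeholders; build one collection per site and add them all to the same container):

// Sketch: one CookieContainer shared by the Crawler, one cookie collection per authenticated site.
// The cookie strings are placeholders - capture the real values as you did for Google Plus.
CookieContainer cookieContainer = new CookieContainer();

CookieCollection googlePlusCookies = CookieManager.BuildCookieCollection("SID=placeholder; HSID=placeholder");
CookieManager.AddCookieCollectionToCookieContainer("https://plus.google.com", cookieContainer, googlePlusCookies);

CookieCollection facebookCookies = CookieManager.BuildCookieCollection("cookieName1=placeholder; cookieName2=placeholder");
CookieManager.AddCookieCollectionToCookieContainer("https://www.facebook.com", cookieContainer, facebookCookies);

// One container is enough - AN sends the cookies that match the host of each request.
_crawler = new Crawler(CrawlMode.DepthFirstByPriority, cookieContainer, null, false);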

Set the crawler to a single thread and step through the crawl pipeline. Take a look at the Fiddler proxy as well - this will help determine what you may be missing.

So, you aren't being logged into your Plus account?  Are there any exceptions in the Exceptions table?

http://msdn.microsoft.com/en-us/library/system.net.credentialcache.aspx


Dinesh replied on Mon, Jun 10 2013 12:24 PM

Hi Mike,

Thanks for your quick reply.

I logged into my Plus account while Fiddler was running, took the cookie value from Fiddler, and used it in Program.cs as mentioned above.

I didn't see any exceptions in the Exceptions table.

When I look at the crawlRequest.DecodedHtml data in HTML view, it shows the sign-in page. It means we are unable to log into the Plus account using the cookie info that we are passing from Program.cs. Please help as soon as possible.

arachnode.net replied:

Capture a Fiddler session with the browser, then capture a Fiddler session with AN and compare - what is different?


arachnode.net replied:

Looking at the Fiddler session you'll notice that the headers differ when making the initial CONNECT request.

There is a bug in the .NET HttpWebRequest class (http://stackoverflow.com/questions/2378031/set-user-agent-header-during-http-connect-over-ssl). If you happen to find a fix for this, please let me know.

Use the Renderers (in the CrawlRequest constructor...) - that works fine. :)

To get the appropriate cookie value, use the Automator and grab the cookie value from the TextBox below the WebBrowser control. Click 'Validate' and select a proxy first. Press 'Enter' after you have entered the web address => http://plus.google.com - log in, grab the cookie value, set the cookie as you have been doing, and enable the Renderers in the Crawler constructor - works fine.
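Putting it together, roughly (a sketch - it assumes the trailing bool in the Crawler constructor you are already calling is the 'enableRenderers' parameter; check Console\Program.cs for the exact overload):

// Sketch only: cookie captured via the Automator, renderers enabled in the Crawler constructor.
CookieCollection cookieCollection = CookieManager.BuildCookieCollection("<cookie string copied from the Automator TextBox>");

CookieContainer cookieContainer = new CookieContainer();
CookieManager.AddCookieCollectionToCookieContainer("https://plus.google.com", cookieContainer, cookieCollection);

// Assumption: the last parameter is 'enableRenderers' (it was 'false' in your earlier snippet).
_crawler = new Crawler(CrawlMode.DepthFirstByPriority, cookieContainer, null, true);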

Also, get the latest from SVN - I fixed a bug (which you haven't encountered) that makes getting HTML from Google faster when debugging.

Thanks. 


Dinesh replied on Thu, Jun 13 2013 10:05 AM

Hi Mike,

What do you mean by the Automator? Is it a software app?

arachnode.net replied:

I'm guessing it's something in the AN solution. ;)


Dinesh replied on Mon, Jun 17 2013 11:58 AM

Hi Mike, many thanks for your help.

I am able to grab the cookie value from the Automator, but how do I enable/use the Renderers in the Crawler constructor?


arachnode.net replied:

It's a constructor parameter named 'enableRenderers'. :)

https://www.google.com/webhp?sourceid=chrome-instant&ion=1&ie=UTF-8#sclient=psy-ab&q=c%23%20constructor&oq=&gs_l=&pbx=1&fp=82526f73ae331c90&ion=1&bav=on.2,or.r_cp.r_qf.&bvm=bv.47883778,d.aWM&biw=1601&bih=901


Dinesh replied on Mon, Jun 17 2013 12:29 PM

Hi Mike,

We know what a constructor is. :) Please don't provide any Google links; we are expecting specific help regarding the AN crawler, not general definitions. I hope you can understand. :) I can see the below parameters being passed from Program.cs:

new CrawlRequest(new Discovery(absoluteUri2), depth, restrictCrawlTo, restrictDiscoveriesTo, 1, renderType, renderTypeForChildren)

You want me to pass another parameter, enableRenderers (i.e. true), to the constructor, but how do I assign it to the CrawlRequest in the CrawlRequest constructor? I didn't see any enableRenderers property on CrawlRequest. Can you help with this?

arachnode.net replied:

Look at Console\Program.cs (actually, please read this entire file).

You'll notice that setting 'enableRenderers' in the Crawler constructor sets the two rendering parameters (renderType / renderTypeForChildren) in the CrawlRequest constructor, per the logic in Console\Program.cs.
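For illustration (the parameter order comes from the CrawlRequest call you posted; the RenderType member names are assumptions - check the RenderType enum in the solution):

// Sketch: when 'enableRenderers' is true, Console\Program.cs chooses rendering values
// and passes them to the CrawlRequest constructor - there is no property to set on the
// CrawlRequest afterwards.
RenderType renderType = RenderType.Render;              // assumed member name
RenderType renderTypeForChildren = RenderType.Render;   // assumed member name

CrawlRequest crawlRequest = new CrawlRequest(
    new Discovery(absoluteUri2),
    depth, restrictCrawlTo, restrictDiscoveriesTo, 1,
    renderType, renderTypeForChildren);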

http://arachnode.net/search/SearchResults.aspx?q=Renderers

http://arachnode.net/search/SearchResults.aspx?q=RenderType*


Dinesh replied on Tue, Jun 18 2013 2:18 PM

Hi Mike,

We are able to see the logged-in (authenticated) Google Plus page in the PerformAction method, but we are getting unexpected errors and unnecessary windows are being created, as shown in the screenshot below. What could be the reason?

