arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL 2005/2008/CE
Crawl facebook.com, twitter.com and linkedin.com...

First, download Fiddler Web Proxy if you haven't already, and start Fiddler.

http://www.fiddler2.com/fiddler2/

Second, log in to the site you want to crawl using a browser.

Third, in Fiddler, find the cookie value that the browser sends to the site. In this example, the site is LinkedIn.

Navigate to Console\Program.cs and locate the following section of code:

Enter your cookie value in place of the one shown.

arachnode.net uses the values in Crawler.CookieContainer to authenticate against sites that require a login, and dynamically manages all subsequent cookie interaction.
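To illustrate what supplying a captured cookie amounts to, here is a minimal sketch using the standard System.Net.CookieContainer class. It assumes that Crawler.CookieContainer behaves like a regular .NET CookieContainer, which the post does not confirm. The cookie name, the value and the BuildContainer helper are hypothetical placeholders, not arachnode.net code.

```csharp
using System;
using System.Net;

static class CookieSketch
{
    // Hypothetical helper: builds a container holding one cookie copied from Fiddler.
    public static CookieContainer BuildContainer(string name, string value, string domain)
    {
        var container = new CookieContainer();
        // Path "/" makes the cookie apply to every URL on the given domain.
        container.Add(new Cookie(name, value, "/", domain));
        return container;
    }

    static void Main()
    {
        // Placeholder name and value: copy the real pair from the Cookie
        // request header that Fiddler shows for your logged-in session.
        CookieContainer container = BuildContainer(
            "session_cookie", "PASTE_VALUE_FROM_FIDDLER", "www.linkedin.com");

        // Prints the Cookie header that would accompany a request to this URL.
        Console.WriteLine(container.GetCookieHeader(new Uri("https://www.linkedin.com/")));
    }
}
```

If the printed header is empty, the cookie's domain or path does not match the URL, and a request to that URL would go out without the login cookie.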

Place a breakpoint in SiteCrawler\Crawl.cs at the location shown, then inspect the crawlRequest.DecodedHtml property in Html view to confirm that you have logged in to the site successfully.

Results for Facebook:

Facebook uses a large amount of JavaScript to render content.  To fully render the page, use the Renderers.

Posted Thu, Jun 2 2011 1:53 PM by arachnode.net

Comments

arachnode.net wrote re: Crawl facebook.com, twitter.com and linkedin.com...
on Wed, Mar 13 2013 3:25 PM

Ensure that the Domain of the cookie you specify exactly matches the host that is actually crawled.  facebook.com redirects to www.facebook.com, so use 'www.facebook.com' as the Domain for the CookieManager.cs parameter.
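A quick way to check a Domain value before crawling is to ask a standard System.Net.CookieContainer whether it would send the cookie to a given URL. This is a sketch only: it assumes arachnode.net's cookie handling follows the same matching rules, and the IsSentTo helper and the cookie name and value are hypothetical.

```csharp
using System;
using System.Net;

static class CookieDomainCheck
{
    // Hypothetical helper: true if the container holds at least one cookie
    // that would be sent with a request to the given URL.
    public static bool IsSentTo(CookieContainer container, string url)
    {
        return container.GetCookies(new Uri(url)).Count > 0;
    }

    static void Main()
    {
        var container = new CookieContainer();
        // Domain set to the host the site redirects to, as the comment above advises.
        container.Add(new Cookie("session_cookie", "PLACEHOLDER", "/", "www.facebook.com"));

        // Compare the host that is actually crawled with the bare domain.
        Console.WriteLine(IsSentTo(container, "https://www.facebook.com/"));
        Console.WriteLine(IsSentTo(container, "https://facebook.com/"));
    }
}
```

Running the same check with the Domain you intend to use shows whether the cookie would reach the post-redirect host.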

copyright 2004-2013, arachnode.net LLC