arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE
An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop

Completely Open Source @ GitHub

Crawl facebook.com, twitter.com and linkedin.com...

First, download the Fiddler web debugging proxy if you haven't already, and start it.

http://www.fiddler2.com/fiddler2/

Second, log in to the site you want to crawl using a browser.

In Fiddler, find the cookie value that the browser passes to the site.  In this example, the site is LinkedIn.

Navigate to Console\Program.cs and locate the following section of code:

Enter your cookie value in place of the one shown.

arachnode.net will use the values in Crawler.CookieContainer to log in to sites that require a login, and will dynamically manage all cookie interaction.
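The cookie value copied from Fiddler is a single string of name=value pairs separated by semicolons. The sketch below is not arachnode.net's own code; it only shows one way to split such a string into a standard System.Net.CookieContainer. The class name CookieHeaderLoader and its Load method are hypothetical names introduced for illustration.

```csharp
using System;
using System.Net;

// Hypothetical helper, not part of arachnode.net.
public static class CookieHeaderLoader
{
    // Parses a raw Cookie header value ("name1=value1; name2=value2")
    // into a CookieContainer scoped to the given domain.
    public static CookieContainer Load(string cookieHeader, string domain)
    {
        var container = new CookieContainer();

        foreach (string pair in cookieHeader.Split(';'))
        {
            string trimmed = pair.Trim();
            int separator = trimmed.IndexOf('=');

            // Skip empty or malformed entries that have no name.
            if (separator <= 0)
            {
                continue;
            }

            string name = trimmed.Substring(0, separator);
            // Everything after the first '=' is the value, so values
            // that themselves contain '=' are preserved.
            string value = trimmed.Substring(separator + 1);

            // Note: System.Net.Cookie may reject unquoted values that
            // contain ',' or ';'. Such values are not handled here.
            container.Add(new Cookie(name, value, "/", domain));
        }

        return container;
    }
}
```

For example, CookieHeaderLoader.Load("name1=value1; name2=value2", "www.linkedin.com") returns a container holding two cookies for www.linkedin.com. How those cookies are handed to Crawler.CookieContainer depends on the code in Console\Program.cs, which is not reproduced here.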

Place a breakpoint in SiteCrawler\Crawl.cs at the location shown and view the crawlRequest.DecodedHtml property in Html view to confirm that you have logged in to the site successfully.

Results for Facebook:

Facebook uses a large amount of JavaScript to render content.  To fully render the page, use the Renderers.


Posted Thu, Jun 2 2011 3:53 PM by arachnode.net

Comments

arachnode.net wrote re: Crawl facebook.com, twitter.com and linkedin.com...
on Wed, Mar 13 2013 3:25 PM

Ensure that the Domain of the cookie you specify matches the site's host exactly.  facebook.com redirects to www.facebook.com, so use 'www.facebook.com' as the Domain parameter in CookieManager.cs.
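A quick way to check the Domain before crawling is to ask a System.Net.CookieContainer which cookies it would send to the final, post-redirect URL. This is a minimal sketch using only standard .NET types. The class CookieDomainCheck, its methods, and the cookie name and value are hypothetical, and the sketch does not use arachnode.net's CookieManager.cs.

```csharp
using System;
using System.Net;

// Hypothetical helper, not part of arachnode.net.
public static class CookieDomainCheck
{
    // Returns how many cookies in the container would be sent
    // with a request to the given absolute URL.
    public static int CountCookiesFor(CookieContainer container, string url)
    {
        return container.GetCookies(new Uri(url)).Count;
    }

    // Example: a cookie scoped to the host the site redirects to.
    public static void Run()
    {
        var container = new CookieContainer();

        // Placeholder name and value; the Domain is the post-redirect host.
        container.Add(new Cookie("session", "placeholder", "/", "www.facebook.com"));

        Console.WriteLine(CountCookiesFor(container, "https://www.facebook.com/"));
        Console.WriteLine(CountCookiesFor(container, "https://facebook.com/"));
    }
}
```

If the count for the URL the crawler actually ends up requesting is zero, the Domain on the cookie does not match that host and the login will not carry over.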


copyright 2004-2017, arachnode.net LLC