arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE or MongoDB/RavenDB/Hadoop

Completely Open Source @ GitHub

Common Questions

I frequently receive mail/posts asking how to do something similar to a request I received this morning:

(1) Start an initial crawl of one website to retrieve page content, meta title, keywords, description, etc., and outbound links to other pages on the site as well as links to external websites.

(2) Additionally, crawl any external websites found during the initial crawl, retrieving the same data as in point (1), including any new domains found.

(3) Store this data in a database.

I then want to be able to report on the following to start with, expanding further in the future...

(4) Outgoing links from a page.

(5) Incoming links to a page.

(6) Incoming/Outgoing link text (anchor text).

How is this accomplished?

1, 2, 3.) Work through the installation steps found here:  http://arachnode.net/Content/InstallationInstructions.aspx  Next, examine Console\Program.cs.  This file creates CrawlRequests from AbsoluteUris.  The example crawl starts at abc.com and is allowed to crawl anywhere within a depth of '2'.  If the crawl does not discover WebPages or HyperLinks outside of the 'abc.com' domain, try increasing the depth or adding additional CrawlRequests.  In this example, the demo restricts the crawl to the first 15 HyperLinks found on a page, so we should end up with 15 pages in the WebPages table at a depth of '3'.  Note the settings changes from the default demo installation (shown highlighted in yellow in the original post).

We want to extract MetaData from our WebPages, which in the context of AN means parsing each page to xhtml and extracting its text content.  Additionally, we want to insert HyperLinks and HyperLinkDiscoveries so we can source our WebPages and report on the link graph.  We're electing NOT to store WebPages on disk, since we'll be inserting the WebPage source into the 'Source' column of the 'WebPages' table.  Finally, run the 'Console' project and let the crawl finish.  Reset the database when prompted to do so.
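In Console\Program.cs, submitting a CrawlRequest looks roughly like the sketch below.  This is a minimal sketch, not the exact demo code: the namespaces, the CrawlMode value, and the Crawler/CrawlRequest/Discovery parameter lists are assumptions based on the description above, so verify them against the Console\Program.cs in your checkout.

//a minimal sketch - the namespaces and constructor arguments below are
//assumptions; verify them against Console\Program.cs in your checkout...
using Arachnode.SiteCrawler;
using Arachnode.SiteCrawler.Value;
using Arachnode.SiteCrawler.Value.Enums;

Crawler crawler = new Crawler(CrawlMode.DepthFirstByPriority, false);

//start at abc.com, allowed to crawl anywhere within a depth of '2' - if no
//WebPages or HyperLinks outside of abc.com are discovered, increase the depth
//or submit additional CrawlRequests...
crawler.Crawl(new CrawlRequest(new Discovery("http://abc.com/"), 2));

The demo project wires up the settings described above and starts the crawl once the requests are queued.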

Full page content is stored in the 'Source' column of the 'WebPages' table.  For the purposes of this demo, a SQL query is used; the source code in SVN also contains methods to extract the full source.

Select CAST(Source as varchar(max)) From WebPages

The 'WebPages_MetaData' table contains valid xhtml against which we can issue XPath queries.

 

--WP_MD (WebPages_MetaData) extracts the text for you...

Select Cast([TEXT] as varchar(max)) From WebPages_MetaData

 

--these functions perform the extraction that populates the WP_MD table... (use them to determine keyword densities... :))

Select * From dbo.ExtractWords('These are words and words are all they are...', 0, 0)

--duplicate words removed...

Select * From dbo.ExtractWords('These are words and words are all they are...', 0, 1)
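If you'd rather compute keyword densities in code than in T-SQL, the same counting can be done with LINQ.  The sketch below is plain C# with no arachnode.net types; the word-breaking regex is a rough approximation of what dbo.ExtractWords does, not its exact rules.

using System;
using System.Linq;
using System.Text.RegularExpressions;

string text = "These are words and words are all they are...";

//split on runs of non-letters - a rough approximation of dbo.ExtractWords' word breaking...
string[] words = Regex.Split(text.ToLowerInvariant(), "[^a-z]+")
    .Where(w => w.Length > 0)
    .ToArray();

//density = occurrences of a word / total word count...
var densities = words.GroupBy(w => w)
    .Select(g => new { Word = g.Key, Density = (double)g.Count() / words.Length })
    .OrderByDescending(d => d.Density);

foreach (var d in densities)
{
    Console.WriteLine("{0}: {1:P1}", d.Word, d.Density);
}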

 

--here's how to get the data you want: since the page is parsed to valid xhtml, we can use XPath to get the elements we need...

Select Top 10 WebPageID, AbsoluteUri, [Xml].query('(.//*[local-name()="meta"][lower-case(@name)="keywords"])')
From WebPages_MetaData wpmd
Join WebPages wp on wpmd.WebPageID = wp.ID
Order by WebPageID

Select Top 10 WebPageID, AbsoluteUri, [Xml].value('(.//*[local-name()="meta"][lower-case(@name)="keywords"]/@content)[1]', 'nvarchar(MAX)') [Content]
From WebPages_MetaData wpmd
Join WebPages wp on wpmd.WebPageID = wp.ID
Order by WebPageID

Select Top 10 WebPageID, AbsoluteUri, [Xml].value('(.//*[local-name()="meta"][lower-case(@name)="description"]/@content)[1]', 'nvarchar(MAX)') [Content]
From WebPages_MetaData wpmd
Join WebPages wp on wpmd.WebPageID = wp.ID
Order by WebPageID

--this is a good option if you don't want to code a plugin...

 

Another way to accomplish meta content extraction is to follow the Plugin instructions here: http://arachnode.net/Content/CreatingPlugins.aspx  Parse crawlRequest.DecodedHtml using an HTML-to-XML parser (like the HtmlAgilityPack) and issue the XPath queries provided above.
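A minimal sketch of that approach follows.  The HtmlAgilityPack calls are standard; the ACrawlAction base class and the PerformAction override are taken from the plugin instructions linked above, so treat the exact base-class and namespace names as assumptions to verify against your version of AN.

using Arachnode.Plugins.CrawlActions;  //assumed namespace - see CreatingPlugins.aspx...
using Arachnode.SiteCrawler.Value;     //assumed home of CrawlRequest...
using HtmlAgilityPack;

public class MetaDataExtractor : ACrawlAction
{
    //any other abstract members of the base class are omitted for brevity...
    public override void PerformAction(CrawlRequest crawlRequest)
    {
        HtmlDocument htmlDocument = new HtmlDocument();
        htmlDocument.LoadHtml(crawlRequest.DecodedHtml);

        //the same elements the XPath queries above target, extracted in-process...
        HtmlNode keywords = htmlDocument.DocumentNode.SelectSingleNode("//meta[@name='keywords']");
        HtmlNode description = htmlDocument.DocumentNode.SelectSingleNode("//meta[@name='description']");

        string keywordsContent = keywords != null ? keywords.GetAttributeValue("content", string.Empty) : string.Empty;
        string descriptionContent = description != null ? description.GetAttributeValue("content", string.Empty) : string.Empty;

        //store or report the extracted values as needed...
    }
}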

4.) Select * From HyperLinks as hl
Join HyperLinks_Discoveries as hld on hl.ID = hld.HyperLinkID
Where hld.WebPageID in (Select ID From WebPages Where AbsoluteUri = 'http://abc.com/')

Select * From HyperLinks as hl
Join HyperLinks_Discoveries as hld on hl.ID = hld.HyperLinkID
Where hld.WebPageID in (Select ID From WebPages Where AbsoluteUri = 'http://abc.go.com/')

5.) Select * From WebPages as wp
Join HyperLinks_Discoveries as hld on wp.ID = hld.WebPageID
Join HyperLinks as hl on hl.ID = hld.HyperLinkID
Where hl.AbsoluteUri = 'http://abc.go.com/'

6.) The rest will come tomorrow - need rest for today...  :)

 


Posted Thu, Sep 20 2012 5:50 PM by arachnode.net
Filed under: An Open Source C# web crawler with Lucene.NET search using SQL 2008/2012/CE

Copyright 2004-2017, arachnode.net LLC