arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL 2005/2008/CE

Partial crawling


Top 10 Contributor
229 Posts
megetron posted on Sun, May 3 2009 9:49 AM

Hello again,

Is there any way to crawl a specific site partially and store only the relevant data in the database?

For example, say there is a site, WWW.AA.COM, whose pages consist of lots of links, one <OBJECT> HTML element, and text. The rest of the HTML is design markup meant only to make AA.COM look nice for visitors.

I don't want to store the design HTML in the crawl database. I want only the <OBJECT> code and the text, and maybe a JPG or two.

If the whole website follows a static structure, how can I get just the specific data I want?

Is it possible?

Answered (Verified) Verified Answer

Top 10 Contributor
1,692 Posts

Kevin is right on #1 - check out this file: http://arachnodenet.svn.sourceforge.net/viewvc/arachnodenet/trunk/SiteCrawler/Managers/WebPageManager.cs?revision=167&view=markup

On line 80, if you have 'ExtractWebPageMetaData' set to true in the Configuration table in the database, then AN (arachnode.net) will create an HtmlDocument (HtmlAgilityPack), which you can query with XPath to get the data you want.

Then move line 47 so that it follows the block beginning at line 49; that way you can modify the web page source before it is submitted to the database.

internal void ManageWebPage(CrawlRequest crawlRequest)
   45         {
   46
   47             // (the InsertWebPage call that was on this line has been moved below the if block)
   48
   49             if (crawlRequest.Discovery.ID.HasValue)
   50             {
   51                 ManagedWebPage managedWebPage = ManageWebPage(crawlRequest.Discovery.ID.Value, crawlRequest.Discovery.Uri.AbsoluteUri, crawlRequest.Data, crawlRequest.DataType.FullTextIndexType, ApplicationSettings.ExtractWebPageMetaData, ApplicationSettings.InsertWebPageMetaData, ApplicationSettings.SaveDiscoveredWebPagesToDisk);
   52
   53                 crawlRequest.ManagedDiscovery = managedWebPage;
   54             }

                  // Line 47, moved here: modify the page source before this call so the modified version is what gets stored.
                  crawlRequest.Discovery.ID = _arachnodeDAO.InsertWebPage((int) SubmittedBy.User, crawlRequest.Discovery.Uri.AbsoluteUri, crawlRequest.WebClient.ResponseHeaders.ToString(), ApplicationSettings.InsertWebPageSource ? crawlRequest.Data : new byte[] {}, crawlRequest.DataType.FullTextIndexType, crawlRequest.Depth);

   55         }

Does this make sense?  Feel free to ask any additional questions.

(A caveat is that HtmlAgilityPack is a memory hog...)
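To make the XPath idea above concrete, here is a minimal sketch of pulling only the <OBJECT> elements and the paragraph text out of a page with HtmlAgilityPack. It is not code from arachnode.net: the class name PartialPageExtractor, the method name, and the two XPath expressions ("//object" and "//p") are assumptions chosen for illustration, and a real site will need expressions that match its own layout.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party library, referenced separately

// Hypothetical helper, not part of arachnode.net.
public static class PartialPageExtractor
{
    // Keeps the outer HTML of every <object> element and the text of every <p>,
    // and drops everything else on the page.
    public static string ExtractRelevantHtml(string pageSource)
    {
        var document = new HtmlDocument();
        document.LoadHtml(pageSource);

        var parts = new List<string>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection objects = document.DocumentNode.SelectNodes("//object");
        if (objects != null)
        {
            foreach (HtmlNode node in objects)
            {
                parts.Add(node.OuterHtml);
            }
        }

        HtmlNodeCollection paragraphs = document.DocumentNode.SelectNodes("//p");
        if (paragraphs != null)
        {
            foreach (HtmlNode node in paragraphs)
            {
                // InnerText still contains entities such as &amp;, so decode them.
                parts.Add(HtmlEntity.DeEntitize(node.InnerText).Trim());
            }
        }

        return string.Join(Environment.NewLine, parts.ToArray());
    }
}
```

The returned string could then be converted with Encoding.UTF8.GetBytes and stored in place of the original page source, assuming crawlRequest.Data can be assigned at that point; that wiring is a guess and is not confirmed by this thread.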

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies

Top 10 Contributor
101 Posts
Kevin replied on Mon, May 4 2009 7:54 AM

Definitely possible. The question is how much work it is worth to you. Here are some ideas, and maybe others will post as well.

1. Leave the crawler as a crawler and customize outside it with a tool like HtmlAgilityPack. Let the crawler collect the pages you want and store them either in the database or in disk files. Then, outside the crawler itself, write code to parse that data and collect the nuggets you want. For example, HtmlAgilityPack makes it very easy to grab object tags from the source and do what you want with them.

2. Harder but more elegant: write a custom crawler plug-in, like the lucenedotnetindex plug-in, that parses out what you want and stores only that during the crawl. This is really only harder in terms of finding good samples to work from, but I wonder if the lucenedotnetindex plug-in could be used as a model.

Does that help any?

 


Top 10 Contributor
229 Posts

Love it. I will give it a try. Given this, I guess I will need to change the code for each different website, since each has a different structure.
Correct?

Can you give examples of XPath queries I could use?

Let's say I want to crawl YouTube, and I know that all pages are structured the same way as this URL: http://www.youtube.com/watch?v=U3ltjPODZYM&feature=popular

On this page I will find:
1. One OBJECT tag.
2. One title ("Chiara's first rehearsal (impression) at the 2009 Eurovision Song Contest")

What kind of XPath should I query to collect the specific data and store it in the database, ignoring the rest?
How can I tell which text on the page is the title and which is regular text?

Thanks for the help.

Top 25 Contributor
16 Posts

Dear author,

Thank you for your great work!

I would like to know whether arachnode.net can crawl not only HTML, JPG files, and email addresses, but also PDF, DOC, and other file formats.

If it can, where do I set the related options? Thank you for your reply!

Top 25 Contributor
16 Posts

I am a novice with arachnode.net. I want to crawl specific content, such as title, author, and abstract, from the site http://www.sciencedirect.com. A detailed URL follows:

http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B7576-4TK47D8-1&_user=10&_coverDate=08%2F31%2F2009&_rdoc=2&_fmt=high&_orig=browse&_srch=doc-info(%23toc%2312890%232009%23999929993%231040073%23FLA%23display%23Volume)&_cdi=12890&_sort=d&_docanchor=&_ct=18&_acct=C000050221&_version=1&_urlVersion=0&_userid=10&md5=85537eb8364a9f16742aeefe07a2aad3

1. Where do I set the target URL to be crawled, and how do I extract the content of specific fields?

2. Where is the console application for crawling in this project, and how do I run it?

3. How do I save the title, author, and abstract field content?

4. How do I index the content saved in step 3?

Thank you in advance for your reply!

 

Top 10 Contributor
1,692 Posts

I can answer these tonight.  Thanks for your patience.  :)


Top 10 Contributor
1,692 Posts

Yes. You will need to apply a specific XPath to each host and directory path.

1.) I'm not seeing an explicit OBJECT tag.  Do you mean one that is rendered by JavaScript?

2.) /html/body/div[@id='baseDiv']/div[@id='watch-vid-title']/h1
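One way to keep a separate XPath per host, as described above, is a simple lookup table keyed by host name. The sketch below is an illustration only: the XPathRules class and its FindXPath method are hypothetical names, not arachnode.net APIs, and the single entry reuses the YouTube title path from this post.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of arachnode.net.
public static class XPathRules
{
    // Host name -> XPath for the data of interest on that host.
    private static readonly Dictionary<string, string> RulesByHost =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "www.youtube.com", "/html/body/div[@id='baseDiv']/div[@id='watch-vid-title']/h1" }
        };

    // Returns the XPath registered for the URI's host, or null when none is registered.
    public static string FindXPath(Uri absoluteUri)
    {
        string xpath;
        return RulesByHost.TryGetValue(absoluteUri.Host, out xpath) ? xpath : null;
    }
}
```

Matching on directory paths as well could be added by also comparing absoluteUri.AbsolutePath against a stored prefix; that part is left out here to keep the sketch short.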

Check out 'xpather' for Firefox: https://addons.mozilla.org/en-US/firefox/addon/1192

Title is fairly easy as you can query the HtmlDocument for the 'title' tag.  You can even use a regular expression to do this.  See ManageLuceneDotNetIndexes._title for an example of how to do this.
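As a rough illustration of the regular-expression route for the title, the sketch below uses only the .NET base class library. It is not the pattern used by ManageLuceneDotNetIndexes._title, which is not shown in this thread; the TitleExtractor class and its regular expression are assumptions.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of arachnode.net.
public static class TitleExtractor
{
    // Singleline lets "." match newlines, since titles are sometimes wrapped across lines.
    private static readonly Regex TitleRegex = new Regex(
        @"<title[^>]*>(?<title>.*?)</title>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Compiled);

    // Returns the decoded, trimmed text of the first <title> element, or null if there is none.
    public static string GetTitle(string pageSource)
    {
        Match match = TitleRegex.Match(pageSource);
        if (!match.Success)
        {
            return null;
        }

        return WebUtility.HtmlDecode(match.Groups["title"].Value).Trim();
    }
}
```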

Regular text is more difficult. We have code in the works that programmatically determines which text is the 'meat' of the page, but there is no ETA on when it will be released.


Top 10 Contributor
1,692 Posts

You are very welcome.

It can, but indexing is currently limited to text (on the Lucene.NET side, not the SQL full-text indexing side).

Check the table AllowedDataTypes.


Top 10 Contributor
229 Posts

arachnode.net:

Check out 'xpather' for FireFox: https://addons.mozilla.org/en-US/firefox/addon/1192

Great tool, thanks. One question: when I extract a path using XPather, it contains some TBODY tags, but to use that XPath with the HtmlDocument object I have to remove the TBODY tags; only then is the path found. Why is that?
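A likely explanation, though nobody confirms it in this thread: browsers add a TBODY element to their DOM for tables whose markup omits it, and browser tools build XPaths from that DOM, whereas HtmlAgilityPack parses the raw source as written. A minimal sketch of stripping those steps from a browser-generated XPath follows; the XPathCleaner class is a hypothetical name, and the approach assumes the page source really has no tbody tags of its own.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of arachnode.net or HtmlAgilityPack.
public static class XPathCleaner
{
    // Matches a "/tbody" step, with an optional positional predicate such as "[1]".
    private static readonly Regex TbodyStep = new Regex(
        @"/tbody(\[\d+\])?(?=/|$)", RegexOptions.IgnoreCase);

    // Removes every tbody step from the XPath and leaves the other steps untouched.
    public static string RemoveTbodySteps(string xpath)
    {
        return TbodyStep.Replace(xpath, string.Empty);
    }
}
```

If the source does contain explicit tbody tags, removing the step would break the path, so it is worth checking the raw HTML before applying this.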

Top 10 Contributor
229 Posts

Hi,

I need help with XPath. How do I retrieve the value of an embed element's attribute? (<embed src="somefile.ext")

I tried path=/html/body/embed/@src and path=/html/body/embed[@src].

XPather doesn't know how to handle embedded Flash files.

 

Top 10 Contributor
1,692 Posts

Which AbsoluteUri?  http://???


Top 10 Contributor
229 Posts

Let's say I want to pull data from this embed element:
<embed style="WIDTH: 532px; HEIGHT: 346px" pluginspage="http://www.macromedia.com/go/getflashplayer" src="http://wwwstatic.megavideo.com/mv_player.swf?image=http://www.10net.co.il/image/users/108752/ftp/my_files/8887777.png&amp;v=XFPX1I6Z" width="456" height="311" type="application/x-shockwave-flash" allowfullscreen="true" play="true" loop="true" menu="true"></embed>

The string inside the src attribute is http://wwwstatic.megavideo.com/mv_player.swf?image=http://www.10net.co.il/image/users/108752/ftp/my_files/8887777.png&amp;v=XFPX1I6Z

From this string I want to get only "v=XFPX1I6Z".

How can I achieve that? The AbsoluteUri is: http://www.10net.co.il/108752/%D7%9E%D7%99-%D7%90%D7%91%D7%90-%D7%A9%D7%9C%D7%9A-

I need help with that.
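The thread does not answer this question, so the following is only a sketch of one possible approach. Once the src value is in hand (with HtmlAgilityPack, for instance, by selecting the element with "//embed[@src]" and calling GetAttributeValue("src", "") on it, since as far as I know its XPath selection returns elements rather than attribute values), the v parameter can be pulled out with the base class library alone. The EmbedSrcParser class is a hypothetical name.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of arachnode.net.
public static class EmbedSrcParser
{
    // Returns the value of the named query-string parameter, or null when it is absent.
    public static string GetQueryValue(string srcAttribute, string name)
    {
        // Attribute values copied from HTML may still contain entities such as "&amp;".
        string decoded = WebUtility.HtmlDecode(srcAttribute);

        // The parameter must follow "?" or "&"; its value runs up to the next "&" or "#".
        Match match = Regex.Match(
            decoded, "[?&]" + Regex.Escape(name) + "=(?<value>[^&#]*)");

        return match.Success ? match.Groups["value"].Value : null;
    }
}
```

For the src value quoted above, GetQueryValue(src, "v") yields the part after "v=", and prefixing it with "v=" gives the string asked for.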

 

Top 10 Contributor
1,692 Posts

QUICK UPDATE: .doc/.pdf file indexing and search capabilities are coming in Version 1.4, to be released shortly.




copyright 2004-2013, arachnode.net LLC