arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE or MongoDB/RavenDB/Hadoop


Plugin help

Answered (Verified) This post has 1 verified answer | 20 Replies | 2 Followers

Top 10 Contributor
229 Posts
megetron posted on Sun, Aug 2 2009 1:24 PM

hello,

I wish to build a new plugin for certain logic I have in mind.
I think that Templater.cs is the best sample for me because I wish to collect data from specific tags.

I am not sure what the Templater plugin does.

Furthermore, can I choose where to store the collected data? Let's say we crawl pages and collect each image URL and its ALT tag, and I want to save each piece of data in a different database column. Is that possible?

Thanks for the help.


All Replies

Top 10 Contributor
1,905 Posts

Templater is a piece of code that can look at a webpage and extract the 'meat' of the page - it can look at a blog site and tell you which XPath will select the main post or the titles, or, looking at a forum site, which posts are the forum posts.  It basically solves a tough problem in web scraping - how do you ignore common elements and get right to the text you want to analyze?

You can use any of the plugins as a starting point - they all extend ICrawlAction.cs.  Anonymizer.cs is the simplest to look at, because it contains the least code.

Your best option for storing alt tags, etc. is to create a new plug-in and a new database table, use the existing ArachnodeDAO, create a new stored procedure that submits to your database table, and submit your alt tags, etc. from the plug-in.  The CrawlRequest.cs object contains CrawlRequest.Discovery.ID, which is the ID of the Image (you'll need to check Discovery.Type to make sure it is an image); you will need to submit this ID to the database so you can associate which alt tags, etc. belong with which image.

Create the plug-in, and the database table, and post them here if you have questions.

Yeah - create your plug-in, but without any database code - let's get that looking good first - and then I can help you with getting the data into the database.
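As a very rough sketch of what such a plug-in's PerformAction could look like (the HtmlAgilityPack calls are real, but the class shape and the CrawlRequest members beyond Discovery.ID, Discovery.Type and DecodedHtml are assumptions, not the actual arachnode.net API):

// Hypothetical alt-tag collector, modeled on the advice above - a sketch, not working arachnode.net code.
// It would extend the crawl action base class that the other plugins (e.g. Anonymizer.cs) extend.
public class AltTagCollector
{
    public void PerformAction(CrawlRequest crawlRequest)
    {
        // Parse the downloaded page and pull out every <img> that carries an alt attribute.
        HtmlAgilityPack.HtmlDocument htmlDocument = new HtmlAgilityPack.HtmlDocument();
        htmlDocument.LoadHtml(crawlRequest.DecodedHtml);

        HtmlAgilityPack.HtmlNodeCollection imageNodes = htmlDocument.DocumentNode.SelectNodes("//img[@alt]");

        if (imageNodes == null)
        {
            return;
        }

        foreach (HtmlAgilityPack.HtmlNode imageNode in imageNodes)
        {
            string src = imageNode.GetAttributeValue("src", string.Empty);
            string alt = imageNode.GetAttributeValue("alt", string.Empty);

            // No database code yet, per the advice above.  Later, a custom stored procedure
            // (called through ArachnodeDAO) would be invoked here, passing src, alt and the
            // image's Discovery ID (CrawlRequest.Discovery.ID, after checking Discovery.Type)
            // so the alt text can be associated with the correct Images row.
        }
    }
}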

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
229 Posts

Thanks.

I have created a new database and the new structure that I wish to fill using arachnode.net, and also created a stored procedure for it.
I also created the DAL code to call the stored procedure.  Where is the best place to put this code?  I wish to call the stored procedure from inside the Templater.cs plugin, after making sure the 'meat' is pulled exactly as I want.

Furthermore, I am editing the Templater.cs plugin to fit my needs, so I added this to the PerformAction function to restrict the crawl to the specific pages that are relevant:

 

if (((System.Net.HttpWebRequest)crawlRequest.WebClient.WebRequest).RequestUri.AbsolutePath.StartsWith("/movie-") == false)
    return;

But I still don't understand what this Templater should do. I know it is an alpha version, but I just don't get it.
What is this line meant to do:

xpathInfos = GenerateXPaths(managedWebPage.HtmlDocument.DocumentNode, string.Empty, xpathInfos);

The HtmlDocument object is always null.

And when I execute the GenerateXPaths function I receive an exception:

base {System.SystemException} = {"Object reference not set to an instance of an object."}

Thanks for your help.

Top 10 Contributor
1,905 Posts

The templater code is ALPHA code - not ready for widespread use - basically, if you can figure it out then you can use it, but if not, it's unsupported.  Smile

Create a new plug-in to suit your needs.

If you edit alpha code, and then I finish the plug-in, it is very likely that your changes won't merge with my code and it will break your plugin.

Where is your DAL code stored now?

Mike


Top 10 Contributor
229 Posts

OK, I will try to figure this out without the Templater. This plugin is exactly what I need; can you estimate when it will be in beta so I can test it?

About the DAL code, I didn't put it in the project yet; I just created everything I need. I think I will add it to the ArachnodeDAO class as you suggested.

I have changed the plugin so it will work better. In the IsDisallowed function I added this line:

 

 

 

if (absoluteUri.Value != "http://domain.com/" && absoluteUri.Value.StartsWith("http://domain.com/movie-") == false)
    return true;

When I remove the absoluteUri.Value != "http://domain.com/" && part from the code (because I don't wish to scan the root page), the AbsoluteUri goes into DisallowedAbsoluteUris and the crawl stops. So how can I change the condition to make it work without forcing a crawl of the root page?

Top 10 Contributor
229 Posts

arachnode.net:

The templater code is ALPHA code - not ready for widespread use - basically, if you can figure it out then you can use it, but if not, it's unsupported.  Smile

Maybe this can help make things quicker: the HTML Agility Pack.

http://stackoverflow.com/questions/100358/looking-for-c-html-parser

Please have a look and let me know what you think of it.

Top 10 Contributor
1,905 Posts

arachnode.net already contains support for the HtmlAgilityPack - however, the HtmlAgilityPack is a HUGE memory hog and has an extremely negative impact on crawling rate.  If you can avoid it, don't use it.  If you have to use it, change the configuration setting for 'ExtractWebPageMetaData' in the Configuration database table.

Check out UserDefinedFunctions.ExtractText(...) to see if this will work for you.

If you need to see how the HtmlAgilityPack is implemented, find 'ExtractWebPageMetaData' in WebPageManager.cs.

Oh, and the HtmlAgilityPack does not do what Templater.cs does.

 


Top 10 Contributor
1,905 Posts

I don't have an estimate of when Templater.cs will be ready for production use - there are three new bugs that I need to address before I get to the Templater.

I would suggest not making changes to IsDisallowed, if you can create a plugin that does the same thing - it would be better to modify AbsoluteUri.cs, honestly.

I'm not 100% sure what you are trying to crawl.  Elaborate please?


Top 10 Contributor
229 Posts

I created a new plugin, because the Templater doesn't do what I need, and I edited AbsoluteUri.cs. I use some of the Templater code without the HtmlAgilityPack; simple XPath does the job.

If you are going for speed, you might also want to check out the Majestic-12 HTML parser. Its handling is rather clunky, but it delivers a really fast parsing experience.

Can we use this one?

 

I have completed the plugin; there is still some work to do, but it looks OK.

A question: when crawling a web page, I get an image file that I wish to save to an external directory. I don't find this in the ManageLuceneDotNetIndexes plugin. What exactly are the Lucene.NET indexes? Is this where all of the physical content is kept? I can see folders for files, web pages, etc. I don't really need the files and web pages, only a specific image file for each crawl request. How do I achieve that?

Top 10 Contributor
229 Posts

After extracting the image location (using XPath), how do I save it to the local machine using AN?

Top 10 Contributor
229 Posts

arachnode.net:

Check out UserDefinedFunctions.ExtractText(...) to see if this will work for you.

 

Why use ExtractText when you can use htmlNode.InnerText?

I tested Templater.cs a bit. It is not working, of course, but I changed this line:

HtmlManager.CreateHtmlDocument(crawlRequest.DecodedHtml, Encoding.Default);

to this:

managedWebPage.HtmlDocument = HtmlManager.CreateHtmlDocument(crawlRequest.DecodedHtml, Encoding.Default);

You forgot to set the HtmlDocument object.

Top 10 Contributor
1,905 Posts

Templater.cs is ALPHA code.

1.) HtmlAgilityPack is a memory hog, and should only be used when you absolutely need it.  It is used in the templater code because I need XPATH support.

2.) ExtractText does a much, much better job of stripping out tags than the HtmlAgilityPack does, and it's faster as well.

In the case of what you are trying to accomplish, would a regular expression work?  The Templater will be used when you have a large number of sites and it isn't feasible to write a large number of XPaths or regular expressions.
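For what it's worth, here is a small illustration of the kind of regular expression this could mean, pulling image URLs and alt text out of a page. It is only a sketch; regular expressions are brittle against arbitrary HTML, so the pattern would need tuning for real markup:

// Illustrative only: extract src/alt pairs from <img> tags with a regular expression.
using System;
using System.Text.RegularExpressions;

class ImgAltRegexExample
{
    static void Main()
    {
        string html = "<p><img src=\"/images/poster.jpg\" alt=\"Movie poster\" /></p>";

        Regex imgPattern = new Regex(
            "<img[^>]*?src=\"(?<src>[^\"]*)\"[^>]*?alt=\"(?<alt>[^\"]*)\"[^>]*>",
            RegexOptions.IgnoreCase);

        foreach (Match match in imgPattern.Matches(html))
        {
            // e.g. "/images/poster.jpg -> Movie poster"
            Console.WriteLine("{0} -> {1}", match.Groups["src"].Value, match.Groups["alt"].Value);
        }
    }
}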


Top 10 Contributor
229 Posts

Got it!

Can you please advise how I can improve performance by changing the Configuration table?

1. I crawl only one domain (maybe a few in parallel).
2. No new rules; just the rules set to true by default.
3. I don't want to crawl files.
4. I don't need images or files; I extract only text from the HTML using the HtmlAgilityPack (for XPath support).
5. After crawling and extracting the data, I write it to a database.

What can I do to improve performance? Currently I am crawling a 5,000-page website and it just takes too long - 2 hours or so.

Can you help me understand which properties to disable in the Configuration table? Any other tips would also be welcome.

Thank you,

Top 10 Contributor
229 Posts
Verified by megetron

Thanks to Mike's help I figured out how to get much, much better performance.

I disabled the Insert and Save properties that I don't use, except InsertExceptions, which I use for debugging.

All of the email, file and image support is disabled too. I don't save web pages either.

There is an option to use 100% of your CPU in a crawl by adjusting MaximumNumberOfCrawlThreads, and I increased DesiredMaximumMemoryUsageInMegabytes for better performance.
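For anyone reading along, the changes described above boil down to something like this. This is a sketch only: apart from InsertExceptions, MaximumNumberOfCrawlThreads and DesiredMaximumMemoryUsageInMegabytes, which are mentioned in this thread, the property names are guesses - check your own Configuration table / application settings for the real ones:

// Sketch of the tuning described above; most property names here are illustrative guesses.
applicationSettings.InsertExceptions = true;                  // keep, useful for debugging
applicationSettings.InsertWebPages = false;                   // hypothetical: skip storing web pages
applicationSettings.InsertImages = false;                     // hypothetical: no image support needed
applicationSettings.InsertFiles = false;                      // hypothetical: no file support needed
applicationSettings.InsertEmailAddresses = false;             // hypothetical: no email support needed
applicationSettings.MaximumNumberOfCrawlThreads = Environment.ProcessorCount;  // use all cores
applicationSettings.DesiredMaximumMemoryUsageInMegabytes = 4096;               // allow more memory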

 

All of this gives much better results. The previous crawl without the new settings took me 3 hours to complete, where the new settings finished in less than 10 minutes.

Thanks Mike.

Top 10 Contributor
1,905 Posts

You are very welcome!

