arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL 2005/2008/CE

Crawling several sites with 1.2 version


Top 10 Contributor
229 Posts
megetron posted on Sun, Jul 26 2009 12:43 PM

Hello, I know the questions below have been asked before in the forums, but my question concerns version 1.2. I know there were some changes, and I want to be sure the answers are still the same.

So here it is, I want to collect data from several domains.
I want to collect a title, description, and a link.

Now my questions are:
1. How do I collect the data if the HTML structure differs from one site to another?
Let's say that on AAA.COM I want to crawl only the pages named AAA.COM/movie?id=xxx
where the HTML structure of the page looks like:
<div id="title">some title</div>
<div id="description">description</div>

and on BBB.COM I want to crawl only the pages named BBB.COM/series?number=xxx
where the HTML structure of the page looks like:
<h1>some title</h1>
<h2>description</h2>


If I want to crawl sites that differ like this, will I need to write a new plugin and edit the rules for each and every site separately? (I hope not.)

2. These sites change several times a day. How can I recrawl for new pages or pages that have been updated? Is there a configuration trick that can solve this issue?

Please advise on that.
Thank you.

All Replies

Top 10 Contributor
1,692 Posts

1.) You will need to write separate rules for each site, but one plugin will work.  Else, how would the plugin know what information you want to pull?  You can use UserDefinedFunctions.ExtractDomain or UserDefinedFunctions.ExtractHost to perform the filtering/switching.
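As a rough illustration of "separate rules, one plugin", here is a minimal sketch of a per-host rule table built only on the .NET base class library. SiteRule, SiteRules and Extract are hypothetical names, not part of arachnode.net, and System.Uri.Host stands in for UserDefinedFunctions.ExtractHost, whose signature is not shown in this thread. The hosts and patterns are taken from the examples in the question; regular expressions are used only to keep the sketch short.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical per-site rule: which pages to keep and how to find each field.
public sealed class SiteRule
{
    public Regex PathFilter;
    public Regex Title;
    public Regex Description;
}

public static class SiteRules
{
    private const RegexOptions Options = RegexOptions.IgnoreCase | RegexOptions.Singleline;

    // One entry per host; hosts and patterns mirror the examples in the question.
    private static readonly Dictionary<string, SiteRule> Rules =
        new Dictionary<string, SiteRule>(StringComparer.OrdinalIgnoreCase)
        {
            { "aaa.com", new SiteRule {
                PathFilter = new Regex(@"^/movie\?id=\w+$", Options),
                Title = new Regex(@"<div id=""title"">(.*?)</div>", Options),
                Description = new Regex(@"<div id=""description"">(.*?)</div>", Options) } },
            { "bbb.com", new SiteRule {
                PathFilter = new Regex(@"^/series\?number=\w+$", Options),
                Title = new Regex(@"<h1>(.*?)</h1>", Options),
                Description = new Regex(@"<h2>(.*?)</h2>", Options) } }
        };

    // Returns { title, description, link }, or null when no rule applies to the page.
    public static string[] Extract(string absoluteUri, string html)
    {
        var uri = new Uri(absoluteUri);
        SiteRule rule;
        if (!Rules.TryGetValue(uri.Host, out rule) || !rule.PathFilter.IsMatch(uri.PathAndQuery))
            return null;

        Match title = rule.Title.Match(html);
        Match description = rule.Description.Match(html);
        if (!title.Success || !description.Success)
            return null;

        return new[] { title.Groups[1].Value.Trim(), description.Groups[1].Value.Trim(), absoluteUri };
    }

    public static void Main()
    {
        string html = "<h1>some title</h1>\n<h2>description</h2>";
        string[] row = Extract("http://bbb.com/series?number=7", html);
        Console.WriteLine(row == null ? "skipped" : string.Join(" | ", row));
    }
}
```

Adding a site then means adding one entry to the table rather than writing another plugin. Note that Uri.Host includes any "www." prefix, so the keys would have to match the hosts as they are actually crawled.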

2.) The easiest approach would be to create a Crawler (new Crawler()), feed it the AbsoluteUris you want to crawl, and do this on a timed cycle.  When the crawl is complete, you could instantiate a new Crawler in the 'Engine_OnCrawlCompleted' event, which you would need to do to clear the Cache, and then start the crawling process over again.  Google solved the 'which pages are new/changed' problem with their SiteMaps file.  You could write an Engine plug-in that scans for .RSS pages and gives those pages and the Discovered AbsoluteUris priority when crawling, but if you start at the main page of a site, it is likely that you will pick up an .rss feed that provides new links to be crawled.  In reality, crawling a set of sites will likely need to be tweaked for that particular set of sites.
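To make the .rss idea a little more concrete, here is a minimal sketch, assuming a plain RSS 2.0 feed, that pulls the item links out of a feed so they can be fed to the next crawl cycle. RssSeeds and ExtractLinks are hypothetical names, not part of arachnode.net; only System.Xml.Linq is used, and how the links are handed to the Crawler is left out because that API is not shown in this thread.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class RssSeeds
{
    // Returns the distinct, non-empty <item><link> values of an RSS 2.0 document.
    public static List<string> ExtractLinks(string rssXml)
    {
        XDocument doc = XDocument.Parse(rssXml);
        return doc.Descendants("item")
                  .Select(item => (string)item.Element("link")) // null when <link> is missing
                  .Where(link => !string.IsNullOrWhiteSpace(link))
                  .Select(link => link.Trim())
                  .Distinct()
                  .ToList();
    }

    public static void Main()
    {
        // Made-up feed content, for illustration only.
        string feed =
            "<rss version=\"2.0\"><channel><title>demo</title>" +
            "<item><title>one</title><link>http://aaa.com/movie?id=1</link></item>" +
            "<item><title>two</title><link>http://aaa.com/movie?id=2</link></item>" +
            "</channel></rss>";

        foreach (string link in ExtractLinks(feed))
            Console.WriteLine(link);
    }
}
```

Atom feeds and namespaced RSS variants use different element names, so this would need adjusting per feed, in line with the point above about tweaking per set of sites.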

Does this answer your questions?

Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
229 Posts

It does, partially.

Is there a tutorial on how to write a new plugin and how to attach it to the rules? Which files need to be modified? Point me in the right direction and I will debug from there and see if it works the way I need it to.
Where do I save the data I want to extract, and in what pattern exactly? Let's say I have a title, a description, and a link to store in the database: what is the name of the table that stores them, and what are the column names?

About the recrawling, why doesn't arachnode use the HTML timestamp? Each page on the internet has a timestamp that tells you when the page was last changed. It could be added to the configuration table, and before scanning a page the crawler could compare it to the copy of the page in the database.
Am I missing the point?

Thank you.

Top 10 Contributor
1,692 Posts

There isn't an explicit tutorial - but these are the steps...

1.) Find one of the existing plugins.  'Anonymizer.cs' is the simplest and shortest.

2.) Create a new class using the name of your choice.

3.) Examine the 'CrawlActions' database table and follow the present pattern.

That's it.

Content caching is in place: it is set in WebClient.cs and managed by the OS/.NET Framework.  http://msdn.microsoft.com/en-us/library/system.net.cache.requestcachelevel.aspx

I have a comparison of the .NET caching against explicit 'Last-Modified' handling slated for Version 1.3, although the way it currently caches looks quite good.
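For reference, an explicit 'Last-Modified' check could look roughly like the sketch below. It is an illustration only, not how arachnode.net 1.2 or 1.3 works: RecrawlCheck and NeedsRecrawl are hypothetical names, and the stored crawl time is assumed to be kept in UTC. The header value is parsed with the standard "r" (RFC 1123) format.

```csharp
using System;
using System.Globalization;

public static class RecrawlCheck
{
    // True when the page should be fetched again. A missing or unparseable
    // header counts as "changed", since not every server sends Last-Modified.
    public static bool NeedsRecrawl(string lastModifiedHeader, DateTime? lastCrawledUtc)
    {
        if (lastCrawledUtc == null)
            return true; // never crawled before

        DateTimeOffset lastModified;
        bool parsed = DateTimeOffset.TryParseExact(
            lastModifiedHeader, "r", CultureInfo.InvariantCulture,
            DateTimeStyles.AssumeUniversal, out lastModified);

        if (!parsed)
            return true; // no usable timestamp: recrawl to be safe

        return lastModified.UtcDateTime > lastCrawledUtc.Value;
    }

    public static void Main()
    {
        // Hypothetical stored crawl time and header value.
        DateTime lastCrawl = new DateTime(2009, 7, 26, 12, 0, 0, DateTimeKind.Utc);
        Console.WriteLine(NeedsRecrawl("Sun, 26 Jul 2009 12:43:00 GMT", lastCrawl));
    }
}
```

The fallback to recrawling matters because many dynamically generated pages omit the header or always report the current time, which may be one reason to compare this approach against the built-in caching before relying on it.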

Thanks!
Mike

 



copyright 2004-2013, arachnode.net LLC