arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE, MongoDB, RavenDB, or Hadoop


Customization of crawl operation, conditional and partial crawling

Answered (Verified) This post has 1 verified answer | 3 Replies | 2 Followers

Top 150 Contributor
2 Posts
haimoon posted on Thu, May 13 2010 1:35 AM

Hello all,


We are in the process of evaluating Arachnode as a crawling solution for our new project.
To ensure that Arachnode suits our requirements, we wish to set some custom rules and actions.

Can anyone point me to where I can manage these rules/actions, and to their format?

Haim


All Replies

Top 10 Contributor
1,905 Posts

You bet.

The rules and actions are configured in the cfg.CrawlActions and cfg.CrawlRules database tables.  Most of the rules and actions are compiled into the SiteCrawler project, which is available with purchase.

(adding annotations shortly...)

https://www.youtube.com/watch?v=9ArfTupYjLY

I have a lot of new features coming shortly, including but not limited to: MULTI_SERVER_CRAWLING_AND_CACHING.

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 150 Contributor
2 Posts

Hello

Thank you for the quick response.

How can I specify conditional and partial crawling for a specific host?

For example, I wish to collect only specific HTML elements from each site, and never to let the crawler be redirected outside of the site's domain.

Can that be done?

Haim

Top 10 Contributor
1,905 Posts
Verified by arachnode.net

You are welcome.

When creating the CrawlRequest, specify 'UriClassificationType.Domain' for 'RestrictCrawlTo'; this will restrict the Crawl to that Domain only.
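As an illustrative sketch only: the CrawlRequest constructor overloads vary between arachnode.net releases, so every parameter below except UriClassificationType.Domain for RestrictCrawlTo (which is what this thread recommends) is an assumption; compare against the overloads in your copy of the source.

```csharp
// Illustrative sketch; check your release's CrawlRequest overloads.
// 'crawler' is assumed to be an already-configured Crawler instance.
CrawlRequest crawlRequest = new CrawlRequest(
    new Discovery("http://example.com/"), // starting AbsoluteUri (hypothetical)
    int.MaxValue,                         // maximum crawl depth (assumed parameter)
    UriClassificationType.Domain,         // RestrictCrawlTo: never leave this Domain
    UriClassificationType.Domain,         // RestrictDiscoveriesTo (assumed parameter)
    1);                                   // priority (assumed parameter)

crawler.Crawl(crawlRequest);
```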

If you wish to collect and store only specific HTML elements, modify the Data property of the CrawlRequest in a PostRequest CrawlAction; the Data property is what is inserted into the database and/or saved to disk.
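Sketched as pseudocode, a PostRequest CrawlAction that keeps only part of the page might look like the following; the base-class name, override signature, and the type of Data are assumptions here, so compare against the actions shipped in the SiteCrawler project for the exact contract.

```csharp
// Illustrative sketch; base class and method signature are assumptions.
public class KeepOnlyBodyAction : ACrawlAction
{
    public override void PerformAction(CrawlRequest crawlRequest)
    {
        // Runs as CrawlActionType.PostRequest, i.e. after the page is fetched.
        string html = System.Text.Encoding.UTF8.GetString(crawlRequest.Data);

        // Whatever is left in Data is what gets written to the database
        // and/or saved to disk.
        int start = html.IndexOf("<body", System.StringComparison.OrdinalIgnoreCase);
        if (start >= 0)
        {
            crawlRequest.Data = System.Text.Encoding.UTF8.GetBytes(html.Substring(start));
        }
    }
}
```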

You could also let HtmlAgilityPack parse the DOM and use XPath to select the HtmlNodes you want to store, storing these elements in a separate table.
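The HtmlAgilityPack side of that is standard library usage; the HTML snippet and XPath expression below are just examples (storing the selected elements in your own table is left as a comment):

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

class Example
{
    static void Main()
    {
        string html = "<html><body><h1>Title</h1><div class='price'>42</div></body></html>";

        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing
        // matches, so guard before iterating.
        HtmlNodeCollection nodes = document.DocumentNode.SelectNodes("//div[@class='price']");
        if (nodes != null)
        {
            foreach (HtmlNode node in nodes)
            {
                // Store node.OuterHtml / node.InnerText in a separate table here.
                Console.WriteLine(node.InnerText);
            }
        }
    }
}
```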

So, yes, what you want to accomplish can be done.

How can I assist you further?  If you purchase a license, I would be grateful to show you how to accomplish your goals.

Many thanks!

-Mike



copyright 2004-2017, arachnode.net LLC