arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE or MongoDB/RavenDB/Hadoop


How to restrict crawl to single domain?


Top 50 Contributor
8 Posts
jaydeep posted on Thu, Feb 5 2009 4:57 AM

Hello Guys,

I am stuck and hoping someone has an idea!

I want to crawl only the links that belong to my domain, e.g. http://www.mydomain.com. I want to crawl all the pages under http://www.mydomain.com, like http://www.mydomain.com/1.aspx, http://www.mydomain.com/2.aspx, etc., and one of those pages contains a link to http://www.yahoo.com, but I do not want to crawl http://www.yahoo.com. I know there is configuration through Application.config in the Configuration project, but if I set createCrawlRequestsFromDatabaseHyperLinks to false then it crawls only one link, which is http://www.mydomain.com, and I want data from all my sub-pages.

Can this be done?

I hope I am clear enough.

Thanks
JD

All Replies

Top 10 Contributor
1,905 Posts
arachnode.net replied on Tue, Feb 10 2009 2:33 PM

The easiest way to crawl a single site is to make sure you're crawling only CrawlRequests, as set in Application.config.  (Don't create CrawlRequests from Database HyperLinks or Database WebPages.)

Submit a CrawlRequest with a depth of 4 (any deeper and you'll need to check out CrawlRules.config for the Depth CrawlRule, I believe) and be sure to set RestrictToUriHost to true.

This configuration will crawl until all content found at depth 4 is complete.  If you started a crawl like this at MSN you'd likely pick up 250,000 WebPages. 
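
As a rough sketch of what that setup might look like in code: everything below that isn't named in this thread (the namespace, the constructor signatures, the member names) is an assumption and varies between arachnode.net releases, so treat it as illustrative only.

    // Hypothetical single-site crawl.  Per the settings described above,
    // Application.config (Configuration project) would have:
    //   CreateCrawlRequestsFromDatabaseHyperLinks = false
    //   CreateCrawlRequestsFromDatabaseWebPages   = false
    //   RestrictToUriHost                         = true   // stay on www.mydomain.com

    using Arachnode.SiteCrawler;   // assumed namespace

    class SingleDomainCrawl
    {
        static void Main()
        {
            Crawler crawler = new Crawler();   // constructor arguments vary by release

            // Depth 4, as suggested above; anything deeper also needs the Depth
            // CrawlRule in CrawlRules.config.  The CrawlRequest signature is illustrative.
            crawler.Crawl(new CrawlRequest("http://www.mydomain.com", 4));

            crawler.Engine.Start();            // member names are assumptions
        }
    }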

Of course, there is likely to be more content to be crawled.  So, check out this forum post on how to restrict the entire system to a domain: http://arachnode.net/forums/p/103/367.aspx#367  You'll want to do this if you don't want to manually feed the CrawlRequests table with requests from your intended Domain, as the database tables will otherwise contain HyperLinks and Files from other Domains.

If you think crawling a single domain is too complicated, would you make a forum post with a suggestion?  My eventual aim for arachnode.net is to make it as accessible as possible.

Thanks!
Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 50 Contributor
8 Posts

Hello Mike,

Wow, great, thanks for your reply. It certainly works.

Actually, I will have 5 websites to crawl and I want them to be crawled every day, so perhaps I will use a Windows service for that.

I have two issues.

1)  After crawling one site I have more sites to crawl, so how can I continue with the next one after the first has finished?  Also, since I want this to repeat every day, do I have to make an entry in the CrawlRequests table every day (which obviously does not make sense!), as it clears the entry we made once the crawl starts?

2)  If in the future http://www.mydomain.com has a few modified pages or newly added pages, and I crawl that website again, what will happen to my old crawled data?

I hope I am clear enough.

And yeah, I will wait for the improvements to the Lucene search which you are working on. :)

Thanks again
JD

Top 10 Contributor
1,905 Posts
arachnode.net replied on Wed, Feb 11 2009 9:41 AM
  1. After all of the CrawlRequests are crawled, if you have CreateCrawlRequestsFromDatabaseHyperLinks or CreateCrawlRequestsFromDatabaseWebPages set to true, then arachnode.net will create CrawlRequests from those AbsoluteUris.  There is an order in which CrawlRequests are processed: a.) CrawlRequests, b.) HyperLinks (not already in the WebPages table) and c.) WebPages.  (The next release adds the ability to re-crawl Files as well.)  If you look at the stored procedure arachnode_omsp_CrawlRequests_SELECT you'll see that it selects the TOP 3000 rows from a UNION of the TOP 3000 CrawlRequests, HyperLinks and WebPages.  If you wanted to adjust this balance, say so that 2,900 HyperLinks not in the WebPages table and 100 WebPages were submitted for crawling, you could do so there.  So, after all CrawlRequests are crawled, and all HyperLinks (not already in the WebPages table) are crawled, then WebPages will be submitted for re-crawling.  (And you're right... having to maintain the CrawlRequests table manually would be a chore.  Just for fun, submit a deep-depth crawl, set desiredMaximumMemoryUsageInMegabytes to a low number, and keep an eye on the CrawlRequests table.)
  2. No data is deleted from arachnode.net.  So, if your WebPages are re-crawled and their Source has changed, the date values in the WebPages table will update.  (The original post included a screenshot of re-crawled WebPages; a rough way to check this yourself is sketched below.)
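
A minimal sketch of such a check, assuming the arachnode.net database has a dbo.WebPages table with AbsoluteUri and LastDiscovered columns; those column names, the database name and the connection string are assumptions, so adjust them to your install:

    // Lists the most recently (re-)crawled WebPages by sorting on a date column.
    using System;
    using System.Data.SqlClient;

    class RecrawlCheck
    {
        static void Main()
        {
            const string connectionString =
                "Server=.;Database=arachnode.net;Integrated Security=true;";   // adjust for your install

            using (SqlConnection connection = new SqlConnection(connectionString))
            using (SqlCommand command = new SqlCommand(
                "SELECT TOP 100 AbsoluteUri, LastDiscovered " +
                "FROM dbo.WebPages ORDER BY LastDiscovered DESC", connection))
            {
                connection.Open();

                using (SqlDataReader reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        // A date value that moved forward means the page was re-crawled.
                        Console.WriteLine("{0}  {1}", reader[1], reader[0]);
                    }
                }
            }
        }
    }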

I updated the demo for the upcoming lucene.net functionality improvements: http://arachnode.net/Content/LiveDemonstration.aspx

Since I know you're waiting I'll make finishing a build top priority this weekend.  I can't guarantee that I'll finish as the reporting views and stored procedures have to be modified for Release 1.1 and they always take longer than I think they will.  Also, I've made changes to the DB structure and to the lucene.net index format/fields.  Will you need a DB conversion script and a lucene.net index conversion utility?

Top 50 Contributor
8 Posts

Hello Mate,

Here my point is: I had put two crawl requests in the CrawlRequests table and both were crawled, but now they are no longer in the table.  Now suppose I want to crawl them again, what should be done?  I am going to have 500 domains to crawl, which I will put in the CrawlRequests table, so my question is: is there anything that can be done so that I do not have to enter the 500 names again and again every time I want to crawl them?

Let me know if I am not clear!

Thanks for all your help
JD

Top 10 Contributor
1,905 Posts

OK, how about this: if you want to crawl 500 domains, you would configure arachnode.net to restrict Crawls to those 500 Domains only, as the posts above describe.  Then, make sure your settings in Application.config are set as shown (CreateCrawlRequestsFromDatabaseHyperLinks and CreateCrawlRequestsFromDatabaseWebPages set to true).

With the settings set as shown above, the Crawl process works like this (simplified, leaving out how Files are handled):

  • Crawl all database CrawlRequests.  CrawlRequests generate HyperLinks and a WebPage.
  • Crawl all database HyperLinks.  HyperLinks generate WebPages.
  • Crawl all WebPages (again).

So, if you had 500 Domains let's say that those 500 CrawlRequests generated 500,000 HyperLinks.  Of those 500,000 HyperLinks, 100,000 of them actually belong to your 500 Domains.  Then, those 100,000 HyperLinks would be crawled and would generate 100,000 WebPages.  When all CrawlRequests are done crawling and all HyperLinks belonging to your 500 Domains have been crawled (are found in the WebPages table), then all WebPages are crawled.  When all WebPages are crawled, this is essentially the same as resubmitting the original 500 CrawlRequests, except that arachnode.net won't have to refilter the 500,000 HyperLinks that it finds.
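
To avoid typing 500 entries by hand, the seed CrawlRequests could be submitted from a small loop; as with the earlier sketch, the namespace and the Crawler/CrawlRequest signatures below are assumptions and only the names mentioned in this thread come from arachnode.net itself.

    // Hypothetical seeding loop: reads one AbsoluteUri per line (e.g. http://www.mydomain.com)
    // from a text file and submits each as a seed CrawlRequest.  With
    // CreateCrawlRequestsFromDatabaseHyperLinks and CreateCrawlRequestsFromDatabaseWebPages
    // set to true, arachnode.net then keeps re-crawling those Domains on its own, as
    // described above, so the list only has to be submitted once.
    using System.IO;
    using Arachnode.SiteCrawler;   // assumed namespace

    class SeedDomains
    {
        static void Main()
        {
            Crawler crawler = new Crawler();   // constructor arguments vary by release

            foreach (string line in File.ReadAllLines("domains.txt"))
            {
                string domain = line.Trim();

                if (domain.Length == 0)
                {
                    continue;                  // skip blank lines in the seed file
                }

                // Depth and restriction settings as discussed earlier in the thread;
                // the CrawlRequest signature is illustrative only.
                crawler.Crawl(new CrawlRequest(domain, 4));
            }

            crawler.Engine.Start();            // member names are assumptions
        }
    }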

Does this help?

Top 10 Contributor
229 Posts
megetron replied on Fri, May 22 2009 10:34 AM

I am not sure.

I have done everything you suggested, and still, for some reason, the WebPages table contains lots of records from http://sketchup.google.com.

What am I missing?  I changed the configuration according to lots of other posts regarding single domain/several domains and still I can't get it to work.  Please help me understand what I am doing wrong.

And another thing, just a suggestion: maybe you could add a MODE feature.  Modes could be Single Site/Several Sites/No Limits/By Country and other popular settings.
Maybe lots of code changes are needed for this, but if you could have one settings file holding all settings (and not in a database), then you could offer settings files for download from this site, so everyone could download exactly the template they need.  Just an idea; as a new user it would be nice.  I am sure I won't need it once I understand the code, whenever that will be. :)

Thank you.  Waiting for an answer on my issue.

Top 10 Contributor
1,905 Posts

You may be encountering a known bug that saves CrawlRequests from other domains when the Crawl process is shut down.  I'm working on a new build that should be available soon.

Another user suggested the template idea - I'll keep that in mind.

If you are running from an official release - try downloading the code from the trunk.

copyright 2004-2017, arachnode.net LLC