arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE, MongoDB, RavenDB, or Hadoop


Starting new crawls while others are running

Answered (Verified) This post has 1 verified answer | 2 Replies | 2 Followers

Top 75 Contributor
7 Posts
aerpricecom posted on Mon, Jun 29 2015 8:15 AM

Hello

What is the best way to start a new crawl while others are running? I have a site where users can input sites for crawling, and I'm going to start new crawls every 5 minutes. Should I just run a new instance?

Best regards,

Asger

Verified Answer

Top 10 Contributor
1,905 Posts
Verified by aerpricecom

1.) A new instance will consume 10-70 MB of RAM, so this is an option and probably the easiest to understand from a development/implementation perspective. There is a bit of startup time associated with this; to best understand what is loaded and where the time is consumed, set up a Performance session and end program execution after 'Engine.Start' completes. (See the first sketch below.)

2.) Look at how the Application project calls 'BeginCrawl'. You can use this to spin up a new Crawl on demand. The Crawler/Engine work from a PriorityQueue, and each Crawl has an associated PriorityQueue -> each Crawl crawls in order, like a browser would, to minimize the chance that advanced crawl-detection algorithms will flag your requests. Also, the Engine assigns CrawlRequests to each Crawl in bulk to minimize locking/blocking. Therefore, while you may submit a new CrawlRequest with a high priority, you will need to wait until the next round of assignment in the Engine by 'AssignCrawlRequestsToCrawls(...);', or call this method yourself (set it to internal/public). (See the second sketch below.)

3.) Or, figure out the maximum number of threads (X threads) one process can sustain on one machine, set up X instances of the Console/Service, and have them pull from a shared Queue: for example, pull only 1 CrawlRequest at a time from the Database (and don't forget to delete it right after, with an Engine action). (See the third sketch below.)
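
For 1.), a minimal sketch of launching a separate Console instance per submitted site. The executable path and the assumption that it accepts an AbsoluteUri on its command line are mine, not part of the stock Console project, so adapt it to however your build is laid out:

    // Option 1 sketch: one crawler process per submitted site.
    // The path and the AbsoluteUri argument are assumptions; the stock Console
    // project may need a small change to accept a URI on its command line.
    using System.Diagnostics;

    public static class CrawlProcessLauncher
    {
        public static Process StartCrawlProcess(string absoluteUri)
        {
            ProcessStartInfo startInfo = new ProcessStartInfo
            {
                FileName = @"C:\arachnode.net\Console\bin\Release\Arachnode.Console.exe",
                Arguments = "\"" + absoluteUri + "\"",
                UseShellExecute = false,
                CreateNoWindow = true
            };

            // Each instance costs roughly 10-70 MB of RAM plus the startup time
            // measured above, so watch how many you keep alive at once.
            return Process.Start(startInfo);
        }
    }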
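
For 2.), a minimal sketch of handing a CrawlRequest to a Crawler that is already running. The namespaces are assumptions, and the CrawlRequest itself should be constructed the same way the Application project does inside 'BeginCrawl', since its constructor arguments vary between releases:

    // Option 2 sketch: submit a CrawlRequest to a running Crawler on demand.
    // Namespaces are assumptions; check your solution for the actual ones.
    using Arachnode.SiteCrawler;
    using Arachnode.SiteCrawler.Value;

    public class OnDemandCrawlSubmitter
    {
        private readonly Crawler _crawler;

        public OnDemandCrawlSubmitter(Crawler crawler)
        {
            _crawler = crawler;
        }

        public void Submit(CrawlRequest crawlRequest)
        {
            // Even with a high priority, the request is only handed to a Crawl at
            // the next bulk assignment round ('AssignCrawlRequestsToCrawls(...)').
            // Expose that method (internal -> public) and call it on the Engine if
            // the request must be picked up immediately.
            _crawler.Crawl(crawlRequest);
        }
    }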
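
For 3.), a minimal sketch of the shared-queue pull, assuming a hypothetical dbo.SharedCrawlRequests table that your site writes to. It deletes the row as part of the dequeue so two instances can never grab the same request; deleting right afterwards with an Engine action, as described above, works as well:

    // Option 3 sketch: several Console/Service instances share one database queue
    // and each atomically pulls (and deletes) a single pending request at a time.
    // The table and columns (dbo.SharedCrawlRequests, AbsoluteUri, Priority) are
    // hypothetical.
    using System.Data.SqlClient;

    public static class SharedCrawlQueue
    {
        private const string DequeueSql = @"
            WITH NextRequest AS
            (
                SELECT TOP (1) AbsoluteUri
                FROM dbo.SharedCrawlRequests WITH (ROWLOCK, READPAST)
                ORDER BY Priority DESC
            )
            DELETE FROM NextRequest
            OUTPUT DELETED.AbsoluteUri;";

        // Returns the next AbsoluteUri to crawl, or null if the queue is empty.
        public static string TryDequeue(string connectionString)
        {
            using (SqlConnection connection = new SqlConnection(connectionString))
            using (SqlCommand command = new SqlCommand(DequeueSql, connection))
            {
                connection.Open();
                return command.ExecuteScalar() as string;
            }
        }
    }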

Take a look at these options, weigh them against the unknowns (what are you crawling, images/files, and to what depth?), and let me know.

Thanks!
Mike 


For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies


Top 75 Contributor
7 Posts

I'm only crawling the webpage: no files and no images. I have tried playing around with depth; currently I think it's 20, but I'm probably going to lower it.


copyright 2004-2017, arachnode.net LLC