What is the best way to start a new crawl while others are running? I have a site where users can input sites for crawling, and I'm going to start new crawls every 5 minutes. What is the best way to start a new crawl while another one is already running? Should I just run a new instance?
1.) A new instance will consume 10-70 MB of RAM, so this is an option, and probably the easiest to understand from a development/implementation perspective. There is some startup time associated with this; to see exactly what is loaded and where the time is consumed, set up a Performance session and end program execution after 'Engine.Start' completes.
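Option 1 can be sketched as launching a separate OS process per crawl. This is an illustrative sketch, not the library's own code: the crawler executable path and seed-URL argument are hypothetical placeholders for your actual Console/Service binary, and the `main` method launches `java -version` purely as a runnable stand-in.

```java
import java.io.File;
import java.io.IOException;

public class CrawlLauncher {
    // Launch one crawl as its own process. 'crawlerExe' and 'seedUrl' are
    // hypothetical; substitute your real console/service binary and arguments.
    public static Process startCrawlProcess(String crawlerExe, String seedUrl) throws IOException {
        ProcessBuilder pb = new ProcessBuilder(crawlerExe, seedUrl);
        pb.redirectErrorStream(true); // merge stderr into stdout for simpler logging
        return pb.start();            // returns immediately; the crawl runs independently
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the crawler binary so the sketch runs anywhere:
        String java = System.getProperty("java.home") + File.separator
                + "bin" + File.separator + "java";
        Process p = startCrawlProcess(java, "-version");
        System.out.println("stand-in process exited with " + p.waitFor());
    }
}
```

Because each crawl is its own process, a crash or memory spike in one crawl cannot take down the others, at the cost of the per-instance RAM and startup time noted above.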
2.) Look at how the Application project calls 'BeginCrawl'. You can use this to spin up a new Crawl on demand. The Crawler/Engine work based on a PriorityQueue, and each Crawl has an associated PriorityQueue -> each Crawl crawls in order, like a browser would, to minimize the chance that advanced crawl-detection algorithms will flag your requests. Also, the Engine assigns CrawlRequests to each Crawl in bulk to minimize locking/blocking. Therefore, while you may submit a new CrawlRequest with a high priority, you will need to wait until the next round of assignment in the Engine by 'AssignCrawlRequestsToCrawls(...);', or call this method yourself (change it from internal to public).
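The bulk-assignment behavior in option 2 can be illustrated with a small sketch. The class and field names here are assumptions modeled on the description (the real Engine/Crawl types differ): pending requests sit in an engine-level queue and only reach a Crawl's PriorityQueue when the analogue of 'AssignCrawlRequestsToCrawls(...)' runs, which is why even a high-priority request waits for the next assignment round.

```java
import java.util.*;

public class EngineSketch {
    // Hypothetical stand-in for the library's CrawlRequest; lower number = higher priority.
    record CrawlRequest(String url, int priority) {}

    // Engine-level holding queue: requests land here when submitted.
    static final Queue<CrawlRequest> incoming = new ArrayDeque<>();

    // The Crawl's PriorityQueue: requests are only crawlable once assigned here.
    static final PriorityQueue<CrawlRequest> crawlQueue =
            new PriorityQueue<>(Comparator.comparingInt(CrawlRequest::priority));

    // Analogue of AssignCrawlRequestsToCrawls(...): move all pending requests
    // in one pass to minimize locking/blocking.
    static void assignCrawlRequestsToCrawls() {
        CrawlRequest cr;
        while ((cr = incoming.poll()) != null) {
            crawlQueue.offer(cr);
        }
    }

    public static void main(String[] args) {
        incoming.add(new CrawlRequest("http://example.com/a", 5));
        assignCrawlRequestsToCrawls();

        // A high-priority request submitted mid-round...
        incoming.add(new CrawlRequest("http://example.com/urgent", 1));
        // ...is not in the Crawl's queue yet:
        System.out.println(crawlQueue.peek().url()); // http://example.com/a

        // After the next assignment round, it jumps to the front:
        assignCrawlRequestsToCrawls();
        System.out.println(crawlQueue.peek().url()); // http://example.com/urgent
    }
}
```

Calling the assignment method yourself (once made public) collapses that waiting period, at the cost of more frequent locking on the shared queue.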
3.) Or, figure out the maximum number of threads (X threads) one process can sustain on one machine, set up X instances of the Console/Service, and have them pull from a shared Queue: pull only 1 CrawlRequest at a time from the Database (and don't forget to delete it right after, with an Engine action).
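The shared-queue pattern in option 3 can be sketched with an in-memory stand-in for the database table. This is an assumption-laden illustration, not library code: each of the X worker instances claims one request at a time, and `poll()` returns and removes the item in a single step, mirroring the "delete it right after" advice so no two instances crawl the same request.

```java
import java.util.*;
import java.util.concurrent.*;

public class SharedQueueSketch {
    // Each of 'instances' workers polls one request at a time from the shared
    // queue; poll() claims and deletes atomically, so every URL is taken once.
    static int drainWithInstances(Queue<String> sharedQueue, int instances) throws InterruptedException {
        Set<String> claimed = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(instances);
        for (int i = 0; i < instances; i++) {
            pool.submit(() -> {
                String url;
                while ((url = sharedQueue.poll()) != null) {
                    claimed.add(url); // stand-in for actually crawling the URL
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return claimed.size();
    }

    public static void main(String[] args) throws Exception {
        Queue<String> shared = new ConcurrentLinkedQueue<>(
                List.of("http://example.com/1", "http://example.com/2", "http://example.com/3"));
        // 3 stand-ins for the "X instances" of the Console/Service:
        System.out.println(drainWithInstances(shared, 3)); // each request claimed exactly once -> 3
    }
}
```

With a real database, the equivalent of `poll()` would be a delete-and-return in one transaction (or a row claimed under a lock), so that a crashed instance cannot leave a request both claimed and uncrawled.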
Take a look at these options, weigh them against the unknowns (what are you crawling? images/files? what depth?), and let me know.
I'm only crawling the webpage. No files and no images. I have tried playing around with depth. Currently I think it's 20, but I'm probably going to lower it.