arachnode.net

Few Questions

Answered (Verified) This post has 1 verified answer | 3 Replies | 2 Followers

Top 25 Contributor
23 Posts
JCrawl posted on Tue, Mar 31 2015 9:13 PM
Hello,

So far, by far the best spider I have seen. I have a few questions...

1) Is it possible to set up AN to run on multiple machines sharing the same cache? Is there any guide for this?
2) How can you restrict the bot to crawling only a single site but still be able to download images etc. from a CDN?
3) Is there a setup guide for using proxies?
4) Does AN know when a proxy/IP is burnt/invalid, so it stops using it?

Also, a small tip: for anyone setting up the project and running it for the first time, most of the directory locations that need to be set are found in the configuration table; however, there is one directory in the CrawlActions table that took me a while to find. The setting is for Arachnode.Plugins.CrawlActions.ManageLuceneDotNetIndexes, and by default it looks for the M:\ drive...

Jason

Answered (Verified) Verified Answer

Top 10 Contributor
1,905 Posts
Verified by JCrawl

Jay:

Thank you!  arachnode.net (AN) has been around in one form or another since 2004.5...  there are thousands of users and thousands and thousands of hours into the work.  I really appreciate the kind words.

1.) It is - if you are planning to share a DB, the config depends on whether you intend to share the DB simultaneously with unrelated Crawls.  If so, look at ApplicationSettings.UniqueIdentifier.  This setting appends a QueryString to each AbsoluteUri, so be sure to select something unique for each crawling instance or group of machines which will logically crawl together.  If you intend for each machine to participate in the same crawl, then just point each machine at the central database.  You'll want to make one small change to the Engine...

...always return 'false' from this and the Crawler instances will keep trying to pull from the DB.  They will always be running.
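
Roughly, the pattern being described is: never report "done", so every machine keeps polling the shared queue for new CrawlRequests.  A minimal sketch of that idea (not AN's actual Engine code; the queue and method names below are stand-ins):

    using System;
    using System.Collections.Concurrent;
    using System.Threading;

    // Sketch of the polling pattern described above -- not AN's actual Engine code.
    class PollingCrawlerSketch
    {
        // Stand-in for the central CrawlRequests queue (in AN this is the shared DB).
        static readonly ConcurrentQueue<string> CrawlRequests = new ConcurrentQueue<string>();

        // The change Mike describes: never report "done", so every machine keeps polling.
        static bool ShouldStop() { return false; }

        static void Main()
        {
            while (!ShouldStop())
            {
                string absoluteUri;
                if (CrawlRequests.TryDequeue(out absoluteUri))
                {
                    Console.WriteLine("Crawling " + absoluteUri);
                }
                else
                {
                    Thread.Sleep(1000); // nothing queued right now; poll again shortly
                }
            }
        }
    }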

Or, depending upon your source - you may elect to create an EngineAction, which will load from a non-SQL source (should you choose)...

What are you trying to crawl?  Give as much detail as possible, please.

2.) How to restrict?  Set RestrictCrawlTo in the CrawlRequest constructor to Domain | Host, depending upon exactly where in a site you want to restrict the Crawl to.  Then look at creating a Plugin - a PreAndPostRequest CrawlRule - this way you can set IsDisallowed on everything that doesn't match your site's Domain | Host or the CDN.  Look at UserDefinedFunctions.ExtractDomain and UserDefinedFunctions.ExtractHost.  Work through this tutorial: https://arachnode.net/Content/CreatingPlugins.aspx
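
A rough sketch of the host check such a CrawlRule would perform (the actual AN plugin base class and the UserDefinedFunctions helpers may differ; the names and hosts below are illustrative only):

    using System;

    // Illustration of the "allow my site + its CDN, disallow everything else" check.
    // Not AN's plugin API -- in AN you would do this inside a PreAndPostRequest CrawlRule.
    static class HostRestrictionSketch
    {
        // Your site plus its CDN (placeholder hosts).
        static readonly string[] AllowedHosts = { "www.example.com", "cdn.example-images.net" };

        // Return true when the discovery should be marked IsDisallowed.
        public static bool IsDisallowed(string absoluteUri)
        {
            Uri uri;
            if (!Uri.TryCreate(absoluteUri, UriKind.Absolute, out uri))
            {
                return true; // unparseable URIs are not worth crawling
            }

            foreach (string host in AllowedHosts)
            {
                if (string.Equals(uri.Host, host, StringComparison.OrdinalIgnoreCase))
                {
                    return false; // on the site or its CDN: allow
                }
            }

            return true; // everything else: disallow
        }
    }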

3.) Yes, just drop them into ProxyServers.txt - Address:Port - then look at private static void LoadProxyServers() in Console\Program.cs.  There they go through an initialization routine, including a string to look for after fulfilling an HttpWebRequest from the intended Crawl target.
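
A sketch of what a routine like LoadProxyServers() does, assuming one Address:Port entry per line; this is not AN's actual code, and the validation URL and marker string are placeholders:

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Net;

    // Load proxies from a text file and keep only the ones that actually respond.
    static class ProxyLoaderSketch
    {
        public static List<WebProxy> Load(string path)
        {
            var proxies = new List<WebProxy>();

            foreach (string line in File.ReadAllLines(path))
            {
                string entry = line.Trim();
                if (entry.Length == 0) continue;

                string[] parts = entry.Split(':');                        // "Address:Port"
                var proxy = new WebProxy(parts[0], int.Parse(parts[1]));

                if (RespondsCorrectly(proxy, "http://www.example.com/", "Example Domain"))
                {
                    proxies.Add(proxy);                                   // keep only working proxies
                }
            }

            return proxies;
        }

        // Fetch a known page through the proxy and look for an expected string,
        // mirroring the "string to look for" check Mike mentions.
        static bool RespondsCorrectly(WebProxy proxy, string url, string expected)
        {
            try
            {
                var request = (HttpWebRequest)WebRequest.Create(url);
                request.Proxy = proxy;
                request.Timeout = 10000;

                using (var response = (HttpWebResponse)request.GetResponse())
                using (var reader = new StreamReader(response.GetResponseStream()))
                {
                    return reader.ReadToEnd().Contains(expected);
                }
            }
            catch (WebException)
            {
                return false; // dead or blocked proxy
            }
        }
    }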

4.) Yes, look at private static void Engine_CrawlRequestCompleted(CrawlRequest<ArachnodeDAOMySQL> crawlRequest) in Console\Program.cs - Forbidden is the most common status returned.
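
The idea, sketched outside of AN's actual event handler (the CompletedCrawl type below is invented for illustration; in AN you would inspect the CrawlRequest passed to Engine_CrawlRequestCompleted instead):

    using System.Collections.Generic;
    using System.Net;

    // When a crawl comes back Forbidden, retire the proxy that was used for it.
    class BurntProxySketch
    {
        // Proxies still considered usable.
        readonly List<WebProxy> _activeProxies = new List<WebProxy>();

        // Placeholder for the information a completed crawl would carry.
        class CompletedCrawl
        {
            public WebProxy ProxyUsed;
            public HttpStatusCode StatusCode;
        }

        void OnCrawlCompleted(CompletedCrawl completed)
        {
            // Forbidden (403) is the most common sign the proxy/IP has been burnt.
            if (completed.StatusCode == HttpStatusCode.Forbidden && completed.ProxyUsed != null)
            {
                _activeProxies.Remove(completed.ProxyUsed); // stop handing it out
            }
        }
    }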

Thanks for the tip: yes, they are probably set to a non-standard setup - just run through the DEMO setup when you get a fresh checkout from SVN and all paths will be fixed up as they should be.

Thanks,
Mike 

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies


Top 25 Contributor
23 Posts

Hello Mike,

Thank you for all the information...

1) What is the difference between a CrawlRequest and Discoveries?
2) When setting up CrawlRequests in the DB, what is the difference between AbsoluteUri0, AbsoluteUri1 and AbsoluteUri2?
3) Do you know of a good tutorial on Lucene?

As for crawling: I am attempting to crawl Amazon for data and images... however, they make it really hard to stay within a category. Do you know of any plugins that assist in crawling Amazon?

Lastly, I really like the scraper. The browser tab is great for identifying sections of a site, and the second tab, "Path Filter", looks very promising; however, after I grab the paths and set up the crawl and scrape actions... the version I have does nothing. Is there a more recent version, or am I missing something? The last 2 tabs are also blank.

Thank you,

Jason

Top 10 Contributor
1,905 Posts

1.) A CrawlRequest is anything you wish to Crawl, and a Discovery is a piece of internet content - look in the DB; anything with _Discoveries after it qualifies as a Discovery.  (WebPages do too...)

Here are some good links: https://arachnode.net/Content/FrequentlyAskedQuestions.aspx | https://arachnode.net/forums/p/739/11292.aspx#11292

2.) AbsoluteUri0 is where it originally came from, and this may or may not be set depending upon what RestrictCrawlTo (CrawlRequest(...); constructor) is set to.  AbsoluteUri1 is the parent.  AbsoluteUri2 is the child.
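
For example, reading that literally (hypothetical URIs):

    AbsoluteUri0: http://www.example.com/              (where the crawl originally came from)
    AbsoluteUri1: http://www.example.com/products/     (the parent page being crawled)
    AbsoluteUri2: http://www.example.com/products/1    (the child discovered on that parent)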

3.) This is the official reference for Lucene syntax: http://lucene.apache.org/core/3_6_2/queryparsersyntax.html  Other than this, I find stackoverflow.com really helpful: http://stackoverflow.com/questions/2297794/how-would-one-use-lucene-net-to-help-implement-search-on-a-site-like-stack-overf?rq=1
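
A minimal Lucene.Net 3.x search sketch, assuming an on-disk index with "text" and "absoluteuri" fields such as the one AN's ManageLuceneDotNetIndexes action maintains (the index path and field names are assumptions):

    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Index;
    using Lucene.Net.QueryParsers;
    using Lucene.Net.Search;
    using Lucene.Net.Store;
    using Version = Lucene.Net.Util.Version;

    class LuceneSearchSketch
    {
        static void Main()
        {
            // Point this at wherever your Lucene.NET index was written (assumed path).
            using (var directory = FSDirectory.Open(new System.IO.DirectoryInfo(@"C:\LuceneIndex")))
            using (var reader = IndexReader.Open(directory, true /* readOnly */))
            {
                var searcher = new IndexSearcher(reader);
                var analyzer = new StandardAnalyzer(Version.LUCENE_30);
                var parser = new QueryParser(Version.LUCENE_30, "text", analyzer);

                // Same syntax as the query-parser page linked above.
                Query query = parser.Parse("crawler AND title:arachnode");
                TopDocs hits = searcher.Search(query, 10);

                foreach (ScoreDoc scoreDoc in hits.ScoreDocs)
                {
                    var doc = searcher.Doc(scoreDoc.Doc);
                    System.Console.WriteLine(doc.Get("absoluteuri") + " (" + scoreDoc.Score + ")");
                }
            }
        }
    }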

Crawling Amazon: I don't provide Plugins for specific sites, as 1.) the site layouts change all the time - I would spend time every few days updating them... and 2.) AN is a general-purpose tool for data collection and analysis - I can't provide specific code to parse Amazon.com, if you know what I mean.  ;)

The Scraper: Not finished.

Thanks!
Mike 

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet
