arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE or MongoDB/RavenDB/Hadoop


Reset directories

Answered (Verified) This post has 1 verified answer | 7 Replies | 2 Followers

Top 75 Contributor
7 Posts
jrief posted on Thu, Jun 25 2015 11:01 PM

Is there a stored procedure or some other mechanism to change the location of the file directories?  This mechanism would come in handy when deploying to different boxes.

I know there are rows in the Configuration table for this, and at least one other place (related to the Lucene crawl action?).

 

Answered (Verified) Verified Answer

Top 10 Contributor
1,905 Posts
Verified by arachnode.net

These settings will override the DB settings (look in Console\Program.cs):

//_crawler.ApplicationSettings.DownloadedFilesDirectory = "";

//_crawler.ApplicationSettings.DownloadedImagesDirectory = "";

//_crawler.ApplicationSettings.DownloadedWebPagesDirectory = "";

//_crawler.WebSettings.LuceneDotNetIndexDirectory - this one controls the Lucene.Net index directory.

Consider using a DFS if you want to employ different machines.  The AN code stores the 'DiscoveryPath' in the index.  This value tells the search code where to find the 'on disk' version of the Discovery.  If this cannot be found the DB will be asked to provide the source of the Discovery.

https://en.wikipedia.org/wiki/Distributed_File_System_(Microsoft)

Benefits:

  • The paths stay the same - you won't have to worry about a local path applying to a deployed machine.
  • Reads - any machine, big or small can serve the requests.
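To illustrate the "paths stay the same" point, here is a minimal sketch that derives all four directories from one shared root, such as a DFS namespace path. The StorageDirectories class and the example root are hypothetical and not part of arachnode.net; only the folder and property names come from this thread.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class StorageDirectories
{
    // Hypothetical helper (not part of arachnode.net): builds every storage
    // path under a single root so each machine resolves the same locations.
    public static IDictionary<string, string> Build(string root)
    {
        if (string.IsNullOrWhiteSpace(root))
        {
            throw new ArgumentException("A shared root path is required.", nameof(root));
        }

        var directories = new Dictionary<string, string>();
        string[] folderNames =
        {
            "DownloadedFiles", "DownloadedImages", "DownloadedWebPages", "LuceneDotNetIndex"
        };

        foreach (string folderName in folderNames)
        {
            // Keys are the folder name plus "Directory", e.g. "DownloadedFilesDirectory".
            directories[folderName + "Directory"] = Path.Combine(root, folderName);
        }

        return directories;
    }
}

// Possible use in Console\Program.cs (the root below is a made-up example):
//   var dirs = StorageDirectories.Build(@"\\example.local\crawl");
//   _crawler.ApplicationSettings.DownloadedFilesDirectory = dirs["DownloadedFilesDirectory"];
//   _crawler.ApplicationSettings.DownloadedImagesDirectory = dirs["DownloadedImagesDirectory"];
//   _crawler.ApplicationSettings.DownloadedWebPagesDirectory = dirs["DownloadedWebPagesDirectory"];
```

Keeping the root in one place means a move to another share is a one-line change rather than four.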

Thanks!
Mike 

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies

Top 75 Contributor
7 Posts
jrief replied on Fri, Jun 26 2015 12:53 AM

I'm seeing this in Program.cs:

                //README: This directory will not contain files unless 'OutputConsoleToLogs' is 'true', and 'EnableConsoleOutput' is 'true'.

                string consoleOutputLogsDirectory = Path.Combine(Environment.CurrentDirectory, "ConsoleOutputLogs");

 

                //README: These directories are necessary if you wish to store downloaded Discoveries on disk, and not in the database.

                string downloadedFilesDirectory = Path.Combine(Environment.CurrentDirectory, "DownloadedFiles");

                string downloadedImagesDirectory = Path.Combine(Environment.CurrentDirectory, "DownloadedImages");

                string downloadedWebPagesDirectory = Path.Combine(Environment.CurrentDirectory, "DownloadedWebPages");

                string luceneDotNetIndexDirectory = Path.Combine(Environment.CurrentDirectory, "LuceneDotNetIndex");

This is resetting the values in the database, where I had updated them with the DFS directories. My understanding is that the settings come from the database first and are then overridden in Program.cs. But this code looks like it always resets them to the environment's current directory:

                    //update the config values for the on-disk storage.

                    arachnodeDAO.ExecuteSql("UPDATE cfg.Configuration SET [Value] = '" + consoleOutputLogsDirectory + "' WHERE [Key] = 'ConsoleOutputLogsDirectory'");

                    arachnodeDAO.ExecuteSql("UPDATE cfg.Configuration SET [Value] = '" + downloadedFilesDirectory + "' WHERE [Key] = 'DownloadedFilesDirectory'");

                    arachnodeDAO.ExecuteSql("UPDATE cfg.Configuration SET [Value] = '" + downloadedImagesDirectory + "' WHERE [Key] = 'DownloadedImagesDirectory'");

                    arachnodeDAO.ExecuteSql("UPDATE cfg.Configuration SET [Value] = '" + downloadedWebPagesDirectory + "' WHERE [Key] = 'DownloadedWebPagesDirectory'");

                    arachnodeDAO.ExecuteSql("UPDATE cfg.Configuration SET [Value] = '" + luceneDotNetIndexDirectory + "' WHERE [Key] = 'LuceneDotNetIndexDirectory'"); 
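A side note on the statements quoted above: they build the SQL text by string concatenation, so a path containing an apostrophe would break the statement. Below is a minimal sketch of the same update using parameters. It assumes the plain ADO.NET interfaces in System.Data and an already-open connection, not arachnodeDAO, whose API beyond ExecuteSql(string) is not shown in this thread. ConfigurationWriter is a hypothetical name.

```csharp
using System;
using System.Data;

public static class ConfigurationWriter
{
    // Same table and columns as the statements quoted above, with placeholders
    // instead of concatenated values. The @name syntax is SQL Server style.
    public const string UpdateSql =
        "UPDATE cfg.Configuration SET [Value] = @value WHERE [Key] = @key";

    // Returns the number of rows affected. Assumes 'connection' is already open.
    public static int SetValue(IDbConnection connection, string key, string value)
    {
        using (IDbCommand command = connection.CreateCommand())
        {
            command.CommandText = UpdateSql;
            AddParameter(command, "@value", value);
            AddParameter(command, "@key", key);
            return command.ExecuteNonQuery();
        }
    }

    private static void AddParameter(IDbCommand command, string name, string value)
    {
        IDbDataParameter parameter = command.CreateParameter();
        parameter.ParameterName = name;
        parameter.DbType = DbType.String;
        // A null string is sent as a database NULL.
        parameter.Value = (object)value ?? DBNull.Value;
        command.Parameters.Add(parameter);
    }
}

// Possible use, with an open SqlConnection named 'connection':
//   ConfigurationWriter.SetValue(connection, "LuceneDotNetIndexDirectory", luceneDotNetIndexDirectory);
```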

Top 75 Contributor
7 Posts
jrief replied on Fri, Jun 26 2015 12:55 AM

I would like these values to always come from the database and not have to change source code to change directories.

Top 10 Contributor
1,905 Posts

Yes, this is fine.

This code is helper code to automagically populate the DB with the location of the running source so the initial use/user (think: demo) doesn't have to drill into SQL Server to set configuration values.

Any machine running from your DB will use these values - this is why the ApplicationSettings.Downloaded* values are commented.

Feel free to change Console\Program.cs - it is intended to be a helper application that shows how to use the libraries.
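One possible shape for such a change, sketched under assumptions: keep whatever is already stored in cfg.Configuration and fall back to the working directory only when the stored value is empty. DirectoryDefaults is a hypothetical helper, and how the stored value is read from the database is left out here.

```csharp
using System;
using System.IO;

public static class DirectoryDefaults
{
    // Hypothetical helper: prefers the value already stored in the database
    // and only derives a default when nothing has been configured yet.
    public static string Resolve(string databaseValue, string defaultFolderName)
    {
        if (!string.IsNullOrWhiteSpace(databaseValue))
        {
            return databaseValue.Trim();
        }

        // Same default the stock Program.cs computes.
        return Path.Combine(Environment.CurrentDirectory, defaultFolderName);
    }
}

// Possible use, where 'storedValue' was read from cfg.Configuration:
//   string luceneDotNetIndexDirectory = DirectoryDefaults.Resolve(storedValue, "LuceneDotNetIndex");
```

With this in place, the UPDATE statements would only need to run for keys whose stored value was empty, so a reset no longer overwrites directories that were set deliberately.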

Thanks,
Mike 

Top 10 Contributor
1,905 Posts

This code is only executed when you use the helper code to reset the DB, which you likely won't be doing in a production environment...  of course...

Top 75 Contributor
7 Posts
jrief replied on Fri, Jun 26 2015 2:04 AM

Got it.   I'm still researching our solution using your tool and want to reset the directories at times.  I added a method to get the values from the db in order to reset them.

The crawler is working with remote directories, but...

The web project, which I'm using to do searches, fails to open the Lucene index because, apparently, the Lucene.Net.Store.FSDirectory class cannot "see" the remote directory. Even mapping the remote directory does not work.

"Google" tells me to go with Solr, the distributed Lucene solution.  Are there any other solutions?

Top 10 Contributor
1,905 Posts

I didn't have any problems using a remote directory for crawling and searching.

My setup: a Windows 7 x64 VM with \\VM\C$ shared as C, mapped with net use Z: \\VM\C, and Z:\Index used as the LuceneDotNetIndexDirectory.

This is probably a permissions problem with IIS - ensure the account in the AppPool has access to the remote share.
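A quick way to narrow this down is to run a small probe under the same account as the web application. The sketch below is hypothetical (IndexDirectoryProbe is not part of arachnode.net or Lucene.NET) and only reports whether the path is visible and listable. Note that drive letters created with net use generally belong to the logon session that created them, so an IIS worker process may not see Z: at all; a UNC path such as \\VM\C\Index avoids that.

```csharp
using System;
using System.IO;
using System.Linq;

public static class IndexDirectoryProbe
{
    // Hypothetical diagnostic: describes whether the current account can see
    // and list the given directory.
    public static string Describe(string path)
    {
        if (string.IsNullOrWhiteSpace(path))
        {
            return "no path configured";
        }

        try
        {
            // Directory.Exists returns false both when the directory is missing
            // and when the caller lacks permission to read it.
            if (!Directory.Exists(path))
            {
                return "not found or not accessible: " + path;
            }

            int entryCount = Directory.EnumerateFileSystemEntries(path).Count();
            return "ok: " + entryCount + " entries in " + path;
        }
        catch (UnauthorizedAccessException ex)
        {
            return "access denied: " + ex.Message;
        }
        catch (IOException ex)
        {
            return "I/O error: " + ex.Message;
        }
    }
}

// Possible use from a test page in the web project:
//   string report = IndexDirectoryProbe.Describe(@"\\VM\C\Index");
```

If the probe reports "ok" from a console session but "not found or not accessible" from the web project, the AppPool account or a session-specific drive mapping is the likely cause.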

Lucene.NET will perform much faster with a local directory.

Also, if you are just tallying words in web pages, Lucene.NET might be too much additional overhead in terms of things to manage - take a look at the Documents table - you can store whatever you want in there and it is tied to the WebPages table as a first class citizen.

Thanks,
Mike


copyright 2004-2017, arachnode.net LLC