Ok so I have done a crawl and have like 4000 hosts, 2500 domains, and 7600 web pages in the db.
When running the web/search.aspx page to test if that's working, I get back the following error doing a search:
no segments* file found in Lucene.Net.Store.FSDirectory@C:\AppWorkspace\arachnode.net\Console\LuceneDotNetIndex: files:
I do not have any files in that folder, but of course I *DO* have them in the C:\AppWorkspace\arachnode.net\source\Console\LuceneDotNetIndex folder, including a segments.gen and a segments_1 file.
So which config entry is incorrect that I need to modify for this? An extra ..\ somewhere? I just pulled down what is in svn.
The error you've encountered means that lucene.net isn't pointed to a valid lucene.net index.
I believe you're right in pointing to the second path mentioned above.
My test lucene.net index is at S:\LuceneDotNetIndex.
There are two locations in arachnode.net that need to be set for lucene.net to function properly.
1.) In CrawlActions.config. This location tells ManageLuceneDotNetIndexes.cs where the index is located, or should be located. The current code in SVN has a relative path listed, and this should function as downloaded. Mine is changed because I keep the indexes on a separate volume.
2.) In Web.config. This configuration setting enables Search.aspx, et al., to return search results. I need to add Server.MapPath, or a friendly exception, or something to make it a bit clearer what you need to do to enable searching (and remove the incorrect hardcoded path).
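For illustration, this is the general shape of the Web.config entry for the index location. The key name "LuceneDotNetIndexDirectory" is a guess for illustration only; check the actual appSettings key in the SVN copy:

```xml
<!-- hypothetical key name - check the actual Web.config in SVN for the real one -->
<appSettings>
  <add key="LuceneDotNetIndexDirectory"
       value="C:\AppWorkspace\arachnode.net\source\Console\LuceneDotNetIndex" />
</appSettings>
```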
Let me know if this doesn't work for you.
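As an aside, the "no segments* file found" error just means Lucene.Net was handed a directory that doesn't contain the segments* files every index has. Here's a minimal search sketch against a valid index, using the Lucene.Net 2.x-era API; the index path and the field names "text" and "absoluteuri" are assumptions for illustration, not necessarily what arachnode.net uses:

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

class SearchSketch
{
    static void Main()
    {
        // This must point at a directory that actually contains segments* files,
        // otherwise IndexSearcher throws the "no segments* file found" error.
        string indexPath = @"C:\AppWorkspace\arachnode.net\source\Console\LuceneDotNetIndex";

        IndexSearcher searcher = new IndexSearcher(indexPath);

        // "text" is an assumed field name for the extracted WebPage text.
        QueryParser parser = new QueryParser("text", new StandardAnalyzer());
        Hits hits = searcher.Search(parser.Parse("arachnode"));

        for (int i = 0; i < hits.Length(); i++)
        {
            // "absoluteuri" is likewise an assumed stored field.
            Console.WriteLine(hits.Doc(i).Get("absoluteuri"));
        }

        searcher.Close();
    }
}
```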
Super let me make these tweaks and see if I can get the search stuff working.
So, can you explain a bit what role arachnode is playing and what role lucene.net is playing? Are the indexes and other data created in the LuceneDotNetIndex folder dependent on the arachnode data being stored? Or are there two separate sets of data being created during the crawl: the db stuff for reporting and such, and the lucene stuff for searching?
Thanks much for laying this out. Look forward to making these tweaks and seeing what's what.
I'm working on a response to your post at http://arachnode.net/forums/t/94.aspx that hopefully will fully explain the association between arachnode.net and lucene.net.
But, while I'm on this page...
1.) Are the indexes and other data created in the LuceneDotNetIndex folder dependent on the arachnode data being stored?
Somewhat. The option currently exists to not submit every content type except WebPages to the database. I have a feature slated to choose whether content discovered by arachnode.net is stored on disk, in the database, both, or not at all. So, you could use arachnode.net to crawl, and keep state, but not store any content. Or, you could duplicate the content in triplicate if you wanted to.
The settings *insert* are how you elect to submit or not submit content to the database. Content types can be parsed (extracted) for use by plugins, but your particular application may not require them to be stored. Certain settings are required when crawling to enable the lucene.net functionality. For example: 'extractWebPageMetaData' must be enabled, as this creates the on-disk content locations and data, and also extracts the text of the WebPage for lucene.net indexing. But, you don't have to insert the WebPage meta data. I'd like to add a check when crawling to verify your current crawling configuration, as certain combinations of configuration settings are nonsensical.
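To make the extract-vs-insert distinction concrete, here's a sketch of what such settings might look like. The key names below are illustrative assumptions based on this post, not copied from the actual Application.config in SVN:

```xml
<!-- illustrative sketch only - key names are assumptions, not the real config -->
<appSettings>
  <!-- must be true for lucene.net indexing: extracts the WebPage text and
       creates the on-disk content locations -->
  <add key="extractWebPageMetaData" value="true" />
  <!-- extraction can be on while database insertion stays off -->
  <add key="insertWebPageMetaData" value="false" />
</appSettings>
```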
2.) Or is it two separate sets of data being created during the crawl, the db stuff for reporting and such and the lucene stuff for searching?
This is also correct. The data in the database is primarily for reporting and data mining, etc. Yet, you can search the data in the database with SQL Full Text Indexing. Full Text Indexing is enabled in 4 locations and arachnode.net keeps track of the proper full text index types so you can use IFilters to search .pdf's, .mp3's, etc. Look at the table 'DataTypes' to learn more about mappings between Content-Type (HTML Response Header) and FullTextIndexType.
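For reference, searching the database side with SQL Full Text Indexing from C# might look something like this. The connection string, table name, and column names are assumptions based on this post, not the actual arachnode.net schema:

```csharp
using System;
using System.Data.SqlClient;

class FullTextSearchSketch
{
    static void Main()
    {
        // connection string, table and column names are assumptions
        using (SqlConnection connection = new SqlConnection(
            "Server=.;Database=arachnode;Integrated Security=true;"))
        {
            connection.Open();

            // CONTAINS uses the full text index; with the right IFilters
            // installed this can match inside .pdfs, .mp3 tags, etc.
            SqlCommand command = new SqlCommand(
                "SELECT AbsoluteUri FROM WebPages WHERE CONTAINS(Source, @term)",
                connection);
            command.Parameters.AddWithValue("@term", "arachnode");

            using (SqlDataReader reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    Console.WriteLine(reader.GetString(0));
                }
            }
        }
    }
}
```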
Full Text Catalog Locations:
Hmmm, you said "For example: 'extractWebPageMetaData' must be enabled, as this creates the on-disk content locations and data and also extracts the text of the WebPage for lucene.net indexing."
I think the version I pulled down had that disabled but I don't remember for sure. I'll check that too as part of making sure I've got everything configured correctly for searching via lucene.
I may double-check the FTI stuff too and play with what kind of search results it returns. Not much control over that though.
It should be enabled. If it isn't then that would definitely be a reason for the lucene.net integration not working.
Look at Arachnode.SiteCrawler.Managers.WebPageManager to see what the property enables.
Here's the latest copy of Application.config: http://arachnodenet.svn.sourceforge.net/viewvc/arachnodenet/source/Configuration/Application.config?revision=55&view=markup
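This is not the actual WebPageManager code, but a sketch of the kind of work 'extractWebPageMetaData' gates: reducing a WebPage's HTML to plain text so it can be handed to the lucene.net indexer:

```csharp
using System;
using System.Text.RegularExpressions;

class MetaDataExtractionSketch
{
    // Illustrative only - the real extraction lives in
    // Arachnode.SiteCrawler.Managers.WebPageManager.
    static string ExtractText(string html)
    {
        // strip script/style bodies, then remaining tags, then collapse whitespace
        string text = Regex.Replace(html, @"<(script|style)[^>]*>.*?</\1>", " ",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
        text = Regex.Replace(text, @"<[^>]+>", " ");
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    static void Main()
    {
        // prints: Hello world
        Console.WriteLine(ExtractText("<html><body><h1>Hello</h1> <p>world</p></body></html>"));
    }
}
```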
It's too bad that SQL FTI isn't better at what it does. While it is quite good at transparently managing its indexes, as you say, there isn't much control over the indexes and the results. The synonym functionality is helpful but limited in that it isn't possible to have more than one synonym roll up to more than one parent word - but, that's a completely different forum post altogether.
You are correct, the webPageMetaData setting is enabled, so I'm good there. I changed the lucene index directory in both config files to a hard-coded path to see if that gets the search working. It looks like it's updating the files in that folder (doing a crawl now), so I'll wait for it to finish and then see if things work.
Question: I assume I need to let the crawl complete and not kill it via Visual Studio otherwise the files may not get updated properly and the search might bomb. Correct?
Oh yes, another quick question. There isn't already any code in the build that allows console or web entry of a new site to crawl, correct? I assume there's just sample code in the console app, and I could tweak that to build a console or web submitter.
Which leads me to another (final) question for now: Ideally it might be nice for submissions to be able to go to a holding queue where they can be reviewed before being actually crawled. For example, a site may be building a search engine for a specific topic or set of sites and they'd want to limit crawls. I know, this would lead to any discovered domains needing to be queued for review as well which could become a large list.
That got it! Hard-coding the path to the lucene dot net index directory seems to have gotten it working. I'm thinking however that these lucene data files are completely re-created every time I do a crawl. Is this correct?
Clearing out the db via the SP and doing another run to check things out more...
All great questions!!! I'll answer these when I get home after band practice!!! :)
What do you play?
Guitar, bass, drums, keyboards. In this band I'm playing guitar. My wife is the singer. :D
I haven't had any problems interrupting a crawl. Lucene.net seems to do a good job maintaining state. You can start and stop crawls at will and you should still be able to search over the indexes.
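The reason interrupted crawls don't wipe the index is that it gets opened in append mode rather than rebuilt. A sketch of the pattern (Lucene.Net 2.x-era API; the path and field names are assumptions, not arachnode.net's actual indexing code):

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;

class IncrementalIndexSketch
{
    static void Main()
    {
        string indexPath = @"S:\LuceneDotNetIndex";

        // create: false -> open the existing index and add to it,
        // so each crawl appends new segments instead of rebuilding
        IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);

        Document document = new Document();
        document.Add(new Field("absoluteuri", "http://arachnode.net/",
            Field.Store.YES, Field.Index.UN_TOKENIZED));
        document.Add(new Field("text", "example web page text",
            Field.Store.NO, Field.Index.TOKENIZED));
        writer.AddDocument(document);

        // Close commits the new segment; the segments* files stay consistent
        writer.Close();
    }
}
```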
If you want to insert CrawlRequests to the database from an API, create an instance of ArachnodeDAO.cs. Use the method: InsertCrawlRequest. This will insert CrawlRequests directly into the CrawlRequests database table.
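A sketch of that call; the exact InsertCrawlRequest parameter list here is an assumption, so check ArachnodeDAO.cs in SVN for the real signature:

```csharp
using Arachnode.DataAccess;

class SubmitCrawlRequestSketch
{
    static void Main()
    {
        ArachnodeDAO arachnodeDAO = new ArachnodeDAO();

        // Hypothetical parameter list - the real method may also take
        // depth, priority, etc. This inserts a row directly into the
        // CrawlRequests table.
        arachnodeDAO.InsertCrawlRequest("http://arachnode.net/");
    }
}
```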
You are right about priority in the database. Look at the stored procedure: dbo.arachnode_omsp_CrawlRequests_SELECT. That would be the best place to perform initial filtering. I plan to elaborate on part of what you've touched on in a post about ordered crawling.