Firstly, just wondering how full-text indexing is implemented: where is the text crawled from HTML pages stored in the DB? After crawling successfully, I couldn't find which tables the actual content text is stored in.
Secondly, the features state that storage can be "to SQL 2005/2008 and/or to disk", so I'm wondering how this works (which parts of the code handle it). If I don't store to the DB and just use files, how is indexing done? It would be helpful to know more about this architecture.
Hello there! We don't get many questions about how SQL Full-Text Indexing (FTI) is implemented, so I am glad you asked. And, for anyone who doesn't know, FTI in SQL 2008 is vastly improved over 2005 and worth a second look.
The code in [Console]\Program.cs and the database table cfg.Configuration instruct the crawler NOT to store the WebPage Source in SQL, but rather to store it on disk. (The configuration shown below is slightly different from the default, per my own project requirements.)
ApplicationSettings.InsertWebPageSource = false;
ApplicationSettings.SaveDiscoveredWebPagesToDisk is the to-DISK complement of the to-SQL setting.
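Taken together, the two settings route the raw page source to disk instead of SQL. A minimal sketch of that configuration (the property names are the ones shown above; everything else about the surrounding setup is assumed):

```csharp
// Don't persist the raw WebPage Source bytes into the SQL WebPages table...
ApplicationSettings.InsertWebPageSource = false;

// ...instead, write each discovered WebPage's source to the file system,
// where the on-disk (Lucene.NET) indexing plugin can pick it up.
ApplicationSettings.SaveDiscoveredWebPagesToDisk = true;
```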
If I set ApplicationSettings.InsertWebPageSource = true;, crawl for a bit, and then examine the WebPages table, I get this:
Where the Source column contained only '0x' before the configuration change, the varbinary Source data is now present.
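One quick way to confirm this from a query window (a sketch: the dbo schema and exact column names are assumptions based on the table shown above):

```sql
-- Rows with more than the empty '0x' varbinary marker have real
-- page source stored in SQL.
SELECT TOP 10 AbsoluteUri, DATALENGTH(Source) AS SourceBytes
FROM dbo.WebPages
ORDER BY SourceBytes DESC;
```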
Next, examine the Full Text Index catalogs.
Examine the properties of the arachnode.net_ftc_WebPages catalog.
By default, the AbsoluteUri and ResponseHeaders columns are indexed. Once data is added to the Source column, the Source column becomes searchable as well.
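For illustration, adding Source to the full-text index and then querying it might look like this. This is a sketch: the table name comes from the screenshots above, but the TYPE COLUMN name (FileExtension) is hypothetical, and varbinary columns require such a column so SQL knows which iFilter to apply:

```sql
-- Add the varbinary Source column to the existing full-text index.
-- 'FileExtension' is a hypothetical column holding e.g. '.html'.
ALTER FULLTEXT INDEX ON dbo.WebPages
    ADD (Source TYPE COLUMN FileExtension);

-- Query the newly searchable source text.
SELECT AbsoluteUri
FROM dbo.WebPages
WHERE CONTAINS(Source, 'arachnode');
```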
The default configuration is to index to disk, which is accomplished by [Plugins]\ManageLuceneDotNetIndexes.cs.
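This is not the plugin's actual code, but the core of indexing to disk with Lucene.NET (2.x-era API) looks roughly like this; the field names mirroring the SQL columns, the index path, and the sample text are all assumptions:

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

// Open (or create) an on-disk Lucene index directory.
var directory = FSDirectory.Open(new System.IO.DirectoryInfo(@"C:\LuceneIndex"));
var writer = new IndexWriter(directory, new StandardAnalyzer(), true,
                             IndexWriter.MaxFieldLength.UNLIMITED);

// Index one crawled page: store the URI verbatim, analyze the page text.
var document = new Document();
document.Add(new Field("absoluteuri", "http://arachnode.net/",
                       Field.Store.YES, Field.Index.NOT_ANALYZED));
document.Add(new Field("text", "page text extracted from the source...",
                       Field.Store.NO, Field.Index.ANALYZED));
writer.AddDocument(document);

writer.Commit();
writer.Close();
```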
Helper code in [Console]\Program.cs may have set your on-disk indexing directories:
Also, the Luke index toolbox is included in the source.
Worth reading: http://arachnode.net/blogs/arachnode_net/archive/2010/04/08/adding-custom-fields-to-the-lucene-net-indexes.aspx
This article details how plugins (ManageLuceneDotNetIndexes.cs) are used to extend functionality: http://arachnode.net/Content/CreatingPlugins.aspx
Does this answer your question?
That's great info, very helpful, thanks!
So I can use either Lucene.Net or SQL FTS out of the box.
Just wondering then, in your opinion, which is the better method in terms of query performance, index/data maintenance, and flexibility? What was the main reason for using Lucene.Net as the default configuration instead of SQL FTS?
From your links, Lucene looks to be very powerful. Would this be the large-scale option, whereas FTS in SQL would be better/simpler for smaller projects?
You can use both, if you are so inclined.
I have found that FTI is slightly faster for query performance when indexing large volumes of text, although I run machines with a large amount of RAM and very fast disk arrays, so I may not be hitting any performance ceilings.
Anything that Lucene.NET can do, you can do with SQL, and the debate over which is better at date-range queries and all of the other querying options could fill volumes.
iFilters are 'neat', but they need to be in place when docs are added; they won't retroactively re-index older docs. This may have changed in 2008.
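If you do install an iFilter after the fact, one approach is to force a repopulation yourself so older documents get re-cracked with the new filter (standard T-SQL; the table name is taken from above):

```sql
-- Kick off a full repopulation of the full-text index so documents
-- added before the iFilter was installed get re-indexed.
ALTER FULLTEXT INDEX ON dbo.WebPages START FULL POPULATION;
```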
I find FTI much easier to manage, especially since it has a GUI. Lucene takes progressively longer to incorporate new documents as the number of documents already in the index grows, as a result of its 'undo' buffer; FTI doesn't. It took quite a bit of time to figure out the right settings for index merging and how to support continuous crawling and indexing (something I gave up on with Nutch), and with the last update quite a bit of functionality changed, so I don't know that Lucene.NET has really gelled just yet. Both suffer from index rebuilds when you need to add a field. I trust the indexing (actual SQL indexes, clustered and non-clustered) for sorting performance more in FTI.
My preference would be to use FTI exclusively, if MS provided a better 'under the hood' way of creating distributed nodes, like UNIONs across servers, without requiring stored procedure modification or explicit table modification (GUIDs) for replication. But many, many people use Lucene.NET; it was widely adopted because FTI wasn't really up to par until 2008. In 2008 the entire engine was reworked, and MS did a great job.
Lucene is easily modified to use multi-segment searchers... about as much work as it would take to configure your SPs to use linked servers.
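The linked-server approach on the SQL side amounts to a UNION across nodes. A hypothetical sketch (NODE1/NODE2 are made-up linked servers, and OPENQUERY is used because full-text predicates can't run directly against remote four-part names):

```sql
-- Fan a full-text query out across two crawler nodes via linked
-- servers and merge the results on the querying server.
SELECT AbsoluteUri FROM OPENQUERY(NODE1,
    'SELECT AbsoluteUri FROM arachnode.dbo.WebPages
     WHERE CONTAINS(Source, ''arachnode'')')
UNION ALL
SELECT AbsoluteUri FROM OPENQUERY(NODE2,
    'SELECT AbsoluteUri FROM arachnode.dbo.WebPages
     WHERE CONTAINS(Source, ''arachnode'')');
```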
The main reason for using Lucene.NET is its wider swath of adoption. Also, SQL isn't particularly good at managing varbinary data: rebuilding indexes on 500M rows containing varbinary data just takes way too long. Crawling speed is also faster if you store the source on disk.
AN has a ton of options; some will work for your crawling scenarios and some won't. I have one customer who crawls in the Rackspace Cloud at a rate of 1M pages an hour, and he inserts the WebPage Source. Some of my projects insert the Source and some do not; it just depends on what you want to do. The same is true for any indexing implementation. I have worked with the two options in AN, FAST, SOLR (Lucene), SearchServer, D(mind blank), Autonomy, and have even rolled my own... all problems required different solutions.
Also, I bet that SQL will get a nice big boost in the next rev, as MS purchased FAST.
This is worth reading: http://stackoverflow.com/questions/499247/sql-server-2008-full-text-search-fts-versus-lucene-net
Not bad: http://www.sqlmonster.com/Uwe/Forum.aspx/sql-server-search/2295/SQL-server-FTS-vs-Lucene-NET
Hope this gives you some good info.