arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE, MongoDB, RavenDB, or Hadoop

Completely Open Source @ GitHub


Full-Text Indexing / Storage Question

Not Answered This post has 0 verified answers | 3 Replies | 3 Followers

Top 150 Contributor
3 Posts
infogg posted on Sat, Jul 17 2010 7:44 AM

Hi

 

Firstly, I'm wondering how full-text indexing is implemented: where is the text from crawled HTML pages stored in the database?  After crawling successfully, I couldn't find which tables hold the actual content text.

Secondly, the features state that storage can be "to SQL 2005/2008 and/or to disk", so I'm wondering how this works (which parts of the code).  If I don't store to the DB and just use files, how is indexing done?  It would be helpful to know more about this architecture.

 

thanks

 

John

 

All Replies

Top 10 Contributor
1,905 Posts
arachnode.net replied on Sat, Jul 17 2010 11:28 AM

Hello there!  Not many people ask about how SQL Full-Text Indexing (FTI) is implemented, so I am glad you did.  And, for anyone who doesn't know, FTI in SQL 2008 is vastly improved over 2005 and worth a second look.

The code in [Console]\Program.cs and the database table cfg.Configuration instruct the crawler to NOT store the WebPage Source in SQL, but rather to store it on disk.  (The configuration shown below is slightly different than the default, per my own project requirements.)

ApplicationSettings.InsertWebPageSource = false;

ApplicationSettings.SaveDiscoveredWebPagesToDisk is the SQL-to-disk complement.
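Putting the two settings together, a minimal configuration sketch (the two property names are taken from this thread; the comments and the combination shown are illustrative, not verbatim from [Console]\Program.cs):

```csharp
// Keep the WebPage Source OUT of the SQL WebPages table...
ApplicationSettings.InsertWebPageSource = false;

// ...and write discovered WebPages to disk instead.
ApplicationSettings.SaveDiscoveredWebPagesToDisk = true;

// To make the Source column searchable via SQL FTI instead,
// flip the first flag:
// ApplicationSettings.InsertWebPageSource = true;
```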

If I set ApplicationSettings.InsertWebPageSource = true;, crawl for a bit, and then examine the WebPages table, the change is visible: where the Source column was '0x' before the configuration change, varbinary Source data is now present.

Next, examine the Full Text Index catalogs.

Examine the properties of the arachnode.net_ftc_WebPages catalog.

By default, the AbsoluteUri and ResponseHeaders columns are indexed.  Once data is added to the Source column, it becomes searchable as well.
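As a sketch of what that enables, once the Source column is populated and part of the arachnode.net_ftc_WebPages catalog, a standard T-SQL full-text query works against it (table and column names are from this thread; the exact query is illustrative, not from the project):

```sql
-- Hypothetical full-text search over crawled page source.
SELECT TOP 10 AbsoluteUri
FROM dbo.WebPages
WHERE CONTAINS(Source, 'lucene');
```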

The default configuration is to index to disk, which is accomplished by [Plugins]\ManageLuceneDotNetIndexes.cs.

Helper code in [Console]\Program.cs may have set your on-disk indexing directories.

Also, the LUKE toolbox is included in the source.

Worth reading: http://arachnode.net/blogs/arachnode_net/archive/2010/04/08/adding-custom-fields-to-the-lucene-net-indexes.aspx

This details how plugins (ManageLuceneDotNetIndexes.cs) are used to extend functionality: http://arachnode.net/Content/CreatingPlugins.aspx

Tags: http://arachnode.net/tags/lucene.net/default.aspx

Tags: http://arachnode.net/tags/plugins/default.aspx

Does this answer your question?

Sincerely,
Mike

 

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 150 Contributor
3 Posts
infogg replied on Sat, Jul 17 2010 1:39 PM

Hi

 

That's great info, very helpful, thanks!

 

So I can use either Lucene.Net or SQL FTS out of the box.

 

Just wondering then: in your opinion, which is the better method in terms of query performance, maintaining the indexes/data, and flexibility?  What was the main reason for using Lucene.Net as the default configuration instead of SQL FTS?

 

From your links, Lucene looks to be very powerful.  Is it that Lucene would be the large-scale option, whereas FTS in SQL would be better / simpler for smaller projects?

 

thanks

 

John

 

Top 10 Contributor
1,905 Posts

You can use both, if you are so inclined.

I have found that FTI is slightly faster for query performance when indexing large volumes of text, although I run machines with a large amount of RAM and very fast disk arrays, so I may not be hitting any performance peaks.

Anything that Lucene.NET can do, you can do with SQL, and the debate on which is better at date range queries and all of the other querying options could fill volumes.

iFilters are 'neat', but they need to be in place when docs are added; they won't retroactively re-index older docs.  This may have changed in 2008.

I find FTI much easier to manage, especially since it has a GUI.  Lucene takes progressively longer to incorporate new documents as the number of documents already in the index grows, a result of its 'undo' buffer.  FTI doesn't.  It took quite a bit of time to figure out the right settings for index merging and how to support continuous crawling and indexing (something I gave up on with Nutch), and with the last update quite a bit of functionality changed, so I don't know that Lucene.NET has really gelled just yet.  Both suffer from index rebuilds when you need to add a field.  I trust the indexing (actual SQL indexes, clustered, non-clustered) for sorting performance more in FTI.

My preference would be to use FTI exclusively, if MS would provide a better 'under the hood' way of creating distributed nodes, like UNIONs across servers without requiring stored procedure modification or explicit table modification (GUIDs) for replication.  But many, many people use Lucene.NET, having adopted it widely because FTI wasn't really up to par until 2008.  In 2008 the entire engine was reworked, and MS did a great job.

Lucene is easily modified to use multi-segment searchers... about as much work as it would take to configure your stored procedures to use linked servers.
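For reference, a multi-segment search in the Lucene.NET of that era (2.9-style API) looks roughly like this; the index paths and field name are hypothetical, and the exact API has shifted between releases:

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

// One read-only IndexSearcher per on-disk index (paths are hypothetical).
Searchable[] segments = new Searchable[]
{
    new IndexSearcher(FSDirectory.Open(new System.IO.DirectoryInfo(@"C:\Indexes\Node1")), true),
    new IndexSearcher(FSDirectory.Open(new System.IO.DirectoryInfo(@"C:\Indexes\Node2")), true)
};

// MultiSearcher fans each query out across all of the underlying indexes
// and merges the results, much like a UNION across linked SQL servers.
MultiSearcher multiSearcher = new MultiSearcher(segments);

Query query = new QueryParser(Version.LUCENE_29, "text",
    new StandardAnalyzer(Version.LUCENE_29)).Parse("crawler");
TopDocs topDocs = multiSearcher.Search(query, 10);
```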

The main reason for using Lucene.NET is its wider swath of adoption, and SQL isn't particularly good at managing varbinary data with regard to the amount of time it takes to rebuild indexes when you have 500M rows containing varbinary data.  It just takes way too long to reindex/rebuild.  Crawling speed is faster if you store the source on disk.

AN has a ton of options, and some options will work for your crawling scenarios, and some won't.  I have one customer who crawls in Rackspace Cloud, at a rate of 1M pages an hour, and he inserts the WebPage Source.  Some of my projects insert the Source, and some do not.  It just depends on what you want to do.  The same is true for any indexing implementation.  I have worked with the two options in AN, FAST, SOLR (Lucene), SearchServer, D(mind blank), Autonomy, and have even rolled my own... all problems required different solutions.

Also, I bet that SQL will get a nice big boost in the next rev, as MS purchased FAST.

This is worth reading: http://stackoverflow.com/questions/499247/sql-server-2008-full-text-search-fts-versus-lucene-net

Not bad: http://www.sqlmonster.com/Uwe/Forum.aspx/sql-server-search/2295/SQL-server-FTS-vs-Lucene-NET

Hope this gives you some good info.

Mike

 



copyright 2004-2017, arachnode.net LLC