<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://arachnode.net/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>Search results matching tags 'LUKE' and 'lucene.net'</title><link>http://arachnode.net/search/SearchResults.aspx?a=0&amp;o=DateDescending&amp;tag=LUKE,lucene.net&amp;orTags=0</link><description>Search results matching tags 'LUKE' and 'lucene.net'</description><dc:language>en-US</dc:language><generator>CommunityServer 2008.5 SP1 (Debug Build: 31106.3070)</generator><item><title>Re: Full-Text Indexing / Storage Question</title><link>http://arachnode.net/forums/p/1385/12802.aspx#12802</link><pubDate>Sat, 17 Jul 2010 18:28:41 GMT</pubDate><guid isPermaLink="false">a2478770-777f-41ab-83b8-a21ff47ebb1f:12802</guid><dc:creator>arachnode.net</dc:creator><description>&lt;p&gt;Hello there!&amp;nbsp; Not many questions about how SQL Full Text Indexing (FTI) is implemented, so I am glad you asked.&amp;nbsp; And, for anyone that doesn&amp;#39;t know, FTI in SQL 2008 is vastly improved over 2005 and worth a second look.&lt;/p&gt;
&lt;p&gt;The code in [Console]\Program.cs and the database table cfg.Configuration instruct the crawler NOT to store the WebPage Source in SQL, but rather to store it on disk.&amp;nbsp; (The configuration shown below differs slightly from the default, per my own project requirements.)&lt;/p&gt;
&lt;p&gt;ApplicationSettings.InsertWebPageSource = false;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://arachnode.net/cfs-file.ashx/__key/CommunityServer.Discussions.Components.Files/7/7532.iwps.PNG"&gt;&lt;img src="http://arachnode.net/resized-image.ashx/__size/550x0/__key/CommunityServer.Discussions.Components.Files/7/7532.iwps.PNG" border="0" alt="" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;ApplicationSettings.SaveDiscoveredWebPagesToDisk is the on-disk complement to ApplicationSettings.InsertWebPageSource.&lt;/p&gt;
&lt;p&gt;If I set ApplicationSettings.InsertWebPageSource = true, crawl for a bit, and then examine the WebPages table, I get this:&lt;/p&gt;
&lt;p&gt;&lt;a href="http://arachnode.net/cfs-file.ashx/__key/CommunityServer.Discussions.Components.Files/7/3247.HasSource.PNG"&gt;&lt;img src="http://arachnode.net/resized-image.ashx/__size/550x0/__key/CommunityServer.Discussions.Components.Files/7/3247.HasSource.PNG" border="0" alt="" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Before the configuration change, the Source column held &amp;#39;0x&amp;#39; (an empty varbinary value); now the varbinary Source is present.&lt;/p&gt;
&lt;p&gt;Next, examine the Full Text Index catalogs.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://arachnode.net/cfs-file.ashx/__key/CommunityServer.Discussions.Components.Files/7/3666.whichtables.PNG"&gt;&lt;img src="http://arachnode.net/resized-image.ashx/__size/550x0/__key/CommunityServer.Discussions.Components.Files/7/3666.whichtables.PNG" border="0" alt="" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Examine the properties of the arachnode.net_ftc_WebPages catalog.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://arachnode.net/cfs-file.ashx/__key/CommunityServer.Discussions.Components.Files/7/2063.props.PNG"&gt;&lt;img src="http://arachnode.net/resized-image.ashx/__size/550x0/__key/CommunityServer.Discussions.Components.Files/7/2063.props.PNG" border="0" alt="" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;By default, the AbsoluteUri and ResponseHeaders columns are indexed.&amp;nbsp; With data now stored in the Source column, that column becomes searchable as well.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://arachnode.net/cfs-file.ashx/__key/CommunityServer.Discussions.Components.Files/7/2642.results.PNG"&gt;&lt;img src="http://arachnode.net/resized-image.ashx/__size/550x0/__key/CommunityServer.Discussions.Components.Files/7/2642.results.PNG" border="0" alt="" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The default configuration is to index to disk, which is accomplished by [Plugins]\ManageLuceneDotNetIndexes.cs.&lt;/p&gt;
&lt;p&gt;Helper code in [Console]\Program.cs may have set your on-disk indexing directories:&lt;/p&gt;
&lt;p&gt;&lt;a href="http://arachnode.net/cfs-file.ashx/__key/CommunityServer.Discussions.Components.Files/7/0511.whersavedto.PNG"&gt;&lt;img src="http://arachnode.net/resized-image.ashx/__size/550x0/__key/CommunityServer.Discussions.Components.Files/7/0511.whersavedto.PNG" border="0" alt="" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Also, the LUKE toolbox is included in the source.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://arachnode.net/cfs-file.ashx/__key/CommunityServer.Discussions.Components.Files/7/2148.luke.PNG"&gt;&lt;img src="http://arachnode.net/resized-image.ashx/__size/550x0/__key/CommunityServer.Discussions.Components.Files/7/2148.luke.PNG" border="0" alt="" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href="http://arachnode.net/cfs-file.ashx/__key/CommunityServer.Discussions.Components.Files/7/2161.moreluke.PNG"&gt;&lt;img src="http://arachnode.net/resized-image.ashx/__size/550x0/__key/CommunityServer.Discussions.Components.Files/7/2161.moreluke.PNG" border="0" alt="" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Worth reading: &lt;a href="http://arachnode.net/blogs/arachnode_net/archive/2010/04/08/adding-custom-fields-to-the-lucene-net-indexes.aspx"&gt;http://arachnode.net/blogs/arachnode_net/archive/2010/04/08/adding-custom-fields-to-the-lucene-net-indexes.aspx&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;This details how plugins (ManageLuceneDotNetIndexes.cs) are used to extend functionality: &lt;a href="http://arachnode.net/Content/CreatingPlugins.aspx"&gt;http://arachnode.net/Content/CreatingPlugins.aspx&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Tags: &lt;a href="http://arachnode.net/tags/lucene.net/default.aspx"&gt;http://arachnode.net/tags/lucene.net/default.aspx&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Tags: &lt;a href="http://arachnode.net/tags/plugins/default.aspx"&gt;http://arachnode.net/tags/plugins/default.aspx&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Does this answer your question?&lt;/p&gt;
&lt;p&gt;Sincerely,&lt;br /&gt;Mike&lt;/p&gt;
</description></item></channel></rss>