Hi, I've managed to get everything up and running with VS 2008 and SQL Server 2008.
I did the following:
1.) Ran the console crawler for a few minutes.
2.) Ran search.aspx and tried to search for something.

I'm always getting 0 results.
My DownloadedWebPages folder is 130 MB, but the index folder is only 700 KB.
What is the problem? Shouldn't it work out of the box, or am I missing some configuration?
I changed the default directory configuration by running:
use [arachnode.net]
update dbo.Configuration set Value = 'e:\myindex\index' where [KEY] = 'LuceneDotNetIndexDirectory'
update dbo.Configuration set Value = 'e:\myindex\DownloadedFiles' where [KEY] = 'DownloadedFilesDirectory'
update dbo.Configuration set Value = 'e:\myindex\DownloadedImages' where [KEY] = 'DownloadedImagesDirectory'
update dbo.Configuration set Value = 'e:\myindex\DownloadedWebPages' where [KEY] = 'DownloadedWebPagesDirectory'
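To double-check that the updates took, a query like this (assuming the same dbo.Configuration table layout as above) should list all four directories:

use [arachnode.net]
select [KEY], Value from dbo.Configuration
where [KEY] in ('LuceneDotNetIndexDirectory', 'DownloadedFilesDirectory', 'DownloadedImagesDirectory', 'DownloadedWebPagesDirectory')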
And I've also set luceneDotNetIndexDirectory="e:\trazilica\index" inside of CrawlActions.config
This is the result of dir /s
 Directory of E:\myindex\index

04/10/2009  12:02 PM    <DIR>          .
04/10/2009  12:02 PM    <DIR>          ..
04/10/2009  11:39 AM    <DIR>          CurrentCrawl
04/10/2009  12:02 PM                 0 output.txt
04/10/2009  11:32 AM                20 segments.gen
04/10/2009  11:32 AM                20 segments_1
04/10/2009  11:30 AM    <DIR>          Temp
               3 File(s)             40 bytes

 Directory of E:\myindex\index\CurrentCrawl

04/10/2009  11:39 AM    <DIR>          .
04/10/2009  11:39 AM    <DIR>          ..
04/10/2009  11:32 AM                20 segments.gen
04/10/2009  11:32 AM                20 segments_1
04/10/2009  11:32 AM                 0 write.lock
04/10/2009  11:58 AM           704,512 _0.fdt
04/10/2009  11:58 AM            16,384 _0.fdx
               5 File(s)        720,936 bytes

 Directory of E:\myindex\index\Temp

04/10/2009  11:30 AM    <DIR>          .
04/10/2009  11:30 AM    <DIR>          ..
               0 File(s)              0 bytes

     Total Files Listed:
               8 File(s)        720,976 bytes
               8 Dir(s)  229,438,779,392 bytes free
OK. Good to read that the code is being called. I'll grab a fresh checkout tomorrow and check that I get Lucene results.
The 406 errors pertain to content that isn't allowed - like .mp3 files. Check the DisallowedAbsoluteUris table in the database, if you are curious.
You are welcome! :D
I grabbed a fresh copy of the code from SVN, restored the database and installed via these instructions:
1.) Download the code.
2.) Restore the database backup.
3.) Run the stored procedure '[dbo].[arachnode_usp_arachnode.net_RESET_DATABASE]'
4.) Press F5 on the solution.
And, I have results in the default lucene.net index directory.
Results are automatically flushed to disk per the lucene.net MergeFactor. The method TearDownIndexWriter finalizes the index at the end of the crawl, or when the crawl is shut down.
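As a rough sketch of what "finalizing" means here (this is illustrative Lucene.Net 2.x usage, not arachnode.net's actual TearDownIndexWriter code; the path and analyzer are assumptions):

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;

// Open (or append to) an existing index on disk.
IndexWriter indexWriter = new IndexWriter(@"e:\myindex\index", new StandardAnalyzer(), false);

// Buffered documents are flushed into on-disk segments per the MergeFactor.
indexWriter.SetMergeFactor(10);

Document document = new Document();
document.Add(new Field("text", "hello world", Field.Store.YES, Field.Index.TOKENIZED));
indexWriter.AddDocument(document);

// Without Optimize/Close, the directory can remain a 40-byte
// "segments.gen + segments_1" shell like the listing above.
indexWriter.Optimize();
indexWriter.Close();
```

If the writer is never closed (e.g. the console is killed mid-crawl), the segment files may never be committed, which would explain an essentially empty index directory.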
Next step is to see if AddDocument is being called. Or, try stepping through the PerformAction method. What happens?
For best service when you require assistance:
Double-check the lucene.net index with LUKE: http://www.getopt.org/luke/
How did you close the console application? Did you click the close button or press ctrl-c?
Is the CrawlAction enabled in CrawlActions.config?
Also, did you grab a release or the latest from SVN?
"And I've also set luceneDotNetIndexDirectory="e:\trazilica\index" inside of CrawlActions.config"

This needs to match the other directory settings.
My index is empty, nothing inside. I've tried with LUKE and didn't find anything. I was closing the app with Ctrl-C and also by pressing the Return key.
The CrawlAction is enabled. I also tried with rebuildIndexOnLoad=true. Same thing.
I grabbed the latest code from SVN today (1.1); same thing.
I actually made a mistake here; luceneDotNetIndexDirectory points where it should be...
I'm running out of clues... Is it possible to see a Lucene log somehow?
Actually, with the new code and setting rebuildIndexOnLoad=true, after running the console a 2nd time I get something in e:\myindex\index\Temp.
After I point the web application there and change Global.asax.cs to _indexSearcher = new IndexSearcher(@"E:\myindex\index\Temp\"); (just for testing purposes), I get results back.
The problem is that in the original index folder, e:\myindex\index, I have only two files, segments.gen and segments_1, each 20 bytes...
Is the ManageLuceneDotNetIndexes.cs code being called after each CrawlRequest? Set a breakpoint at the method 'PerformAction'. If the breakpoint doesn't hit, either the CrawlAction isn't enabled in CrawlActions.config or something else is wrong.
Is the CrawlAction enabled? It looks like it is in the SVN browser on SourceForge: http://arachnodenet.svn.sourceforge.net/viewvc/arachnodenet/trunk/Configuration/CrawlActions.config?revision=160&view=markup
Are there any exceptions in the Exceptions table?
Yes, the code is being called. PerformAction was executed as well, and the CrawlAction is fine and set to true.
Nothing interesting in the Exceptions table, only a bunch of:

The remote server returned an error: (406) Not Acceptable.
The remote server returned an error: (404) Not Found.
I'm still not too familiar with the entire codebase, but if I understand things correctly, PerformAction first indexes things into CurrentCrawl, and then at some point MergeCurrentCrawl merges them into the main index. But I see MergeCurrentCrawl executed only once, and inside it I see only _indexWriter.Optimize(); there is no _indexWriter.Flush(); and no _indexWriter.Close();. How does CurrentCrawl then make it into the main index? I mean, if I force an index rebuild by setting rebuildIndexOnLoad="true", I get a good index in the index/Temp folder, but either way the files under index stay empty...
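For reference, merging one index directory into another can be done explicitly in Lucene.Net 2.x roughly like this (a hedged sketch, not MergeCurrentCrawl's actual implementation; the paths and analyzer are assumptions):

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Store;

// Open the CurrentCrawl segments as a source directory.
Directory currentCrawl = FSDirectory.GetDirectory(@"e:\myindex\index\CurrentCrawl", false);

// Open a writer on the main index (create = false to append).
IndexWriter mainIndexWriter = new IndexWriter(@"e:\myindex\index", new StandardAnalyzer(), false);

// AddIndexes merges the source segments into the target index;
// Close commits them so the _0.fdt/_0.fdx data actually lands
// in e:\myindex\index instead of staying only under CurrentCrawl.
mainIndexWriter.AddIndexes(new Directory[] { currentCrawl });
mainIndexWriter.Optimize();
mainIndexWriter.Close();
```

If the merge path never calls Close (or the process is interrupted before it does), the main index would stay at the two 20-byte segments files described above, so this is a plausible place to look.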
btw. Thanks for your help on this :)
I didn't have time to take a look at it yet, but will do that tonight for sure. Will let you know...
(doesn't it always seem like the toughest part of coding is configuring? :D)
hehe very true :)