arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop

Completely Open Source @ GitHub

Not getting any results from Lucene

Answered (Verified) This post has 1 verified answer | 9 Replies | 2 Followers

Top 75 Contributor
6 Posts
astrujic posted on Fri, Apr 10 2009 3:09 AM

Hi, I've managed to get everything up and running with VS 2008 and SQL 2008.

I did the following:

Ran the console crawler for a few minutes.
Ran search.aspx and tried to search for something. I always get 0 results.

My DownloadedWebPages folder is 130 MB, but the index folder is only 700 KB.

What is the problem? Shouldn't it work out of the box, or am I missing some configuration?

I changed the default directory configuration by running:

  use [arachnode.net]
  update dbo.Configuration
  set Value = 'e:\myindex\index'
  where [KEY] = 'LuceneDotNetIndexDirectory'
 
  update dbo.Configuration
  set Value = 'e:\myindex\DownloadedFiles'
  where [KEY] = 'DownloadedFilesDirectory'

  update dbo.Configuration
  set Value = 'e:\myindex\DownloadedImages'
  where [KEY] = 'DownloadedImagesDirectory'
 
  update dbo.Configuration
  set Value = 'e:\myindex\DownloadedWebPages'
  where [KEY] = 'DownloadedWebPagesDirectory'

And I've also set luceneDotNetIndexDirectory="e:\trazilica\index"
inside of CrawlActions.config

 

This is the result of dir /s

 Directory of E:\myindex\index

04/10/2009  12:02 PM    <DIR>          .
04/10/2009  12:02 PM    <DIR>          ..
04/10/2009  11:39 AM    <DIR>          CurrentCrawl
04/10/2009  12:02 PM                 0 output.txt
04/10/2009  11:32 AM                20 segments.gen
04/10/2009  11:32 AM                20 segments_1
04/10/2009  11:30 AM    <DIR>          Temp
               3 File(s)             40 bytes

 Directory of E:\myindex\index\CurrentCrawl

04/10/2009  11:39 AM    <DIR>          .
04/10/2009  11:39 AM    <DIR>          ..
04/10/2009  11:32 AM                20 segments.gen
04/10/2009  11:32 AM                20 segments_1
04/10/2009  11:32 AM                 0 write.lock
04/10/2009  11:58 AM           704,512 _0.fdt
04/10/2009  11:58 AM            16,384 _0.fdx
               5 File(s)        720,936 bytes

 Directory of E:\myindex\index\Temp

04/10/2009  11:30 AM    <DIR>          .
04/10/2009  11:30 AM    <DIR>          ..
               0 File(s)              0 bytes

     Total Files Listed:
               8 File(s)        720,976 bytes
               8 Dir(s)  229,438,779,392 bytes free

All Replies

Top 10 Contributor
1,905 Posts

Double-check the lucene.net index with LUKE: http://www.getopt.org/luke/
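
If LUKE is not at hand, a rough programmatic equivalent is to open the index and count its documents. This is a minimal sketch, assuming the Lucene.Net 2.x API (the same generation as the IndexSearcher(string) constructor used elsewhere in this thread); the class name is hypothetical and the default path is simply the one from the first post.

```csharp
using System;
using Lucene.Net.Index;

class IndexCheck
{
    static void Main(string[] args)
    {
        // Example path; substitute your own LuceneDotNetIndexDirectory value.
        string indexPath = args.Length > 0 ? args[0] : @"e:\myindex\index";

        // IndexExists reports whether a segments file is present in the folder.
        if (!IndexReader.IndexExists(indexPath))
        {
            Console.WriteLine("No Lucene index found at " + indexPath);
            return;
        }

        IndexReader reader = IndexReader.Open(indexPath);
        try
        {
            // Only documents that were committed by a writer are counted here.
            Console.WriteLine("Documents in index: " + reader.NumDocs());
        }
        finally
        {
            reader.Close();
        }
    }
}
```

A count of 0 on a folder that does contain segments files would suggest that documents were added by a writer but never committed.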

How did you close the console application?  Did you click the close button or press ctrl-c?

Is the CrawlAction enabled in CrawlActions.config?

Also, did you grab a release or the latest from SVN?

> And I've also set luceneDotNetIndexDirectory="e:\trazilica\index"
> inside of CrawlActions.config

This needs to match the other directory settings.
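
To make the mismatch concrete, here is a small sketch that compares the two values quoted in the original post. SamePath is a hypothetical helper written for this illustration, not part of arachnode.net; in practice the two strings would be read from dbo.Configuration and CrawlActions.config rather than hard-coded.

```csharp
using System;

class DirectorySettingsCheck
{
    // Hypothetical helper: compares two Windows paths, ignoring case
    // and any trailing backslash.
    static bool SamePath(string a, string b)
    {
        return string.Equals(a.TrimEnd('\\'), b.TrimEnd('\\'),
                             StringComparison.OrdinalIgnoreCase);
    }

    static void Main()
    {
        // Values as quoted in the original post.
        string fromDatabase = @"e:\myindex\index";        // dbo.Configuration
        string fromCrawlActions = @"e:\trazilica\index";  // CrawlActions.config

        if (SamePath(fromDatabase, fromCrawlActions))
        {
            Console.WriteLine("Index directory settings match.");
        }
        else
        {
            Console.WriteLine("Mismatch: " + fromDatabase + " vs. " + fromCrawlActions);
        }
    }
}
```

With the two values as posted, the sketch takes the mismatch branch.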

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 75 Contributor
6 Posts

My index is empty; there is nothing inside. I tried LUKE and didn't find anything. I closed the app with ctrl-c, and also by pressing the return key.

The CrawlAction is enabled. I also tried with rebuildIndexOnLoad=true. Same thing.

I grabbed the latest code from SVN today (1.1); same thing.

I actually made a mistake in my post: luceneDotNetIndexDirectory does point where it should...

I'm running out of clues... Is it possible to see a Lucene log somehow?

Top 75 Contributor
6 Posts

Actually, with the new code and rebuildIndexOnLoad=true, after running the console a second time I get something in e:\myindex\index\temp

After I point the web application at that folder by changing global.asax.cs to _indexSearcher = new IndexSearcher(@"E:\myindex\index\Temp\"); (just for testing purposes), I get results back.

The problem is that in the original index folder, e:\myindex\index, I have only two files, segments.gen and segments_1, and each is 20 bytes...
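
A quick way to see this state from code is to list the index folder and look for per-segment files and a leftover write.lock. The sketch below uses only System.IO. The interpretation in the comments is an assumption based on how Lucene 2.x-era indexes are generally laid out: per-segment files are named _0.*, _1.*, and so on, and a write.lock that stays behind usually points to a writer that is still open or was not closed cleanly.

```csharp
using System;
using System.IO;

class IndexFolderReport
{
    static void Main(string[] args)
    {
        // Example path from this thread; pass another folder as the first argument.
        string indexPath = args.Length > 0 ? args[0] : @"e:\myindex\index";

        if (!Directory.Exists(indexPath))
        {
            Console.WriteLine("Folder not found: " + indexPath);
            return;
        }

        bool hasLock = false;
        bool hasSegmentData = false;

        foreach (string file in Directory.GetFiles(indexPath))
        {
            FileInfo info = new FileInfo(file);
            Console.WriteLine("{0,12:N0}  {1}", info.Length, info.Name);

            if (info.Name == "write.lock")
            {
                hasLock = true;
            }

            // Assumption: per-segment files start with an underscore (_0.fdt, _0.fdx, ...).
            if (info.Name.StartsWith("_", StringComparison.Ordinal))
            {
                hasSegmentData = true;
            }
        }

        if (!hasSegmentData)
        {
            Console.WriteLine("Only segments files found: the index probably holds no documents.");
        }

        if (hasLock)
        {
            Console.WriteLine("write.lock present: a writer may still be open or was not closed cleanly.");
        }
    }
}
```

Run against the folders in the dir /s listing from the first post, this would report no segment data for e:\myindex\index and a write.lock for the CurrentCrawl subfolder.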

Top 10 Contributor
1,905 Posts

Is the ManageLuceneDotNetIndexes.cs code being called after each CrawlRequest?  Set a breakpoint at the function 'PerformAction'.  If the breakpoint isn't hit, either the CrawlAction isn't enabled in CrawlActions.config or something else is wrong.

Is the CrawlAction enabled?  It looks like it is in the SVN browser on SourceForge: http://arachnodenet.svn.sourceforge.net/viewvc/arachnodenet/trunk/Configuration/CrawlActions.config?revision=160&view=markup 


Are there any exceptions in the Exceptions table?

Top 75 Contributor
6 Posts

Yes, the code is being called and PerformAction was executed as well. The CrawlAction is also fine and set to true.

Nothing interesting in the Exceptions table, only a bunch of:

The remote server returned an error: (406) Not Acceptable.
The remote server returned an error: (404) Not Found.

I'm still not too familiar with the entire code, but if I understand things correctly, PerformAction first indexes things into CurrentCrawl, and then at some point they are merged into the main index by MergeCurrentCrawl. But I see MergeCurrentCrawl executed only once, and inside it I see only _indexWriter.Optimize(); there is no _indexWriter.Flush(); and no _indexWriter.Close();. How then does CurrentCrawl get into the main index? I mean, if I force an index rebuild by setting rebuildIndexOnLoad="true", I get a good index in the index/temp folder, but in any case the files under index stay empty...
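
For reference, merging one on-disk index into another with plain Lucene.Net generally looks like the sketch below. This is not arachnode.net's MergeCurrentCrawl, only an illustration that assumes the Lucene.Net 2.x API (IndexWriter.AddIndexes and FSDirectory.GetDirectory, both of which changed in later releases). The class name is hypothetical and the paths are the ones from this thread.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using FSDirectory = Lucene.Net.Store.FSDirectory;
using LuceneDirectory = Lucene.Net.Store.Directory;

class MergeSketch
{
    static void Main()
    {
        string mainIndexPath = @"e:\myindex\index";
        string currentCrawlPath = @"e:\myindex\index\CurrentCrawl";

        // Open the per-crawl index as the merge source.
        // 'false' means: do not create or wipe the folder, use what is there.
        LuceneDirectory source = FSDirectory.GetDirectory(currentCrawlPath, false);

        // Open the main index for appending ('false' keeps its existing content).
        IndexWriter writer = new IndexWriter(mainIndexPath, new StandardAnalyzer(), false);
        try
        {
            // Copy the segments of the source index into the main index.
            writer.AddIndexes(new LuceneDirectory[] { source });
            writer.Optimize();
        }
        finally
        {
            // Closing the writer commits the merged segments to disk.
            writer.Close();
            source.Close();
        }
    }
}
```

One caveat under the same assumptions: only documents that the CurrentCrawl writer has already committed can be merged, so that writer has to be closed before the merge runs.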

 

btw. Thanks for your help on this :)

Top 10 Contributor
1,905 Posts
Verified by arachnode.net

OK.  Good to read that the code is being called.  I'll grab a fresh checkout tomorrow and check that I get lucene results.

The 406 errors pertain to content that isn't allowed - like .mp3 files.  Check the DisallowedAbsoluteUris table in the database, if you are curious.

More later...

You are welcome!  :D

OK, back.

I grabbed a fresh copy of the code from SVN, restored the database and installed via these instructions:

1.) Download the code.
2.) Restore the database backup.
3.) Run the stored procedure '[dbo].[arachnode_usp_arachnode.net_RESET_DATABASE]'
4.) Press F5 on the solution.

And, I have results in the default lucene.net index directory.

Results are automatically flushed to disk per the lucene.net MergeFactor.  The method TearDownIndexWriter finalizes the index at the end of the crawl, or when the crawl is shut down.
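
To illustrate why the way the console is closed matters, here is a sketch of the general pattern: close the IndexWriter in a finally block and intercept ctrl-c so that block still runs. It is not the TearDownIndexWriter code, and it assumes the Lucene.Net 2.x API. The class name, field name, index path and the loop standing in for a crawl are all hypothetical.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;

class CleanShutdownSketch
{
    // Set by the ctrl-c handler, read by the indexing loop.
    static volatile bool _stopRequested;

    static void Main()
    {
        // Hypothetical scratch path; 'true' creates a new, empty index there.
        IndexWriter writer = new IndexWriter(@"e:\myindex\sketch", new StandardAnalyzer(), true);

        Console.CancelKeyPress += delegate(object sender, ConsoleCancelEventArgs e)
        {
            e.Cancel = true;        // keep the process alive so the finally block runs
            _stopRequested = true;
        };

        try
        {
            // Stand-in for the crawl: add documents until done or interrupted.
            for (int i = 0; i < 1000 && !_stopRequested; i++)
            {
                Document doc = new Document();
                doc.Add(new Field("absoluteuri", "http://example.com/" + i,
                                  Field.Store.YES, Field.Index.UN_TOKENIZED));
                writer.AddDocument(doc);
            }
        }
        finally
        {
            // Without Close(), buffered documents may never reach the index on disk.
            writer.Optimize();
            writer.Close();
        }
    }
}
```

Note that Console.CancelKeyPress covers ctrl-c and ctrl-break only; clicking the window's close button does not raise it, so the writer would not be closed on that path.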

Next step is to see if AddDocument is being called.  Or, try stepping through the PerformAction method.  What happens?

Top 75 Contributor
6 Posts

I didn't have time to take a look at it yet, but will do that tonight for sure. Will let you know...

Top 10 Contributor
1,905 Posts

Great!  Thanks!

(doesn't it always seem like the toughest part of coding is configuring?  :D)

Top 75 Contributor
6 Posts

hehe very true :)


copyright 2004-2017, arachnode.net LLC