arachnode.net
An Open Source C# web crawler with Lucene.NET search, using SQL Server 2008/2012/2014/2016/CE or MongoDB/RavenDB/Hadoop


Lucene index is empty after finishing crawl


victor posted on Thu, Jan 12 2017 2:22 PM

I'm seeing this output at the end of the crawl 

But the Lucene index is still empty (only a few bytes in size), and searching through the Web project always returns 0 rows.

I turned off saving any files to disk (no images, no web pages, no files). Could this be the reason?

Could you advise some steps to investigate this?
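One quick way to check whether the index actually contains any documents is to open it directly with Lucene.NET. A minimal sketch using the Lucene.NET 3.x API; the index path is an assumption — substitute your actual index directory:

```csharp
using System;
using System.IO;
using Lucene.Net.Index;
using Lucene.Net.Store;

class IndexCheck
{
    static void Main()
    {
        // Hypothetical path -- replace with the index directory from your configuration.
        var indexDir = FSDirectory.Open(new DirectoryInfo(@"C:\LuceneDotNetIndex"));

        using (var reader = IndexReader.Open(indexDir, readOnly: true))
        {
            // 0 here means the crawl indexed nothing, regardless of what the
            // console output reported at the end of the crawl.
            Console.WriteLine("Documents in index: " + reader.NumDocs());
        }
    }
}
```

If this reports 0 documents, the problem is on the indexing side; if it reports documents but the Web project still returns 0 rows, the problem is on the search side.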

 


All Replies

victor replied on Thu, Jan 12 2017 4:45 PM

OK, so it seems like Lucene needs to access the downloaded files on disk, and when I disable storing files on disk, Lucene cannot produce the index files. Is that true?

What if I don't want to store images and other documents?

Verified by arachnode.net

Yes — storing blobs of text in the Lucene.NET indexes is inefficient and duplicates data storage.

If/when you wanted to search the content, you would have to extract the WebPage byte[] from a secondary file store (the Lucene.NET index) and copy it to disk so that it could be read by your web hosting process.

The metadata is there in the index, but the backing store wasn't saved, per your configuration selection.

The index is just that, an index.  :)
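The distinction shows up in how fields are added to a Lucene.NET document. A hedged sketch using the Lucene.NET 3.x API; the field names are illustrative, not arachnode.net's actual schema:

```csharp
using Lucene.Net.Documents;

// Illustrative only -- "absoluteuri" and "text" are hypothetical field names.
string pageText = "...extracted page text...";
var doc = new Document();

// Metadata: stored in the index, so it comes back with each search hit.
doc.Add(new Field("absoluteuri", "http://example.com/page.html",
                  Field.Store.YES, Field.Index.NOT_ANALYZED));

// Page text: analyzed so it is searchable, but NOT stored -- the index keeps
// only the inverted terms. Retrieving the original content requires the
// backing store (disk or database), which is why disabling it leaves the
// search results with nothing to display.
doc.Add(new Field("text", pageText, Field.Store.NO, Field.Index.ANALYZED));
```

With `Field.Store.NO`, queries against the "text" field still match, but the hit only carries the stored metadata back to the caller.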

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

victor replied on Thu, Jan 12 2017 5:46 PM

ok, got it :)

And again thanks a lot!


copyright 2004-2017, arachnode.net LLC