arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL 2005/2008/CE

How to reduce storage requirements...

Answered (Verified) This post has 1 verified answer | 14 Replies | 2 Followers

Top 25 Contributor
19 Posts
ptrennum posted on Tue, Jul 5 2011 11:39 AM

I am wondering what options are available to reduce the amount of storage required for saving webpages etc.  Right now after indexing two sites I am at about 7-8GB between saved webpages and Lucene indexes.

I am not storing images or anything else other than the webpages and the lucene indexes.  Basically I want to be able to continue indexing and allowing for searches on my indexes without getting up to huge storage numbers if possible.

Thanks!

Answered (Verified) Verified Answer

Top 10 Contributor
1,696 Posts
Verified by arachnode.net

It does not happen very often :P but I may be wrong on this...

The space 'saved' by packing the page source into the database's 8KB pages (so that, in effect, Size: = Size on disk:) seems to be offset by the information required to store either a '0x0' in the Source column (less space) or a NULL value (more space)...  So, unless you are storing a lot of files that are significantly smaller than the cluster size of the drive, it is likely that SQL Server will take up slightly MORE space than storing the files on disk.  My tests show about a 5% overhead.

(I compared the size of the WebPages directory with the size of the shrunk WebPages FILEGROUP file before and after setting the Source column of the WebPages table to '0x0', and the difference was greater than the size of the WebPages directory.  Setting the Source column to 'NULL' increased the difference.)
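To make the cluster-size point concrete, here is a minimal sketch of how "Size on disk" grows when every file is rounded up to whole clusters. The 4096-byte cluster size and the page sizes are assumptions chosen for illustration, not measurements from this thread, and the model ignores other filesystem details.

```python
import math

def size_on_disk(file_sizes, cluster_size=4096):
    """Bytes allocated when each file occupies a whole number of clusters."""
    return sum(math.ceil(size / cluster_size) * cluster_size for size in file_sizes)

def slack_ratio(file_sizes, cluster_size=4096):
    """Fraction of the allocated space that holds no file data."""
    allocated = size_on_disk(file_sizes, cluster_size)
    if allocated == 0:
        return 0.0
    return (allocated - sum(file_sizes)) / allocated

# Hypothetical page sizes in bytes (assumed values, not real crawl data).
small_pages = [700, 1200, 2500, 3000]
large_pages = [150_000, 320_000, 512_000]

print(f"small pages slack: {slack_ratio(small_pages):.1%}")
print(f"large pages slack: {slack_ratio(large_pages):.1%}")
```

Under these assumptions the slack is large only when most files are smaller than a cluster, which matches the observation that typical webpages gain little from being moved into the database.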

http://msdn.microsoft.com/en-us/library/aa174529(v=sql.80).aspx

What you might want to try is either compressing the Source column or compressing the DownloadedWebPages folder.

http://www.microsoft.com/sqlserver/2008/en/us/compression.aspx
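As a rough illustration of why compression is worth trying on stored page source, the sketch below compresses a synthetic block of repetitive HTML with Python's zlib (DEFLATE). This is not the algorithm Windows folder compression or SQL Server compression uses, and the sample markup is made up, so real crawled pages will usually compress less well than this.

```python
import zlib

def compression_ratio(data: bytes, level: int = 6) -> float:
    """Return compressed size divided by original size (1.0 for empty input)."""
    if not data:
        return 1.0
    return len(zlib.compress(data, level)) / len(data)

# Synthetic markup standing in for a crawled page (hypothetical sample).
row = b'<tr><td class="cell"><a href="/item?id=1">Item</a></td></tr>\n'
page = b"<html><body><table>\n" + row * 500 + b"</table></body></html>"

packed = zlib.compress(page)
assert zlib.decompress(packed) == page  # compression is lossless

print(f"{len(page)} bytes -> {len(packed)} bytes ({compression_ratio(page):.1%})")
```

Running something similar over a sample of your own DownloadedWebPages files would give a more realistic estimate before you enable compression on the whole folder or column.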

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies

Top 10 Contributor
1,696 Posts

What are the distinct sizes of the WebPages directory and the Lucene.NET indexes?

The only thing you can really do to reduce the storage for WebPages is to compress the directory.

You can also set AutoGrow for the DB files to something less than 1GB.  (This will free up space on your drive(s) too...)

Would you please take a screenshot of the Lucene.NET directory?

Top 25 Contributor
19 Posts

The webpages directory is about 6GB and the Lucene.Net indexes are about 1.5GB

I'm not sure what you mean by compress the WebPages directory? 

I will make the autogrow less than 1GB thanks!

I would take a screenshot but I have just restored everything to an older (smaller) version as I completely ran out of space on my server.

I'm trying to figure out why I need the WebPages to be saved?  I tried changing the config to insertwebpages but when I ran a search no results were returned.  As soon as I set it back to save webpages the search results worked again.  Do I have to save webpages in order for the search to work?

Thanks!

Top 10 Contributor
1,696 Posts

Compress: Let Windows compress the contents of the files.  You can always experiment with things like this as well:

http://support.microsoft.com/kb/307987

http://en.wikipedia.org/wiki/Data_cluster

http://support.microsoft.com/kb/140365

Yes, you do need to store the WebPages else Lucene.Net won't have anything to summarize.  The text of the pages isn't stored in Lucene.Net as my testing indicated that searching was much faster when retrieving the WebPage from disk rather than extracting from the index itself.  And, optimizing the index takes significantly longer when you have large amounts of data, as you would expect.

As an alternative to storing the WebPages on disk, you could elect to insert the WebPage source and read it from the database (requires code modification).  This method will make better use of disk space (due to cluster allocation), but in my experience, personally and professionally, once the database table reaches N rows (depends on your system) the table becomes unmanageable (very subjective, I know), and operations such as re-organize and re-build simply take too long.

As an example, in 2008 I was aware of a 400 million row table that stored the text of posts (not complete webpages).  It ran on a 16-way server with 64GB of RAM and a 50-drive SAN, and re-organizing the index took 24 hours.  It was clear that this particular implementation would not sustain its rate of growth.

However, if you are building static indexes, this solution may work for you.

1.) How many WebPages total?

2.) What is the desired implementation of AN?

3.) When you do get the index back up to size, do take a screenshot, please.

Top 25 Contributor
19 Posts

Sorry about the delayed response here.

1) would be indexing about 10-15 webpages total

2) to allow users to perform searches on the 10-15 webpage index while updating the index potentially twice a year

3) will do

What is the code modification to elect to insert WebPage source and read from the DB?  I know there is at least one config change but there must be more...

Thanks!!

Top 10 Contributor
1,696 Posts

1.) I mean, how many webpages have you collected for the size you have on disk?  :D

2.) This may be a viable option, storing webpages in the DB.

3.) Cool.  OK.

The modification?  I will make it for you as this is a nice feature.

It will first examine the filesystem for the Discovery, then examine the DB, and finally, if the Discovery isn't found in either, report the missing Discovery to the user and log the exception to the database.

Top 25 Contributor
19 Posts

1) sorry I had only collected 1 webpage for the size

So you will make the code changes for storing webpages in the DB and I can pull it down from SVN?

Top 10 Contributor
1,696 Posts

OK.

Yes.  I will test it tonight and tomorrow and likely check it in tomorrow evening.

Top 10 Contributor
1,696 Posts

Checked in.  :D

Top 25 Contributor
19 Posts

awesome thanks!!  So the console project is where the changes have been made then?  Should I just be able to run the console and it will now put everything in DB instead of file system?  Are there any changes to the Web project as well?

thanks again

Top 10 Contributor
1,696 Posts

The changes were made in Plugins\SearchManager.cs and in the Web project.

When requesting functionality that requires the WebPage Source, AN will check the filesystem first, and then check the DB, and if no WebPage Source is found a message will be presented to the user indicating an error and a row will be inserted into the Exceptions table.
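The lookup order described above can be sketched roughly as follows. This is a language-neutral illustration in Python, not the code in Plugins\SearchManager.cs: SQLite stands in for SQL Server, and the `ID` column, the `Message` column, and the `<id>.html` file naming are hypothetical. Only the WebPages and Exceptions table names and the Source column come from this thread.

```python
import sqlite3
import tempfile
from pathlib import Path
from typing import Optional

def load_source(page_id: int, pages_dir: Path, db: sqlite3.Connection) -> Optional[bytes]:
    """Return the page source from disk, else from the DB, else log and return None."""
    path = pages_dir / f"{page_id}.html"  # hypothetical file naming scheme
    if path.is_file():
        return path.read_bytes()
    row = db.execute("SELECT Source FROM WebPages WHERE ID = ?", (page_id,)).fetchone()
    if row is not None and row[0] is not None:
        return bytes(row[0])
    # Not found in either place: record the failure so it can be diagnosed later.
    db.execute(
        "INSERT INTO Exceptions (Message) VALUES (?)",
        (f"WebPage Source not found for ID {page_id}",),
    )
    db.commit()
    return None

def demo():
    """Exercise all three outcomes: found on disk, found in DB, missing."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE WebPages (ID INTEGER PRIMARY KEY, Source BLOB)")
    db.execute("CREATE TABLE Exceptions (Message TEXT)")
    db.execute("INSERT INTO WebPages VALUES (2, ?)", (b"<html>from db</html>",))
    with tempfile.TemporaryDirectory() as tmp:
        pages_dir = Path(tmp)
        (pages_dir / "1.html").write_bytes(b"<html>from disk</html>")
        results = [load_source(i, pages_dir, db) for i in (1, 2, 3)]
    logged = db.execute("SELECT COUNT(*) FROM Exceptions").fetchone()[0]
    db.close()
    return results, logged
```

Calling `demo()` returns the three lookup results plus the number of rows written to the stand-in Exceptions table; the caller would show an error message to the user whenever `load_source` returns `None`.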

Check out the SVN history to see the changes.

(TortoiseSVN > Show log)

You can compare revisions to see the changes.

To store the WebPage Source in the database you need to set ApplicationSettings.InsertWebPageSource = true, and (optionally) set ApplicationSettings.SaveDiscoveredWebPagesToDisk = false.

That should be it.

Whenever a modification needs to be made to AN that is for the good of AN, and for others, I am happy to commit to the modification.  This includes bug fixes and features such as this.  When in doubt, please ask.  :D

You are very welcome.

Top 25 Contributor
19 Posts
ptrennum replied on Sat, Jul 16 2011 11:01 AM

Thanks a lot Mike.  Just to get your final opinion on this...

Do you think that saving to the DB will actually reduce the storage requirements and if yes by approximately how much?

Thanks again!

Top 10 Contributor
1,696 Posts

You are very welcome.

I will put on a crawl to give some real numbers.

Be back later today...

Top 25 Contributor
19 Posts

Thanks Mike this is great info and exactly what I was wondering!


copyright 2004-2013, arachnode.net LLC