I am wondering what options are available to reduce the amount of storage required for saving webpages etc. Right now after indexing two sites I am at about 7-8GB between saved webpages and Lucene indexes.
I am not storing images or anything else other than the webpages and the lucene indexes. Basically I want to be able to continue indexing and allowing for searches on my indexes without getting up to huge storage numbers if possible.
Thanks!
It does not happen very often but I may be wrong on this...
The space 'saved' by storing in the 8KB extents in the DB and efficiently on disk (Size: = Size on disk:) seems to be offset by the information required to either store a '0x0' in the Source column (less space) or a NULL value (more space)... So, unless you were storing a lot of files that were significantly below the cluster size of the drive, it is likely that SQL will take up slightly MORE space than storing on disk. My tests show about a 5% overhead.
(I compared the size of the WebPages directory with the size of the shrunk WebPages FILEGROUP file before and after setting the Source column of the WebPages table to '0x0', and the difference was greater than the size of the WebPages directory. Setting the Source column to 'NULL' increased the difference.)
http://msdn.microsoft.com/en-us/library/aa174529(v=sql.80).aspx
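To make the cluster-size point concrete, here is a minimal Python sketch (not AN code) of how "Size on disk" grows when space is handed out in whole clusters. The 4096-byte cluster size and the function names are assumptions for illustration only; check the actual cluster size of your volume, and note that the sketch ignores filesystem details such as very small files being stored inline.

```python
def size_on_disk(file_size: int, cluster_size: int = 4096) -> int:
    """Bytes allocated for one file when space is handed out in whole clusters.

    cluster_size is an assumed value; real volumes may differ.
    """
    if file_size <= 0:
        return 0
    clusters = -(-file_size // cluster_size)  # integer ceiling division
    return clusters * cluster_size


def slack_ratio(file_sizes, cluster_size: int = 4096) -> float:
    """Wasted (slack) bytes as a fraction of the real data size."""
    total = sum(file_sizes)
    if total == 0:
        return 0.0
    allocated = sum(size_on_disk(s, cluster_size) for s in file_sizes)
    return (allocated - total) / total
```

Feeding it the real file sizes from the WebPages directory gives a rough idea of how much slack there is to recover; if most pages are several clusters long, the ratio will be small and the DB is unlikely to come out ahead.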
What you might want to try is either compressing the Source column or compressing the DownloadedWebPages folder.
http://www.microsoft.com/sqlserver/2008/en/us/compression.aspx
What are the distinct sizes of the WebPages directory and the Lucene.NET indexes?
The only thing you can really do to reduce the storage for WebPages is to compress the directory.
You can also set the AutoGrow to something less than 1GB for the DB files. (this will free up space on your drive(s) too...)
Would you please take a screenshot of the Lucene.NET directory?
The webpages directory is about 6GB and the Lucene.Net indexes are about 1.5GB
I'm not sure what you mean by compress the WebPages directory?
I will make the autogrow less than 1GB thanks!
I would take a screenshot but I have just restored everything to an older (smaller) version as I completely ran out of space on my server.
I'm trying to figure out why I need the WebPages to be saved? I tried changing the config to insertwebpages but when I ran a search no results were returned. As soon as I set it back to save webpages the search results worked again. Do I have to save webpages in order for the search to work?
Compress: Let Windows compress the contents of the files. You can always experiment with things like this as well:
http://support.microsoft.com/kb/307987
http://en.wikipedia.org/wiki/Data_cluster
http://support.microsoft.com/kb/140365
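Before turning compression on, you can get a rough feel for how well the saved pages compress. The sketch below is only an estimate under stated assumptions: it uses Python's zlib (DEFLATE), which is not the algorithm Windows uses for compressed folders, so the real savings will differ. The function name and the path in the usage line are hypothetical.

```python
import os
import zlib


def estimate_compression(root: str, level: int = 6):
    """Walk a directory and return (raw_bytes, compressed_bytes).

    Each file is compressed on its own with zlib, purely as an estimate.
    """
    raw = 0
    packed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                data = fh.read()
            raw += len(data)
            packed += len(zlib.compress(data, level))
    return raw, packed
```

Usage would look like `raw, packed = estimate_compression(r"C:\path\to\DownloadedWebPages")`. HTML is mostly repetitive text, so it usually compresses well, but measure on your own data before counting on it.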
Yes, you do need to store the WebPages; otherwise Lucene.Net won't have anything to summarize. The text of the pages isn't stored in Lucene.Net because my testing indicated that searching was much faster when retrieving the WebPage from disk rather than extracting it from the index itself. Optimizing the index also takes significantly longer when you have large amounts of data, as you would expect.
As an alternative to storing the WebPages on disk, you could elect to insert the WebPage source and read it from the database (requires code modification). This method makes better use of disk space (due to cluster allocation), but in my experience, personally and professionally, once the database table reaches N rows (depends on your system) the table becomes unmanageable (very subjective, I know), and operations such as re-organize and re-build simply take too long.
As an example, in 2008 I was aware of a 400 million row table that stored the text of posts (not complete webpages) on a 16-way server with 64GB of RAM and a 50-drive SAN. It took 24 hours to re-organize the index, and it was clear that this particular implementation would not sustain the current rate of growth.
However, if you are building static indexes, this solution may work for you.
1.) How many WebPages total?
2.) What is the desired implementation of AN?
3.) When you do get the index back up to size, do take a screenshot, please.
Sorry about the delayed response here.
1) would be indexing about 10-15 webpages total
2) to allow users to perform searches on the 10-15 webpage index while updating the index potentially twice a year
3)will do
What is the code modification to elect to insert WebPage source and read from the DB? I know there is at least one config change but there must be more...
Thanks!!
1.) I mean, how many webpages have you collected for the size you have on disk?
2.) This may be a viable option, storing webpages in the DB.
3.) Cool. OK.
The modification? I will make it for you as this is a nice feature.
It will (1) examine the filesystem for the Discovery, then (2) examine the DB, and finally (3), if the Discovery isn't found in either place, report the missing Discovery to the user and log the exception to the database.
1) sorry I had only collected 1 webpage for the size
So you will make the code changes for storing webpages in the DB and I can pull it down from SVN?
OK.
Yes. I will test it tonight and tomorrow and likely check it in tomorrow evening.
Checked in.
Awesome, thanks!! So the console project is where the changes have been made, then? Should I just be able to run the console and have it put everything in the DB instead of the file system? Are there any changes to the Web project as well?
thanks again
The changes were made in Plugins\SearchManager.cs and in the Web project.
When requesting functionality that requires the WebPage Source, AN will check the filesystem first, and then check the DB, and if no WebPage Source is found a message will be presented to the user indicating an error and a row will be inserted into the Exceptions table.
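The lookup order described above can be sketched roughly as follows. This is illustrative Python, not the actual code in Plugins\SearchManager.cs; the function name, the exception class, and the three callables are all hypothetical stand-ins for the filesystem read, the DB read, and the insert into the Exceptions table.

```python
from typing import Callable, Optional


class WebPageSourceNotFound(Exception):
    """Raised when neither the filesystem nor the DB has the page source."""


def load_web_page_source(
    key: str,
    read_from_disk: Callable[[str], Optional[str]],
    read_from_db: Callable[[str], Optional[str]],
    log_exception: Callable[[Exception], None],
) -> str:
    # Try the filesystem first, then the database, in that order.
    for reader in (read_from_disk, read_from_db):
        source = reader(key)
        if source is not None:
            return source
    # Nothing found: record the failure, then surface it to the caller,
    # which is responsible for showing a message to the user.
    error = WebPageSourceNotFound(f"No WebPage Source found for {key!r}")
    log_exception(error)
    raise error
```

Checking the filesystem first means existing installs that already save pages to disk keep working unchanged, and the DB is only consulted when the file is missing.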
Check out the SVN history to see the changes.
(TortoiseSVN > Show log)
You can compare revisions to see the changes.
To store the WebPage Source in the database you need to set ApplicationSettings.InsertWebPageSource = true, and (optionally) set ApplicationSettings.SaveDiscoveredWebPagesToDisk = false.
That should be it.
Whenever a modification needs to be made to AN that is for the good of AN, and for others, I am happy to commit to the modification. This includes bug fixes and features such as this. When in doubt, please ask.
You are very welcome.
Thanks a lot Mike. Just to get your final opinion on this...
Do you think that saving to the DB will actually reduce the storage requirements and if yes by approximately how much?
Thanks again!
I will put on a crawl to give some real numbers.
Be back later today...
Thanks Mike this is great info and exactly what I was wondering!