arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL 2005/2008/CE

Managing Files

bscott posted on Mon, May 23 2011 7:32 AM

I need access to the contents of documents in the CrawlRequestCompleted event.  For PDFs, I can get this easily from the byte array attached to the CrawlRequest using the PDFManager's GetText(byte[]) method.  For Office documents, the best I've found so far in the API is DOCManager's GetText(string discoveryPath) method, which I believe reads the file from the hard drive.  Ideally, I would prefer not to save the files to the hard drive at all.

These are the approaches I'm considering:

  • Delete each file after I'm done processing it
  • Find another way to read the document contents

Which would you recommend, and how would you recommend I go about it?  I've had trouble figuring out how to do the former through the API.  For the latter, I imagine I can find a solution from outside Arachnode, but I was hoping there might be something built in.

All Replies

Top 10 Contributor
1,714 Posts

Look at DiscoveryManager.GetDiscoveryPath(...) to get the location of the file on disk.

Looks like you will have to delete the file from disk.  The API that I am using only has facilities to load from disk.

If you can find a better way to read from a stream please let me know.

Take a look at this: http://www.codeproject.com/KB/cs/IFilter.aspx  It has been a long time since I looked at the DOC functionality, so let me know if you find something better...

Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

bscott replied on Wed, May 25 2011 8:35 AM
Verified by bscott

I was able to do what I needed.  I changed the configuration so that files are no longer saved to the hard drive.  Instead, I save each file myself, pass its path to the DOCManager, and delete the file when processing is done.  I ran into one problem doing this: DOCManager was not releasing the file when it finished, so I couldn't delete it.  I had to add this line of code inside DOCManager, in a finally clause:

Marshal.ReleaseComObject(ifilt);
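
For anyone following the same route, the save / extract / delete workflow described above might be sketched roughly as follows. This is a minimal sketch, not arachnode.net code: the class name TempFileTextExtractor and the extractFromPath delegate are hypothetical, with the delegate standing in for a path-based extractor such as DOCManager's GetText(string discoveryPath). The delete in the finally block only succeeds if the extractor has released its handle on the file.

```csharp
using System;
using System.IO;

public static class TempFileTextExtractor
{
    // Hypothetical helper: writes the downloaded bytes to a uniquely named
    // temp file, hands the path to a path-based extractor, and always
    // removes the file afterwards.
    public static string ExtractText(
        byte[] data,
        string extension,
        Func<string, string> extractFromPath)
    {
        if (data == null) throw new ArgumentNullException(nameof(data));
        if (extractFromPath == null) throw new ArgumentNullException(nameof(extractFromPath));

        // A GUID-based name avoids collisions between concurrent crawl threads.
        string path = Path.Combine(
            Path.GetTempPath(),
            Guid.NewGuid().ToString("N") + extension);

        try
        {
            File.WriteAllBytes(path, data);
            return extractFromPath(path);
        }
        finally
        {
            // Runs whether extraction succeeded or threw. File.Delete will
            // itself throw if the extractor still holds the file open.
            if (File.Exists(path))
            {
                File.Delete(path);
            }
        }
    }
}
```

In the CrawlRequestCompleted handler, the call would then look something like TempFileTextExtractor.ExtractText(bytes, ".doc", path => docManager.GetText(path)), where bytes is the byte array from the CrawlRequest and docManager is whatever DOCManager instance is already in use.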

Top 10 Contributor
1,714 Posts

Thanks.  I'll update the finally clause.
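
For reference, the general shape of a finally clause that releases a COM object could look something like the sketch below. It is an illustration only, not the actual DOCManager source: ComScope, create and use are hypothetical names, and the variable is called ifilt just to match the line posted above. The Marshal.IsComObject guard is an added assumption so the helper does not throw when handed an ordinary managed object.

```csharp
using System;
using System.Runtime.InteropServices;

public static class ComScope
{
    // Hypothetical helper: creates an object, uses it, and releases the
    // underlying COM reference in a finally block so that any file handle
    // held by the COM component is let go even when 'use' throws.
    public static T UseAndRelease<T>(Func<object> create, Func<object, T> use)
    {
        object ifilt = null;
        try
        {
            ifilt = create();
            return use(ifilt);
        }
        finally
        {
            // ReleaseComObject throws ArgumentException for non-COM objects,
            // so check first.
            if (ifilt != null && Marshal.IsComObject(ifilt))
            {
                Marshal.ReleaseComObject(ifilt);
            }
        }
    }
}
```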

copyright 2004-2013, arachnode.net LLC