I need access to the contents of documents in the CrawlRequestCompleted event. For PDFs, I can get this easily from the byte array attached to the CrawlRequest using PDFManager's GetText(byte[]) method. For Office documents, the best I've been able to do so far in the API is to use DOCManager's GetText(string discoveryPath) method, which I believe reads the file from the hard drive. Ideally, though, I would prefer not to save the files to the hard drive at all.
These are the approaches I'm considering
Which would you recommend, and how would you recommend I go about it? I've had trouble figuring out how to do the former through the API. For the latter, I imagine I can find a solution from outside Arachnode, but I was hoping there might be something built in.
I was able to do what I needed. I changed the configuration so that Files are no longer saved to the hard drive. I now save each file myself, pass its path to the DOCManager, then delete the file when it's done. I ran into one problem doing this: DOCManager was not releasing the file when it was done, so I couldn't delete it. I had to add this line of code inside DOCManager, in a finally clause:
Marshal.ReleaseComObject(ifilt);
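In case it helps anyone else, here is a rough sketch of the save / extract / delete pattern described above. It is not Arachnode code: TempFileText, Extract, and ReleaseIfCom are hypothetical names, and extractText is just a stand-in for whatever path-based extractor you call (in my case DOCManager's GetText(string discoveryPath)).

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

// Hypothetical helper, not part of Arachnode.
public static class TempFileText
{
    // Writes the bytes to a uniquely named temp file, runs a path-based
    // extractor on it, and always deletes the file afterwards.
    // extractText is an assumed stand-in for a call such as
    // DOCManager.GetText(string discoveryPath).
    public static string Extract(byte[] data, string extension, Func<string, string> extractText)
    {
        string path = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N") + extension);
        File.WriteAllBytes(path, data);
        try
        {
            return extractText(path);
        }
        finally
        {
            // Throws IOException if the extractor still holds the file open,
            // which is the symptom described above.
            File.Delete(path);
        }
    }

    // Releases a COM wrapper (for example an IFilter instance) if it is one.
    // The guard avoids the ArgumentException that Marshal.ReleaseComObject
    // throws for ordinary managed objects.
    public static void ReleaseIfCom(object candidate)
    {
        if (candidate != null && Marshal.IsComObject(candidate))
        {
            Marshal.ReleaseComObject(candidate);
        }
    }
}
```

The delete only succeeds once the extractor has let go of the file, which is why the release call has to run in the extractor's own finally clause before control returns here.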
Look at DiscoveryManager.GetDiscoveryPath(...) to get the location of the file on disk.
Looks like you will have to delete the file from disk. The API that I am using only has facilities to load from disk.
If you can find a better way to read from a stream, please let me know.
Take a look at this: http://www.codeproject.com/KB/cs/IFilter.aspx It has been a long time since I looked at the DOC functionality, so let me know if you find something better...
Mike
Thanks. I'll update the finally clause.