arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL 2005/2008/CE

un-encoded url as file path


ba100 posted on Tue, Feb 28 2012 7:04 AM

Hello

I am saving discovered web pages to disk. However, I would also like each saved file to reflect the exact URL of its page.

For example, when www.mydomain.com/intro.html is stored inside the ...\www\mydomain\com\ folder, I would like the file name to be intro.html. Is this possible?

 

Thanks

 

Bimalka

All Replies


You could, but certain characters, character combinations, and file name lengths that are permitted on non-Windows web servers won't save to disk properly. Also, the DiscoveryManager accounts for a maximum path length, including the hash. So if the directory path is long and the file name is very long, you won't be able to save the file unless you invoke some semi-documented or undocumented Windows APIs that allow paths of nearly unlimited length.

You can change this behavior if you want to, in DiscoveryManager.cs, here: public static string GetDiscoveryPath(string downloadedDiscoveryDirectory, string absoluteUri, string fullTextIndexType)
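
As a rough illustration of the trade-offs above, here is a minimal sketch of how a readable path could be built from a URL. It is not the arachnode.net implementation: the class ReadablePathSketch, the method GetReadableDiscoveryPath, the "index.html" fallback, and the 248-character limit are all hypothetical choices made for this example, and it assumes .NET 4 or later. The host-to-folder layout simply mirrors the ...\www\mydomain\com\ example from the question.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper, NOT part of arachnode.net.
public static class ReadablePathSketch
{
    // Assumed conservative limit; the classic Windows MAX_PATH is 260 characters.
    private const int MaxPathLength = 248;

    // Expects an absolute URI including the scheme, e.g. "http://www.mydomain.com/intro.html".
    // Returns null when the resulting path would exceed MaxPathLength.
    public static string GetReadableDiscoveryPath(string downloadedDiscoveryDirectory, string absoluteUri)
    {
        Uri uri = new Uri(absoluteUri);

        // "www.mydomain.com" becomes the nested folders www, mydomain, com.
        string hostDirectory = Path.Combine(uri.Host.Split('.'));

        // Take the last URL path segment, un-encode it, and drop any trailing slash.
        string fileName = Uri.UnescapeDataString(uri.Segments.Last()).Trim('/');

        // URLs that end at the host root have no file name; use an assumed default.
        if (fileName.Length == 0)
        {
            fileName = "index.html";
        }

        // Replace characters the local file system does not allow in file names.
        foreach (char invalid in Path.GetInvalidFileNameChars())
        {
            fileName = fileName.Replace(invalid, '_');
        }

        string path = Path.Combine(downloadedDiscoveryDirectory, hostDirectory, fileName);

        // Let the caller fall back to another naming scheme for over-long paths.
        return path.Length > MaxPathLength ? null : path;
    }
}
```

Note that this sketch ignores intermediate URL directories and query strings, so two different URLs can map to the same file. It also does not handle reserved Windows device names such as CON or NUL. Any real change to GetDiscoveryPath would need to deal with those cases.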

Mike

 



copyright 2004-2013, arachnode.net LLC