arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE or MongoDB/RavenDB/Hadoop

Handling FTP links.

Answered (Verified) This post has 2 verified answers | 8 Replies | 2 Followers

Top 50 Contributor
8 Posts
ucg posted on Mon, Jan 10 2011 11:22 AM

Does Arachnode handle links of the form <A href="ftp://www.xxx.com/dir/filename.zip">?  I modified the tables to accept ".zip" as a file type, but there appears to be a problem with the "ftp:" scheme since it is not HTTP.

All Replies

Top 10 Contributor
1,905 Posts

Select * From cfg.AllowedSchemes

Check 'DisallowedAbsoluteUris'.  The 'ftp://' links should be there.

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 50 Contributor
8 Posts
ucg replied on Tue, Jan 11 2011 1:51 PM

Are there any plans to support the FTP scheme then since I do not see it in the cfg.AllowedSchemes?

Top 10 Contributor
1,905 Posts

Oh, try adding the value.  :)

Top 50 Contributor
8 Posts
ucg replied on Wed, Feb 2 2011 12:58 PM

Did that.  I got an exception and traced it back to:

HttpWebRequest = (HttpWebRequest) WebRequest.Create(absoluteUri); 

in Arachnode.SiteCrawler.Components.WebClient.GetWebResponse

The exception is:

System.InvalidCastException: Unable to cast object of type 'System.Net.FtpWebRequest' to type 'System.Net.HttpWebRequest'.
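
Editor's note: the cast fails because WebRequest.Create returns a scheme-specific subclass, an FtpWebRequest for ftp:// URIs and an HttpWebRequest for http:// ones. Below is a minimal sketch of branching on the runtime type instead of casting unconditionally. The names RequestFactory and Describe are hypothetical and for illustration only; this is not the arachnode.net code or the fix that was checked in.

```csharp
using System;
using System.Net;

public static class RequestFactory
{
    // Hypothetical helper: creates a request for the URI and reports which
    // scheme-specific type WebRequest.Create produced. No network I/O occurs
    // until GetResponse() is called, so this is safe to run offline.
    public static string Describe(string absoluteUri)
    {
        WebRequest request = WebRequest.Create(absoluteUri);

        HttpWebRequest http = request as HttpWebRequest;
        if (http != null)
        {
            http.Method = "GET";
            return "http";
        }

        FtpWebRequest ftp = request as FtpWebRequest;
        if (ftp != null)
        {
            // FtpWebRequest.Method expects an FTP command such as
            // WebRequestMethods.Ftp.DownloadFile ("RETR"), not an HTTP verb.
            ftp.Method = WebRequestMethods.Ftp.DownloadFile;
            return "ftp";
        }

        return "other";
    }
}
```

Using `as` with a null check keeps the HTTP path unchanged while giving FTP URIs their own branch, which avoids the InvalidCastException above.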

 

Top 10 Contributor
1,905 Posts

Got it.  I will fix this tonight and check in.

Thanks!
Mike 

Top 10 Contributor
1,905 Posts

Performing tests on the FTP functionality now.

Top 10 Contributor
1,905 Posts

Checked in.

Top 50 Contributor
8 Posts
Answered (Verified) ucg replied on Thu, Mar 10 2011 6:02 AM
Verified by arachnode.net

Thanks for updating the code for FTP links.

I'm just getting through testing the FTP downloading.  I got an exception at "FtpWebRequest.Method = method;" in WebClient.GetFtpWebResponse: the FtpWebRequest object does not have a set_Method method.  I commented that line out and it works fine, except that it throws up View/Save prompts.  I eliminated those by adding code to DataTypeManager.DetermineDataType(CrawlRequest crawlRequest) as follows:

                * * *

                // existing code.
                string contentType = crawlRequest.WebClient.HttpWebResponse.ContentType.Split(';')[0].ToLower().Replace("\"", "");

                // Force the content type based on the extension.
                if (extension.Equals(".pdf") && (!contentType.Equals("application/pdf"))) {
                    contentType = "application/pdf";
                }
                else if (extension.Equals(".zip") && (!contentType.Equals("application/zip"))) {
                    contentType = "application/zip";
                }
                else if (extension.Equals(".xls") && (!contentType.Equals("application/vnd.ms-excel"))) {
                    contentType = "application/vnd.ms-excel";
                }
                else if ((extension.Equals(".doc") || extension.Equals(".docx")) && (!contentType.Equals("application/msword"))) {
                    contentType = "application/msword";
                }

This forces the content type which was defaulting to "text/html" and causing the prompts.
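
Editor's note: the if/else chain above could also be written as a lookup table, which makes adding further extensions a one-line change. This is a minimal sketch under assumptions: ContentTypeOverrides and Resolve are hypothetical names, not part of arachnode.net, and the mappings simply mirror the ones in the post.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ContentTypeOverrides
{
    // Extension-to-content-type table mirroring the post's mappings.
    // The post maps .docx to "application/msword"; that choice is kept here.
    private static readonly Dictionary<string, string> ByExtension =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { ".pdf", "application/pdf" },
            { ".zip", "application/zip" },
            { ".xls", "application/vnd.ms-excel" },
            { ".doc", "application/msword" },
            { ".docx", "application/msword" }
        };

    // Returns the forced content type when the URI's extension is in the
    // table; otherwise returns the content type the server reported.
    public static string Resolve(string absoluteUri, string reportedContentType)
    {
        string extension = Path.GetExtension(new Uri(absoluteUri).AbsolutePath);
        string forced;
        return ByExtension.TryGetValue(extension, out forced) ? forced : reportedContentType;
    }
}
```

For example, Resolve("ftp://www.xxx.com/dir/filename.zip", "text/html") yields "application/zip", while a URI with an unlisted extension keeps the reported type.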

copyright 2004-2017, arachnode.net LLC