Hello,
As I'm working on upgrading the crawler (thanks for that, BTW), I'm trying to tweak it to better suit our current needs. Maybe you have some pointers on the following issues:
Thank you in advance, Runny
1.) Look in Program.cs. You can now turn off anything you don't want to collect. I would first adjust the settings defined in ApplicationSettings.cs, which are assigned in Program.cs. You'll also see that you can turn off each of the rules in Program.cs. It's really up to you what you want to collect. If you don't want to collect HyperLinks, turn off InsertHyperLinks, etc. The rules do affect crawling, so which ones you leave enabled is also up to you.
//ApplicationSettings can be set from code, overriding Database settings found in cfg.Configuration.
ApplicationSettings.AssignCrawlRequestPrioritiesForFiles = true;
ApplicationSettings.AssignCrawlRequestPrioritiesForHyperLinks = true;
ApplicationSettings.AssignCrawlRequestPrioritiesForImages = true;
ApplicationSettings.AssignCrawlRequestPrioritiesForWebPages = true;
ApplicationSettings.AssignEmailAddressDiscoveries = false;
ApplicationSettings.AssignFileAndImageDiscoveries = false;
ApplicationSettings.AssignHyperLinkDiscoveries = true;
ApplicationSettings.ClassifyAbsoluteUris = false;
//ApplicationSettings.ConnectionString = "";
//ApplicationSettings.ConsoleOutputLogsDirectory = "";
ApplicationSettings.CrawlRequestTimeoutInMinutes = 1;
ApplicationSettings.CreateCrawlRequestsFromDatabaseCrawlRequests = true;
ApplicationSettings.CreateCrawlRequestsFromDatabaseFiles = false;
ApplicationSettings.CreateCrawlRequestsFromDatabaseHyperLinks = false;
ApplicationSettings.CreateCrawlRequestsFromDatabaseImages = false;
ApplicationSettings.CreateCrawlRequestsFromDatabaseWebPages = false;
ApplicationSettings.DesiredMaximumMemoryUsageInMegabytes = 1024;
//ApplicationSettings.DownloadedFilesDirectory = "";
//ApplicationSettings.DownloadedImagesDirectory = "";
//ApplicationSettings.DownloadedWebPagesDirectory = "";
ApplicationSettings.EnableConsoleOutput = true;
ApplicationSettings.ExtractFileMetaData = false;
ApplicationSettings.ExtractImageMetaData = false;
ApplicationSettings.ExtractWebPageMetaData = false;
ApplicationSettings.HttpWebRequestRetries = 5;
ApplicationSettings.InsertDisallowedAbsoluteUriDiscoveries = false;
ApplicationSettings.InsertDisallowedAbsoluteUris = true;
ApplicationSettings.InsertEmailAddressDiscoveries = false;
ApplicationSettings.InsertEmailAddresses = false;
ApplicationSettings.InsertExceptions = true;
ApplicationSettings.InsertFileDiscoveries = false;
ApplicationSettings.InsertFileMetaData = false;
ApplicationSettings.InsertFiles = false;
ApplicationSettings.InsertFileSource = false;
ApplicationSettings.InsertHyperLinkDiscoveries = false;
ApplicationSettings.InsertHyperLinks = false;
ApplicationSettings.InsertImageDiscoveries = false;
ApplicationSettings.InsertImageMetaData = false;
ApplicationSettings.InsertImages = false;
ApplicationSettings.InsertImageSource = false;
ApplicationSettings.InsertWebPageMetaData = false;
ApplicationSettings.InsertWebPages = true;
ApplicationSettings.InsertWebPageSource = false;
ApplicationSettings.MaximumNumberOfCrawlRequestsToCreatePerBatch = 1000;
ApplicationSettings.MaximumNumberOfCrawlThreads = 10;
ApplicationSettings.MaximumNumberOfHostsAndPrioritiesToSelect = 10000;
ApplicationSettings.OutputConsoleToLogs = false;
ApplicationSettings.OutputStatistics = false;
ApplicationSettings.SaveDiscoveredFilesToDisk = false;
ApplicationSettings.SaveDiscoveredImagesToDisk = false;
ApplicationSettings.SaveDiscoveredWebPagesToDisk = true;
ApplicationSettings.SqlCommandTimeoutInMinutes = 60;
//If you find yourself blocked from crawling a website, change this to a common crawler string, such as 'Googlebot' or 'Slurp'...
ApplicationSettings.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.71 Safari/534.24";
//Enable VerboseOutput to see each Discovery and the status of each Discovery returned from each Discovery. (e.g. WebPages from each WebPage and Files/Images from each WebPage.)
ApplicationSettings.VerboseOutput = false;
...this is what I would configure for "WebPages only"...
2.) Yes. Check out UriClassificationType.cs
[Flags]
public enum UriClassificationType : short
{
    None = 0,
    Domain = 1,
    Extension = 2,
    FileExtension = 4,
    Host = 8,
    Scheme = 16,
    OriginalDirectoryLevelUp = 32,
    OriginalDirectory = 64,
    OriginalDirectoryLevel = 128,
    OriginalDirectoryLevelDown = 256
}
These values can be bitwise OR'd, so you can specify OriginalDirectory and Domain, or whatever combination you want. Look at IsDiscoveryRestricted(...) in DiscoveryManager.cs for more about how the UriClassificationType directs crawling.
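To make the combining concrete, here is a minimal standalone sketch of OR-ing two of the values and testing which ones are set. The enum is copied from UriClassificationType.cs as quoted above; the UriClassificationDemo class and its Includes helper are hypothetical names used only for illustration and are not part of the crawler. How IsDiscoveryRestricted(...) actually evaluates the flags is not shown here.

```csharp
using System;

// Copied from the enum quoted above so the sketch compiles on its own.
[Flags]
public enum UriClassificationType : short
{
    None = 0,
    Domain = 1,
    Extension = 2,
    FileExtension = 4,
    Host = 8,
    Scheme = 16,
    OriginalDirectoryLevelUp = 32,
    OriginalDirectory = 64,
    OriginalDirectoryLevel = 128,
    OriginalDirectoryLevelDown = 256
}

// Hypothetical demo class; not part of the crawler's code base.
public static class UriClassificationDemo
{
    // Returns true when every bit of 'flag' is set in 'configured'.
    public static bool Includes(UriClassificationType configured, UriClassificationType flag)
    {
        return (configured & flag) == flag;
    }

    public static void Main()
    {
        // Combine two classifications with the bitwise OR operator.
        UriClassificationType restrictTo =
            UriClassificationType.Domain | UriClassificationType.OriginalDirectory;

        Console.WriteLine(restrictTo);          // Domain, OriginalDirectory
        Console.WriteLine((short)restrictTo);   // 65

        Console.WriteLine(Includes(restrictTo, UriClassificationType.Domain)); // True
        Console.WriteLine(Includes(restrictTo, UriClassificationType.Host));   // False
    }
}
```

Because each member is a distinct power of two, any combination maps to a unique numeric value, which is what lets a single setting carry several classifications at once.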
Also, look at IsStorable.cs. You can allow crawling but only store data using IsStorable.
Mike
For best service when you require assistance:
Skype: arachnodedotnet