arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL 2005/2008/CE

Limit Crawl Finds to a Particular folder


flash posted on Thu, Jan 26 2012 2:02 AM

Hello,

As I'm working on upgrading the crawler (thanks for that, BTW), I'm trying to tweak it to better suit our current needs.
Maybe you have some pointers on the following issues:

  1. Currently we only want to download plain HTML (WebPages and their Metadata).
    Which cfg.Configurations (and other cfg. settings) would you suggest?
  2. To speed up the crawl: is there a way to limit WebPage finds to a particular folder within the crawled site?
    What I mean is, all the HTML pages I need are in one category of the site, located under www.sitename.com/Sports. Can I instruct the crawler to save WebPages from that folder only?

Thank you in advance,
Runny

Verified Answer

Verified by arachnode.net

1.) Look in Program.cs.  You can now turn off anything you don't want to collect.  I would adjust the settings via ApplicationSettings.cs first, in Program.cs.  You'll also see that you can turn off each of the rules in Program.cs.  It's really up to you what you want to collect.  If you don't want to collect HyperLinks, turn off InsertHyperLinks, etc.  The rules do affect crawling, so which ones you leave enabled is also up to you.

                    //ApplicationSettings can be set from code, overriding Database settings found in cfg.Configuration.
                    ApplicationSettings.AssignCrawlRequestPrioritiesForFiles = true;
                    ApplicationSettings.AssignCrawlRequestPrioritiesForHyperLinks = true;
                    ApplicationSettings.AssignCrawlRequestPrioritiesForImages = true;
                    ApplicationSettings.AssignCrawlRequestPrioritiesForWebPages = true;
                    ApplicationSettings.AssignEmailAddressDiscoveries = false;
                    ApplicationSettings.AssignFileAndImageDiscoveries = false;
                    ApplicationSettings.AssignHyperLinkDiscoveries = true;
                    ApplicationSettings.ClassifyAbsoluteUris = false;
                    //ApplicationSettings.ConnectionString = "";
                    //ApplicationSettings.ConsoleOutputLogsDirectory = "";
                    ApplicationSettings.CrawlRequestTimeoutInMinutes = 1;
                    ApplicationSettings.CreateCrawlRequestsFromDatabaseCrawlRequests = true;
                    ApplicationSettings.CreateCrawlRequestsFromDatabaseFiles = false;
                    ApplicationSettings.CreateCrawlRequestsFromDatabaseHyperLinks = false;
                    ApplicationSettings.CreateCrawlRequestsFromDatabaseImages = false;
                    ApplicationSettings.CreateCrawlRequestsFromDatabaseWebPages = false;
                    ApplicationSettings.DesiredMaximumMemoryUsageInMegabytes = 1024;
                    //ApplicationSettings.DownloadedFilesDirectory = "";
                    //ApplicationSettings.DownloadedImagesDirectory = "";
                    //ApplicationSettings.DownloadedWebPagesDirectory = "";
                    ApplicationSettings.EnableConsoleOutput = true;
                    ApplicationSettings.ExtractFileMetaData = false;
                    ApplicationSettings.ExtractImageMetaData = false;
                    ApplicationSettings.ExtractWebPageMetaData = false;
                    ApplicationSettings.HttpWebRequestRetries = 5;
                    ApplicationSettings.InsertDisallowedAbsoluteUriDiscoveries = false;
                    ApplicationSettings.InsertDisallowedAbsoluteUris = true;
                    ApplicationSettings.InsertEmailAddressDiscoveries = false;
                    ApplicationSettings.InsertEmailAddresses = false;
                    ApplicationSettings.InsertExceptions = true;
                    ApplicationSettings.InsertFileDiscoveries = false;
                    ApplicationSettings.InsertFileMetaData = false;
                    ApplicationSettings.InsertFiles = false;
                    ApplicationSettings.InsertFileSource = false;
                    ApplicationSettings.InsertHyperLinkDiscoveries = false;
                    ApplicationSettings.InsertHyperLinks = false;
                    ApplicationSettings.InsertImageDiscoveries = false;
                    ApplicationSettings.InsertImageMetaData = false;
                    ApplicationSettings.InsertImages = false;
                    ApplicationSettings.InsertImageSource = false;
                    ApplicationSettings.InsertWebPageMetaData = false;
                    ApplicationSettings.InsertWebPages = true;
                    ApplicationSettings.InsertWebPageSource = false;
                    ApplicationSettings.MaximumNumberOfCrawlRequestsToCreatePerBatch = 1000;
                    ApplicationSettings.MaximumNumberOfCrawlThreads = 10;
                    ApplicationSettings.MaximumNumberOfHostsAndPrioritiesToSelect = 10000;
                    ApplicationSettings.OutputConsoleToLogs = false;
                    ApplicationSettings.OutputStatistics = false;
                    ApplicationSettings.SaveDiscoveredFilesToDisk = false;
                    ApplicationSettings.SaveDiscoveredImagesToDisk = false;
                    ApplicationSettings.SaveDiscoveredWebPagesToDisk = true;
                    ApplicationSettings.SqlCommandTimeoutInMinutes = 60;
                    ApplicationSettings.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.71 Safari/534.24"; //If you find yourself blocked from crawling a website, change this to a common crawler string, such as 'Googlebot' or 'Slurp'...
                    ApplicationSettings.VerboseOutput = false;
                    //Enable VerboseOutput to see each Discovery and the status of each Discovery returned from each Discovery.  (e.g. WebPages from each WebPage and Files/Images from each WebPage.)

...this is what I would configure for "WebPages only".

2.) Yes.  Check out UriClassificationType.cs

    [Flags]
    public enum UriClassificationType : short
    {
        None = 0,
        Domain = 1,
        Extension = 2,
        FileExtension = 4,
        Host = 8,
        Scheme = 16,
        OriginalDirectoryLevelUp = 32,
        OriginalDirectory = 64,
        OriginalDirectoryLevel = 128,
        OriginalDirectoryLevelDown = 256
    }
These values can be logically OR'd, so you can specify OriginalDirectory and Domain, or whatever combination you want.  Look at IsDiscoveryRestricted(...) in DiscoveryManager.cs for more about how the UriClassificationType directs crawling.
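As a rough illustration of the OR-ing described above, here is a minimal, self-contained sketch. The enum is copied from the snippet above so the example compiles on its own; the FlagsSketch class and its Combine and Has methods are hypothetical names made up for this example and are not part of arachnode.net. How the combined value is actually handed to the crawler is not shown here; see IsDiscoveryRestricted(...) for that.

```csharp
using System;

// Copy of the enum quoted above, repeated only so this sketch is self-contained.
[Flags]
public enum UriClassificationType : short
{
    None = 0,
    Domain = 1,
    Extension = 2,
    FileExtension = 4,
    Host = 8,
    Scheme = 16,
    OriginalDirectoryLevelUp = 32,
    OriginalDirectory = 64,
    OriginalDirectoryLevel = 128,
    OriginalDirectoryLevelDown = 256
}

// Hypothetical helper class for illustration only; not an arachnode.net type.
public static class FlagsSketch
{
    // Combine two restrictions with bitwise OR: Domain (1) | OriginalDirectory (64).
    public static UriClassificationType Combine()
    {
        return UriClassificationType.Domain | UriClassificationType.OriginalDirectory;
    }

    // Returns true when every bit of 'flag' is set in 'value'.
    public static bool Has(UriClassificationType value, UriClassificationType flag)
    {
        return (value & flag) == flag;
    }
}
```

Because each member is a distinct power of two, the combined value keeps both restrictions, and a bitwise AND recovers whether any single one is set.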
Also, look at IsStorable.cs.  You can allow crawling but only store data using IsStorable.
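For the www.sitename.com/Sports case from the question, the kind of test a "store only this folder" check needs might look like the sketch below. This is an assumption-laden illustration, not the IsStorable API: FolderFilterSketch and IsUnderFolder are hypothetical names, the code uses only System.Uri from the .NET base library, and treating the folder name as case-insensitive is my assumption.

```csharp
using System;

// Hypothetical helper for illustration only; not part of arachnode.net.
public static class FolderFilterSketch
{
    // Returns true when absoluteUri points at 'folder' itself or at something below it.
    public static bool IsUnderFolder(string absoluteUri, string folder)
    {
        Uri uri;
        if (!Uri.TryCreate(absoluteUri, UriKind.Absolute, out uri))
        {
            return false; // not a parseable absolute URI
        }

        // Normalize to "/Sports/" so that "/SportsNews/..." does not match by accident.
        string prefix = "/" + folder.Trim('/') + "/";

        // Append a trailing slash so that the bare folder "/Sports" also matches.
        string path = uri.AbsolutePath.EndsWith("/", StringComparison.Ordinal)
            ? uri.AbsolutePath
            : uri.AbsolutePath + "/";

        // Assumption: folder names are compared case-insensitively.
        return path.StartsWith(prefix, StringComparison.OrdinalIgnoreCase);
    }
}
```

A check like this would let the crawler keep following links across the whole site while storing only WebPages whose AbsoluteUri falls under the chosen folder.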
Mike



copyright 2004-2013, arachnode.net LLC