arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL 2005/2008/CE

How to crawl only a specific section of a site

InvestisDev posted on Mon, Mar 19 2012 12:12 AM

Hello,

We need to crawl a multilingual site whose URLs look like this:

English pages start with -> http://xyz.com/en/homepage.aspx

German pages start with -> http://xyz.com/de-DE//homepage.aspx

When we crawl the English site, the crawler also picks up the German URLs.

Is there any way to restrict a crawl to URLs that start with a given path?

Thanks

Verified by InvestisDev

Look at this post...

http://arachnode.net/forums/p/1718/15661.aspx#15661

    [Flags]
    public enum UriClassificationType : short
    {
        None = 0,
        Domain = 1,
        Extension = 2,
        FileExtension = 4,
        Host = 8,
        Scheme = 16,
        OriginalDirectoryLevelUp = 32,
        OriginalDirectory = 64,
        OriginalDirectoryLevel = 128,
        OriginalDirectoryLevelDown = 256
    }
Logically 'OR' OriginalDirectory with whatever else you'd like to restrict your crawl to. (OriginalDirectory will restrict the crawl to /en/ or /de-DE/...)
Thanks,
Mike
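
To make the "logical OR" advice concrete, here is a minimal sketch of combining the flags and of what a directory restriction could look like for the URLs in the question. The enum is copied from the answer above. RestrictionSketch, DirectoryOf and IsAllowed are hypothetical names made up for illustration and are not part of arachnode.net. How the combined value is handed to the crawler is not shown, since the thread does not say, and the path check is only an assumption about what restricting to OriginalDirectory amounts to.

```csharp
using System;

[Flags]
public enum UriClassificationType : short
{
    None = 0,
    Domain = 1,
    Extension = 2,
    FileExtension = 4,
    Host = 8,
    Scheme = 16,
    OriginalDirectoryLevelUp = 32,
    OriginalDirectory = 64,
    OriginalDirectoryLevel = 128,
    OriginalDirectoryLevelDown = 256
}

// Hypothetical illustration only; this is NOT arachnode.net's implementation.
public static class RestrictionSketch
{
    // Returns the directory part of the URI path, e.g. "/en/" for "/en/homepage.aspx".
    public static string DirectoryOf(Uri uri)
    {
        string path = uri.AbsolutePath;
        return path.Substring(0, path.LastIndexOf('/') + 1);
    }

    // Assumed semantics: every flag that is set must hold for the candidate URI.
    public static bool IsAllowed(Uri original, Uri candidate, UriClassificationType restrictTo)
    {
        if ((restrictTo & UriClassificationType.Host) != 0 &&
            !string.Equals(original.Host, candidate.Host, StringComparison.OrdinalIgnoreCase))
        {
            return false;
        }

        if ((restrictTo & UriClassificationType.OriginalDirectory) != 0 &&
            !candidate.AbsolutePath.StartsWith(DirectoryOf(original), StringComparison.OrdinalIgnoreCase))
        {
            return false;
        }

        return true;
    }

    public static void Main()
    {
        // Bitwise OR combines the individual flags into a single value.
        UriClassificationType restrictTo =
            UriClassificationType.Host | UriClassificationType.OriginalDirectory;

        Uri start = new Uri("http://xyz.com/en/homepage.aspx");

        // Same host and same directory as the starting URL.
        Console.WriteLine(IsAllowed(start, new Uri("http://xyz.com/en/about.aspx"), restrictTo));

        // Same host, but under /de-DE/ instead of /en/.
        Console.WriteLine(IsAllowed(start, new Uri("http://xyz.com/de-DE//homepage.aspx"), restrictTo));
    }
}
```

Under these assumptions the /en/ page passes the check and the /de-DE/ page does not. Because the enum is marked [Flags], further restrictions can be added to the same value with additional | operators.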



copyright 2004-2013, arachnode.net LLC