arachnode.net
An open source .NET web crawler written in C# using SQL 2005/2008

Some questions about CrawlRequest

Top 50 Contributor
5 Posts
baby_zrq posted on 26 Nov 2009 6:44 PM

Hello, I've been reading through the AN code, but I'm having some trouble with it. I hope you can help, thanks!

    1. What do these enum items mean?

    public enum UriClassificationType : byte
    {
        None = 0,
        Domain = 1,
        Extension = 2,
        FileExtension = 4,
        Host = 8,
        Scheme = 16
    }

    2. In DiscoveryManager.cs, what do the functions IsCrawlRestricted() and IsDiscoveryRestricted() mean, and what is the difference between them?

        /// <summary>
        /// Determines whether the specified crawl request is restricted.
        /// </summary>
        /// <param name="crawlRequest">The crawl request.</param>
        /// <param name="absoluteUri">The absolute URI.</param>
        /// <returns>
        ///  <c>true</c> if the specified crawl request is restricted; otherwise, <c>false</c>.
        /// </returns>
        internal static bool IsCrawlRestricted(CrawlRequest crawlRequest, string absoluteUri)
        {
        }

        /// <summary>
        /// Determines whether [is discovery restricted] [the specified crawl request].
        /// </summary>
        /// <param name="crawlRequest">The crawl request.</param>
        /// <param name="absoluteUri">The absolute URI.</param>
        /// <returns>
        ///  <c>true</c> if [is discovery restricted] [the specified crawl request]; otherwise, <c>false</c>.
        /// </returns>
        internal static bool IsDiscoveryRestricted(CrawlRequest crawlRequest, string absoluteUri)
        {
        }

    3. If I want to extract only the specific data I need, I just write my code here:

        if (extractWebPageMetaData)
        {
        }

       and put a limit on the priority when adding CrawlRequests to be crawled in CrawlRequestManager.cs (ProcessHyperLinks):

        if (!DiscoveryManager.IsCrawlRestricted(crawlRequest, hyperLinkDiscovery.Uri.AbsoluteUri))
        {
        }

    Is there any other way to do that? If not, I think AN could provide an interface class that users can modify to crawl specific content efficiently. That would be very useful for users.

    Thank you!
    baby_zrq

Answered (Verified) Verified Answer

Top 10 Contributor
1,202 Posts

1.) http://something.arachnode.net/Default.aspx

These classifications are used for restricting a Crawl to a certain Domain, Extension, FileExtension, Host, and/or Scheme.
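
The "and/or" above suggests the values are meant to be combined. A minimal sketch of how that could look, assuming the enum is treated as a set of bit flags (its values are powers of two, but the [Flags] attribute, the Includes helper, and the demo class are assumptions added for illustration and are not taken from the arachnode.net source):

```csharp
using System;

// Local copy of the enum from the question. [Flags] is added here as an assumption.
[Flags]
public enum UriClassificationType : byte
{
    None = 0,
    Domain = 1,
    Extension = 2,
    FileExtension = 4,
    Host = 8,
    Scheme = 16
}

public static class UriClassificationDemo
{
    // Hypothetical helper: true when every bit set in 'required' is also set in 'configured'.
    public static bool Includes(UriClassificationType configured, UriClassificationType required)
    {
        return (configured & required) == required;
    }

    public static void Main()
    {
        // Restrict to the same Host and the same Scheme at once.
        UriClassificationType restrictTo = UriClassificationType.Host | UriClassificationType.Scheme;

        Console.WriteLine(restrictTo);                                          // Host, Scheme
        Console.WriteLine(Includes(restrictTo, UriClassificationType.Host));    // True
        Console.WriteLine(Includes(restrictTo, UriClassificationType.Domain));  // False
    }
}
```

Because each member occupies its own bit, one byte can carry any combination of the five classifications.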

2.) CrawlRestricted means where can the crawl go - can I follow a WebPage to another domain, etc.?  DiscoveryRestricted means can I download images that come from another domain?  (A WebPage may contain images/files from another Domain - is it OK to download those?)
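
To make the distinction concrete, here is a rough sketch that uses only System.Uri. It is not the code in DiscoveryManager.cs; the LeavesHost helper, the two boolean settings, and the example.com URIs are hypothetical and only illustrate that the same host comparison answers two separate questions: one for links the crawl would follow, one for files a page embeds.

```csharp
using System;

public static class RestrictionSketch
{
    // Hypothetical helper: true when the candidate URI is on a different host than the parent page.
    public static bool LeavesHost(Uri parent, Uri candidate)
    {
        return !string.Equals(parent.Host, candidate.Host, StringComparison.OrdinalIgnoreCase);
    }

    public static void Main()
    {
        Uri page = new Uri("http://arachnode.net/forums/");
        Uri link = new Uri("http://example.com/article.html");      // a hyperlink the crawl could follow
        Uri image = new Uri("http://images.example.com/logo.png");  // a file embedded in the page

        // Hypothetical settings: keep the crawl on one host, but allow off-host files.
        bool restrictCrawlsToHost = true;
        bool restrictDiscoveriesToHost = false;

        bool crawlRestricted = restrictCrawlsToHost && LeavesHost(page, link);
        bool discoveryRestricted = restrictDiscoveriesToHost && LeavesHost(page, image);

        Console.WriteLine("Crawl restricted: " + crawlRestricted);          // True: do not follow the link
        Console.WriteLine("Discovery restricted: " + discoveryRestricted);  // False: the image may be downloaded
    }
}
```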

3.) You should DEFINITELY write a plugin and NOT change the core, because as the core changes/improves you will have to modify/merge your code in with mine.  Tell me more about what you want to do for #3.

An open source .NET web crawler written in C# using SQL 2005/2008.

Join the arachnode.net group on Facebook: http://www.facebook.com/groups.php?ref=sb#/group.php?gid=166721755872

Twitter: http://twitter.com/arachnode_net

arachnode.net provides custom crawling and contracting resources.  Please ask.

http://bit.ly/TOFX4

C# crawler, C# web crawler, C# site crawler

All Replies

Top 50 Contributor
5 Posts

Thank you for your reply!

In my project, I want to crawl only specific content, for example:

    On the page http://arachnode.net/forums/, only crawl the web pages that follow these:

   

        Then get the content of those web pages. So if I want to control the crawling behavior, I could DEFINITELY write a plugin, but I would have to call the plugin from the core. Otherwise, it will crawl a lot of data that I don't want. Thank you!

Top 10 Contributor
1,202 Posts

How can I help?

Top 50 Contributor
5 Posts

Hello again,

    Last time you told me: "3.) You should DEFINITELY write a plugin and NOT change the core because as the core changes/improves you will have to modify/merge your code in with mine.  Tell me more about what you want to do for #3."     So I described what I want to do for #3, and I want to ask: if I want to control the crawling behavior, I could DEFINITELY write a plugin, but I would have to call the plugin from the core; otherwise, it will crawl a lot of data that I don't want. Is that right?
    Sorry to trouble you again!

Top 10 Contributor
1,202 Posts

Yes, you should write a RegEx plugin to filter CrawlRequests and Discoveries.

See AbsoluteUri.cs.

Do you know how to call a plugin in the core?  See cfg.CrawlActions and AbsoluteUri.cs.
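
As a rough idea of the filtering logic such a plugin might contain, here is a standalone sketch. It does not use the arachnode.net plugin API (see AbsoluteUri.cs for the real shape); the UriFilterSketch class, the IsAllowed method, the pattern, and the /blogs/ URI are all hypothetical. Only System.Text.RegularExpressions from the .NET framework is used.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UriFilterSketch
{
    // Hypothetical pattern: accept only URIs under http://arachnode.net/forums/.
    private static readonly Regex Allowed = new Regex(
        @"^http://arachnode\.net/forums/",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // True when the absolute URI matches the pattern; null is rejected.
    public static bool IsAllowed(string absoluteUri)
    {
        return absoluteUri != null && Allowed.IsMatch(absoluteUri);
    }

    public static void Main()
    {
        string[] candidates =
        {
            "http://arachnode.net/forums/",  // from the thread
            "http://arachnode.net/blogs/"    // hypothetical URI outside the forums
        };

        foreach (string candidate in candidates)
        {
            // A plugin would mark non-matching CrawlRequests and Discoveries as disallowed
            // instead of printing the decision.
            Console.WriteLine(candidate + " -> " + (IsAllowed(candidate) ? "crawl" : "skip"));
        }
    }
}
```

Keeping the pattern in one place means the same check can be applied to both CrawlRequests and Discoveries without touching the core.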

Thanks!
Mike

copyright 2004-2010, arachnode.net LLC

Powered by Community Server (Non-Commercial Edition), by Telligent Systems