arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE or MongoDB/RavenDB/Hadoop

Completely Open Source @ GitHub


Some questions about CrawlRequest

This post has 1 verified answer | 7 Replies | 3 Followers

Top 75 Contributor
5 Posts
baby_zrq posted on Thu, Nov 26 2009 6:44 PM

Hello, I've really been studying the code of AN, but a few things are giving me trouble. I hope you can help - thanks!

    1. What do these enum items mean?

    public enum UriClassificationType : byte
    {
        None = 0,
        Domain = 1,
        Extension = 2,
        FileExtension = 4,
        Host = 8,
        Scheme = 16
    }
    2. In DiscoveryManager.cs, what is the meaning of, and the difference between, IsCrawlRestricted() and IsDiscoveryRestricted()?

    /// <summary>
    /// Determines whether the specified crawl request is restricted.
    /// </summary>
    /// <param name="crawlRequest">The crawl request.</param>
    /// <param name="absoluteUri">The absolute URI.</param>
    /// <returns>
    ///  <c>true</c> if the specified crawl request is restricted; otherwise, <c>false</c>.
    /// </returns>
    internal static bool IsCrawlRestricted(CrawlRequest crawlRequest, string absoluteUri)
    {
    }

    /// <summary>
    /// Determines whether discovery of the specified absolute URI is restricted for the given crawl request.
    /// </summary>
    /// <param name="crawlRequest">The crawl request.</param>
    /// <param name="absoluteUri">The absolute URI.</param>
    /// <returns>
    ///  <c>true</c> if discovery is restricted; otherwise, <c>false</c>.
    /// </returns>
    internal static bool IsDiscoveryRestricted(CrawlRequest crawlRequest, string absoluteUri)
    {
    }
    3. If I want to extract only the specific data I am interested in, do I just write my code here:

    if (extractWebPageMetaData)
    {
    }

    and then apply a priority limit when adding CrawlRequests to be crawled in CrawlRequestManager.cs (ProcessHyperLinks), e.g.:

    if (!DiscoveryManager.IsCrawlRestricted(crawlRequest, hyperLinkDiscovery.Uri.AbsoluteUri))
    {
    }

    Is there another way to do this? If not, I think AN could provide users with an interface class that they can modify to crawl just the content they want, efficiently. That would be very useful.
          Thank you!
          baby_zrq           

Verified Answer

Top 10 Contributor
1,905 Posts
arachnode.net replied on Fri, Nov 27 2009 9:18 AM
Verified by arachnode.net

1.) http://something.arachnode.net/Default.aspx

These classifications are used for restricting a Crawl to a certain Domain, Extension, FileExtension, Host, and/or Scheme.
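
Because the values are powers of two, the classifications combine bitwise, so a crawl can be restricted on several dimensions at once. A minimal sketch (the variable names here are illustrative, not from the codebase):

    // Stay on the seed's domain AND scheme: e.g., pages under
    // http://*.arachnode.net, but not https:// or other domains.
    UriClassificationType restrictCrawlTo =
        UriClassificationType.Domain | UriClassificationType.Scheme;

    // Test whether a particular restriction is set.
    bool restrictsDomain =
        (restrictCrawlTo & UriClassificationType.Domain) == UriClassificationType.Domain;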

2.) CrawlRestricted governs where the crawl itself can go - e.g., can the crawler follow a WebPage's link to another domain? DiscoveryRestricted governs what a crawled page may pull in - a WebPage may contain images/files hosted on another Domain; is it OK to download those?

3.) You should DEFINITELY write a plugin and NOT change the core, because as the core changes/improves you would have to modify/merge your code in with mine. Tell me more about what you are wanting to do for #3?

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies


Top 75 Contributor
5 Posts

Thank you for your reply!

In my project, I want to crawl only specific content. For example, starting from the page http://arachnode.net/forums/, crawl only the web pages that match these:

Then get the content of those web pages. So if I want to control the crawling action, I could DEFINITELY write a plugin, but I would have to call the plugin from the core; otherwise it will crawl a lot of data that I don't want. Thank you!

Top 10 Contributor
1,905 Posts

How can I help?


Top 75 Contributor
5 Posts

Hello again,

    You told me last time: "3.) You should DEFINITELY write a plugin and NOT change the core because as the core changes/improves you will have to modify/merge your code in with mine. Tell me more about what you are wanting to do for #3?" So I have described what I want to do for #3, and I want to confirm: "if I want to control the crawling action, I could DEFINITELY write a plugin, but I would have to call the plugin from the core; otherwise it will crawl a lot of data that I don't want." Is that right?
    Sorry to trouble you again!

Top 10 Contributor
1,905 Posts

Yes, you should write a RegEx plugin to filter CrawlRequests and Discoveries.

See AbsoluteUri.cs.

Do you know how to call a plugin in the core?  See cfg.CrawlActions and AbsoluteUri.cs.
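
A minimal sketch of such a filtering plugin, modeled loosely on AbsoluteUri.cs. The base class name ACrawlAction, the PerformAction signature, the Discovery.Uri property path, and the IsDisallowed property are assumptions - verify them against your copy of the source:

    using System.Text.RegularExpressions;

    // Hypothetical CrawlAction: disallow any CrawlRequest whose AbsoluteUri
    // does not match a whitelist pattern, so the core is never modified.
    public class ForumThreadFilter : ACrawlAction
    {
        // Allow only forum thread pages, e.g. http://arachnode.net/forums/t/...
        private static readonly Regex _allowed =
            new Regex(@"^http://arachnode\.net/forums/t/", RegexOptions.IgnoreCase);

        public override void PerformAction(CrawlRequest crawlRequest)
        {
            if (!_allowed.IsMatch(crawlRequest.Discovery.Uri.AbsoluteUri))
            {
                // Flag the request so the crawler skips it.
                crawlRequest.IsDisallowed = true;
            }
        }
    }

Registering the plugin in cfg.CrawlActions (the table referenced above) is how the core picks it up without code changes.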

Thanks!
Mike


Top 50 Contributor
14 Posts

Hi Mike

Clarification needed on #2. As I understand it:

Crawl = the actual download; Discovery = just knowledge of a hyperlink/image/file URI (from the same or another domain), obtained after parsing the crawled resource.

So if I want only the text of the web page, no images/files, AND only from the initial CrawlRequest's domain, what should these *Restricted variables look like?

Thanks.

Top 10 Contributor
1,905 Posts

You are correct on the first part.

Restrict to the domain: use UriClassificationType.Domain, and set ExtractFilesAndImages = false.
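
A minimal sketch of what that might look like when submitting a CrawlRequest. The constructor parameter order, the Discovery type, and where ExtractFilesAndImages lives vary by release and are assumptions here - verify against your version:

    // Restrict both the crawl and its discoveries to the seed's Domain,
    // and skip file/image downloads so only page text is stored.
    // Assumed parameters: (discovery, depth, restrictCrawlTo,
    // restrictDiscoveriesTo, priority).
    CrawlRequest crawlRequest = new CrawlRequest(
        new Discovery("http://arachnode.net/"),
        2,                                // maximum crawl depth
        UriClassificationType.Domain,     // where the crawl may go
        UriClassificationType.Domain,     // what may be discovered/downloaded
        1);                               // priority

    // Text only: don't download images/files referenced by the pages.
    // (Setting name from the reply above; its exact location is an assumption.)
    // ApplicationSettings.ExtractFilesAndImages = false;

    _crawler.Crawl(crawlRequest);         // _crawler: an initialized Crawler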



copyright 2004-2017, arachnode.net LLC