arachnode.net

Help with creating a rule

Answered (Verified) This post has 1 verified answer | 5 Replies | 2 Followers

Top 10 Contributor
229 Posts
megetron posted on Thu, Aug 13 2009 2:56 AM

Hello,

I need help with creating a rule.

I have just created a new rule, very similar to the AbsoluteUri.cs rule.

In parallel I have created a new plugin. I want this plugin to be executed only for specific pages whose AbsoluteUris contain this structure: "/viewtopic.php?f={0}&t=".

The rule needs to disallow everything else.

So what I did in the new rule is override only the first IsDisallowed and make sure the AbsoluteUri check is executed.

The problem is that the plugin now executes/fires for only 3 pages, while there are at least 400 pages to which this rule should apply. All the discovery links I need are on pages with this structure: "/viewforum.php?f=".

What I want to achieve here is to tell the crawler to crawl the whole site but allow discovery links only as shown above. What should I do? I don't want to put the filter inside the plugin itself. Is there any other way? I am posting the code here. Please help.

        public override bool IsDisallowed(CrawlRequest crawlRequest, ArachnodeDAO arachnodeDAO)
        {
            return IsDisallowed(crawlRequest, crawlRequest.Discovery.Uri);
        }

        public override bool IsDisallowed(Discovery discovery, ArachnodeDAO arachnodeDAO)
        {
            // NOTE: the Discovery check is currently commented out - all Discoveries are allowed.
            return false; // IsDisallowed(discovery, discovery.Uri);
        }

        protected bool IsDisallowed(ADisallowed aDisallowed, Uri uri)
        {
            bool isDisallowed = false;

            aDisallowed.OutputIsDisallowedReason = OutputIsDisallowedReason;

            #region Disallowed by AbsoluteUri.

            if (uri.AbsolutePath != "/")
            {
                // Disallow by default; allow only viewtopic.php pages for the listed forum categories.
                isDisallowed = true;

                string[] categoryIDs = { "37", "4", "5", "6", "7", "8", "11", "9" };

                for (int i = 0; i < categoryIDs.Length; i++)
                {
                    if (uri.PathAndQuery.Contains(string.Format("/viewtopic.php?f={0}&t=", categoryIDs[i])) && uri.PathAndQuery.Length < 30)
                    {
                        isDisallowed = false;
                        break;
                    }
                }
            }

            if (isDisallowed)
            {
                aDisallowed.IsDisallowedReason = "Disallowed by RKarusela.";

                return true;
            }

            #endregion

            return false;
        }

 


All Replies

Top 10 Contributor
1,905 Posts

The first thing to do when troubleshooting arachnode.net is check the following two tables:

DisallowedAbsoluteUris
Exceptions

My guess is that the other pages are being Disallowed for other reasons.  Place your plug-in first in execution order.  See cfg.CrawlRules.

Also, you'll need to uncomment the commented-out call in the second override so that HyperLink Discoveries are Disallowed too - it's faster than having them Queue and Dequeue from the Cache.
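For reference, with that call uncommented, the second override from the rule you posted would look like this:

        public override bool IsDisallowed(Discovery discovery, ArachnodeDAO arachnodeDAO)
        {
            // Apply the same AbsoluteUri check to HyperLink Discoveries as to CrawlRequests.
            return IsDisallowed(discovery, discovery.Uri);
        }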

- Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
229 Posts
megetron replied on Thu, Aug 13 2009 12:08 PM

Thanks.

After uncommenting the call and changing the rule's order to 1, the website gets crawled fine with this condition:

    (uri.PathAndQuery.Contains(string.Format("/viewtopic.php?f={0}&t=", categoryIDs[i])))

but when I use this condition:

    (uri.PathAndQuery.Contains(string.Format("/viewtopic.php?f={0}&t=", categoryIDs[i])) && uri.PathAndQuery.Length < 30)

the crawl stops after only 4 pages are found, and there are more than 300 pages that fit this filter criteria.

I am looking into DisallowedAbsoluteUris and see no suspicious records.

In the Exceptions table I have 4 errors, and the last of them is:

    Created:      2009-08-13 22:04:11.920
    ID:           5
    AbsoluteUri1: http://seretv.net/
    AbsoluteUri2: http://seretv.net/
    HelpLink:     NULL
    Message:      An item with the same key has already been added.
    Source:       mscorlib
    StackTrace:   at System.ThrowHelper.ThrowArgumentException(ExceptionResource resource)
                  at System.Collections.Generic.Dictionary`2.Insert(TKey key, TValue value, Boolean add)
                  at System.Collections.Generic.Dictionary`2.Add(TKey key, TValue value)
                  at Arachnode.Plugins.CrawlActions.Karusela.PerformAction(CrawlRequest crawlRequest, ArachnodeDAO arachnodeDAO) in E:\DEVELOPMENT\yeshira\Plugins\CrawlActions\Karusela.cs:line 255
                  at Arachnode.SiteCrawler.Managers.ActionManager.PerformCrawlActions(CrawlRequest crawlRequest, CrawlActionType crawlActionType, ArachnodeDAO arachnodeDAO) in E:\DEVELOPMENT\yeshira\SiteCrawler\Managers\ActionManager.cs:line 277

What can be the reason for that?

Top 10 Contributor
1,905 Posts

#1 Most of your AbsoluteUris must be longer than 30 characters.

#2 Can't give you an exact answer without seeing the code.  Looks like you aren't checking for the existence of a key before adding it to your dictionary.
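As a sketch only (the dictionary and key names below are placeholders, not the actual code from Karusela.cs), the usual guard looks like this:

        // requires: using System.Collections.Generic;
        // Sketch: guard against adding the same key twice. The dictionary and key
        // shown here are placeholders - substitute whatever Karusela.PerformAction uses.
        private readonly Dictionary<string, int> _pagesByAbsoluteUri = new Dictionary<string, int>();

        private void AddPage(string absoluteUri)
        {
            if (!_pagesByAbsoluteUri.ContainsKey(absoluteUri))
            {
                _pagesByAbsoluteUri.Add(absoluteUri, 1);
            }
            else
            {
                // Key already present - update instead of Add, which would throw.
                _pagesByAbsoluteUri[absoluteUri]++;
            }
        }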

 


Top 10 Contributor
229 Posts

Hi,

I reset the whole database and downloaded a fresh copy of the code just in case I did something wrong with it, but still only 4 pages were scanned, so I removed the Length < 30 filter and scanned the site.

Then I noticed that the AbsoluteUris saved in the WebPages table look like this:
http://seretv.net/viewtopic.php?f=9&t=464&start=0&sid=cdb6202621a065a4c15a2617b0e548d7

while when I browse the website the same page looks like this:
http://seretv.net/viewtopic.php?f=9&t=464

It is probably a cookie or response header thing, I am not sure. So now I have 6 versions of the same URL, where the AbsoluteUris look like this in the WebPages table:
http://seretv.net/viewtopic.php?f=9&t=464&start=0&sid=cdb6202621a065a4c15a2617b0e548d7
http://seretv.net/viewtopic.php?f=9&t=464&start=0&sid=ff8346465caaa3b694de2a5b7b157ba8
http://seretv.net/viewtopic.php?f=9&t=464&view=next&sid=cdb6202621a065a4c15a2617b0e548d7
http://seretv.net/viewtopic.php?f=9&t=464&view=next&sid=ff8346465caaa3b694de2a5b7b157ba8
http://seretv.net/viewtopic.php?f=9&t=464&view=previous&sid=cdb6202621a065a4c15a2617b0e548d7
http://seretv.net/viewtopic.php?f=9&t=464&view=previous&sid=ff8346465caaa3b694de2a5b7b157ba8

I tried activating the response header rule, but it disallowed the AbsoluteUri http://seretv.net/ and the crawl stopped immediately.

Now I am stuck on that and cannot continue crawling.

Is this by design? Because right now the same page is crawled 6 or 7 times per crawl.

Please advise,

Thank you!

Top 10 Contributor
1,905 Posts
Verified by megetron

Check the actual text of the WebPage.  arachnode.net doesn't modify the AbsoluteUris in any way other than to strip out 'www.'.

If your AbsoluteUris have '&sid=' after them it is because they are found in the WebPage source.

This site crawls just fine with the default configuration.  Run the RESET SP with '1'.

I can check into redirections, but what this site is doing is perfectly legit - and is one of the exact reasons why the CrawlRule AbsoluteUri.cs was created - to allow you to disallow query strings.

From an IM conversation we just had I understand that your rule is the problem, and not arachnode.net.  Please post your rule.

Also, if those links are redirecting, I can finish the code that checks for redirects.

Finally: Your crawling requirement is a perfect case for a rule - because if they don't redirect, then AN has no way of knowing what goes on behind the scenes.  For example, if a site is entirely database driven and all links return the same content, AN has no way of knowing whether it is being served the same page again and again.  Then we get into duplicate page detection - but that's another topic entirely.
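Just as an illustration of the kind of check such a rule can make (a rough sketch, not the shipped AbsoluteUri.cs code - the helper name is made up, and the 'sid' parameter name is taken from the URLs you posted):

        // requires: using System;
        // Rough sketch: disallow AbsoluteUris that carry a phpBB-style session id
        // in the query string, so the same topic isn't crawled once per session.
        protected bool IsDisallowedBySessionId(Uri uri)
        {
            string query = uri.Query;   // e.g. "?f=9&t=464&start=0&sid=cdb6202621a065a4c15a2617b0e548d7"

            return query.StartsWith("?sid=") || query.Contains("&sid=");
        }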

Thanks!
Mike

