arachnode.net

Possible Robots.txt Bug - Release 2.5

Answered (Verified) This post has 1 verified answer | 4 Replies | 2 Followers

Top 50 Contributor
11 Posts
bscott posted on Wed, May 11 2011 7:27 AM

Am I doing something wrong or is this a bug?  What would you recommend I do?

My install is from the release-2.5 tag.

I realized that robots.txt wasn't working in my installation, so I walked through the code.  When it's trying to read the robots.txt file, I found that on line 375 of SiteCrawler > Components/Crawl.cs, it passes _crawlInfo.CurrentCrawlRequest, which is null at this point.  I looked back through the code, and the only place I could find where this gets set is in the ProcessCrawlRequests method, which doesn't seem to apply to robots.txt.
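To illustrate what I mean, here's a rough sketch of the failure mode and the kind of guard that would avoid it (made-up names, not the actual arachnode.net code):

using System;

// Sketch only -- not the arachnode.net source.  The failure mode is a field
// that has not been assigned yet when the robots.txt rule runs, so any
// dereference of it throws NullReferenceException, once per page.
public sealed class CrawlSketch
{
    // Still null before any crawl request has been processed.
    private CrawlRequestSketch _currentCrawlRequest;

    public bool IsDisallowedByRobotsDotText(CrawlRequestSketch requestBeingProcessed)
    {
        // Prefer the request actually in hand over the possibly-unassigned field.
        CrawlRequestSketch request = requestBeingProcessed ?? _currentCrawlRequest;

        if (request == null)
        {
            // Policy decision: treat "unknown" as allowed rather than throwing.
            return false;
        }

        // ... evaluate the host's robots.txt rules against request.AbsoluteUri here ...
        return false;
    }
}

public sealed class CrawlRequestSketch
{
    public Uri AbsoluteUri { get; set; }
}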

I looked in the Exceptions table, and it is logging "Object reference not set to an instance of an object." for every page, which is the error I was finding when I walked through.  Here's the stack trace from the DB:

   at StateSideTools.SS_CrawlerHelper.EvaluateCrawlRequestResult(CrawlRequest completedRequest, WC_CrawlerJob job) in C:\Inetpub\wwwroot\Stateside-WebCrawler-v2.5\Arachnode\StateSideTools\CrawlerHelper.cs:line 21
   at Arachnode.Console.Program.Engine_CrawlRequestCompleted(CrawlRequest sender, EventArgs e) in C:\Inetpub\wwwroot\Stateside-WebCrawler-v2.5\Arachnode\Console\Program.cs:line 421
   at Arachnode.SiteCrawler.Core.Engine.OnCrawlRequestCompleted(CrawlRequest crawlRequest) in C:\Inetpub\wwwroot\Stateside-WebCrawler-v2.5\Arachnode\SiteCrawler\Core\Engine.cs:line 645
   at Arachnode.SiteCrawler.Components.Crawl.ProcessCrawlRequest(CrawlRequest crawlRequest, Boolean obeyCrawlRules, Boolean executeCrawlActions) in C:\Inetpub\wwwroot\Stateside-WebCrawler-v2.5\Arachnode\SiteCrawler\Components\Crawl.cs:line 375
   at Arachnode.SiteCrawler.Rules.RobotsDotText.IsDisallowed(CrawlRequest crawlRequest, ArachnodeDAO arachnodeDAO) in C:\Inetpub\wwwroot\Stateside-WebCrawler-v2.5\Arachnode\SiteCrawler\Rules\RobotsDotText.cs:line 79
   at Arachnode.SiteCrawler.Managers.RuleManager.IsDisallowed(CrawlRequest crawlRequest, CrawlRuleType crawlRuleType, ArachnodeDAO arachnodeDAO) in C:\Inetpub\wwwroot\Stateside-WebCrawler-v2.5\Arachnode\SiteCrawler\Managers\RuleManager.cs:line 195

It also has one seemingly unrelated entry:

URI: NULL

MESSAGE:

"Cannot insert the value NULL into column 'Reason', table 'arachnode.net.dbo.DisallowedAbsoluteUris'; column does not allow nulls. INSERT fails.  Cannot insert the value NULL into column 'DisallowedAbsoluteUriID', table 'arachnode.net.dbo.DisallowedAbsoluteUris_Schemes_Discoveries'; column does not allow nulls. INSERT fails.  Cannot insert the value NULL into column 'DisallowedAbsoluteUriID', table 'arachnode.net.dbo.DisallowedAbsoluteUris_Hosts_Discoveries'; column does not allow nulls. INSERT fails.  The statement has been terminated.  The statement has been terminated.  The statement has been terminated."

STACK TRACE:

   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection)
   at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj)
   at System.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
   at System.Data.SqlClient.SqlDataReader.ConsumeMetaData()
   at System.Data.SqlClient.SqlDataReader.get_MetaData()
   at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)
   at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async)
   at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method, DbAsyncResult result)
   at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method)
   at System.Data.SqlClient.SqlCommand.ExecuteScalar()
   at Arachnode.DataSource.ArachnodeDataSetTableAdapters.QueriesTableAdapter.InsertDisallowedAbsoluteUri(Nullable`1 ContentTypeID, Nullable`1 DiscoveryTypeID, String WebPageAbsoluteUri, String DisallowedAbsoluteUriAbsoluteUri, String Reason, Nullable`1 ClassifyAbsoluteUri, Nullable`1& DisallowedAbsoluteUriID) in C:\Inetpub\wwwroot\Stateside-WebCrawler-v2.5\Arachnode\DataSource\ArachnodeDataSet.Designer.cs:line 14383
   at Arachnode.DataAccess.ArachnodeDAO.InsertDisallowedAbsoluteUri(Int32 contentTypeID, Int32 discoveryTypeID, String webPageAbsoluteUri, String disallowedAbsoluteUriAbsoluteUri, String reason) in C:\Inetpub\wwwroot\Stateside-WebCrawler-v2.5\Arachnode\DataAccess\ArachnodeDAO.cs:line 926


All Replies

Top 10 Contributor
1,905 Posts

Take a look at RobotsDotText.cs from the trunk.  I believe this was fixed in the trunk.

The trunk is stable.  It would be worthwhile to install from the trunk as 2.5 is getting a bit stale in comparison.  I will very likely make a 2.6 tag this weekend.

Thank you for the detailed information.

And, welcome back Stateside.  Big Smile

Thanks!
Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 50 Contributor
11 Posts
bscott replied on Fri, May 13 2011 10:00 AM

Ok, I updated to the latest version of Arachnode from the trunk.  The previously described issue is resolved in this version, as you said.

robots.txt is still not being handled properly, though.  I walked through the code in RobotsDotTextManager.cs and found the problem.

It appears that blank lines are treated as the end of a group.  I don't see that behavior described anywhere in the robots.txt specifications, and it doesn't match how robots.txt files are commonly written in the real world.

For example, nbc.com, which is in your test cases, works fine because it has no blank lines between the User-agent line and its group members.
http://www.nbc.com/robots.txt

However, washingtonpost.com makes liberal use of blank lines to make the file easier for humans to read.  That is appropriate based on the specifications I've read, but arachnode's code ends up handling the group improperly:
http://www.washingtonpost.com/robots.txt

As I understand it, each group begins with one or more User-agent lines followed by group members (Disallow lines, etc.).  A group should be considered started by the first User-agent line and ended only once group members have been found and another User-agent line appears; that User-agent line begins a new group.  There may be better logic for handling this, but blank lines should not factor into the logic at all.  They should simply be ignored.
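To illustrate the logic I'm describing, here's a minimal sketch (my own illustration, not the arachnode.net or trunk implementation) that treats a blank line as whitespace rather than as a group terminator:

using System;
using System.Collections.Generic;

// Sketch only.  A group is opened by one or more User-agent lines, rule
// lines attach to the current group, and a new User-agent line that appears
// after rule lines starts a new group.  Blank lines are simply skipped.
class RobotsGroupSketch
{
    static Dictionary<string, List<string>> ParseDisallows(string robotsTxt)
    {
        var disallowsByAgent = new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);
        var currentAgents = new List<string>();
        bool currentGroupHasRules = false;

        foreach (var rawLine in robotsTxt.Split('\n'))
        {
            // Strip comments and surrounding whitespace.
            var line = rawLine.Split('#')[0].Trim();
            if (line.Length == 0)
                continue;                                   // blank lines are ignored, never group terminators

            int colon = line.IndexOf(':');
            if (colon < 0)
                continue;
            var field = line.Substring(0, colon).Trim();
            var value = line.Substring(colon + 1).Trim();

            if (field.Equals("User-agent", StringComparison.OrdinalIgnoreCase))
            {
                // A User-agent line that follows rule lines begins a brand new group.
                if (currentGroupHasRules)
                {
                    currentAgents.Clear();
                    currentGroupHasRules = false;
                }
                currentAgents.Add(value);
                if (!disallowsByAgent.ContainsKey(value))
                    disallowsByAgent[value] = new List<string>();
            }
            else if (field.Equals("Disallow", StringComparison.OrdinalIgnoreCase))
            {
                currentGroupHasRules = true;
                foreach (var agent in currentAgents)
                    disallowsByAgent[agent].Add(value);
            }
        }

        return disallowsByAgent;
    }

    static void Main()
    {
        // Blank lines between the User-agent line and its rules, as in the
        // washingtonpost.com style described above, do not end the group.
        const string sample =
            "User-agent: *\n" +
            "\n" +
            "Disallow: /private/\n" +
            "\n" +
            "Disallow: /tmp/\n";

        foreach (var kvp in ParseDisallows(sample))
            Console.WriteLine($"{kvp.Key}: {string.Join(", ", kvp.Value)}");    // *: /private/, /tmp/
    }
}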

I can resolve this in my local code, but it should probably be fixed in the trunk as well.  Regardless, I'll be interested in hearing your solution.

Top 50 Contributor
11 Posts
bscott replied on Fri, May 13 2011 1:44 PM

I modified the code to ignore empty lines but still handle groups.  With my modifications, it appears to handle robots.txt files correctly.  I ran into a problem with how it compares the disallowed URLs, though.  It seems that it converts the URLs from the robots.txt file to lowercase, but it does not do the same to the URL it is comparing against.
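The point is just that both sides of the comparison need the same normalization.  A rough sketch (again my own illustration, not the arachnode.net code):

using System;
using System.Collections.Generic;

// Sketch only.  Whatever normalization is applied to the Disallow patterns
// must also be applied to the URL being tested; otherwise mixed-case URLs
// never match the lower-cased patterns.
static class RobotsMatchSketch
{
    public static bool IsDisallowed(string candidatePath, IEnumerable<string> disallowPrefixes)
    {
        string path = candidatePath.ToLowerInvariant();

        foreach (string prefix in disallowPrefixes)
        {
            string p = prefix.Trim().ToLowerInvariant();
            if (p.Length > 0 && path.StartsWith(p, StringComparison.Ordinal))
                return true;                                // candidate falls under a Disallow prefix
        }

        return false;
    }

    static void Main()
    {
        var disallows = new List<string> { "/Private/" };

        // Without lower-casing both sides, "/private/Report.aspx" would not
        // match the "/Private/" pattern even though it should.
        Console.WriteLine(IsDisallowed("/private/Report.aspx", disallows));     // True
    }
}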

I'm signing off for the week.  I'll be back Monday morning.

Top 10 Contributor
1,905 Posts
Verified by arachnode.net

Thanks for the detailed information... I will check it out and get it fixed.

Should be fixed now.  Big Smile

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet
