Am I doing something wrong or is this a bug? What would you recommend I do?
My install is from the release-2.5 tag.
I realized that robots.txt wasn't working in my installation, so I walked through the code. While it was trying to read the robots.txt file, I found that line 375 of SiteCrawler > Components/Crawl.cs passes _crawlInfo.CurrentCrawlRequest, which is null at that point. Looking back through the code, the only place I could find that sets it is the ProcessCrawlRequests method, which doesn't seem to apply to robots.txt.
I looked in the Exceptions table, and it is logging "Object reference not set to an instance of an object." for every page, which is the same error I hit when I walked through the code. Here's the stack trace from the DB:
at StateSideTools.SS_CrawlerHelper.EvaluateCrawlRequestResult(CrawlRequest completedRequest, WC_CrawlerJob job) in C:\Inetpub\wwwroot\Stateside-WebCrawler-v2.5\Arachnode\StateSideTools\CrawlerHelper.cs:line 21
at Arachnode.Console.Program.Engine_CrawlRequestCompleted(CrawlRequest sender, EventArgs e) in C:\Inetpub\wwwroot\Stateside-WebCrawler-v2.5\Arachnode\Console\Program.cs:line 421
at Arachnode.SiteCrawler.Core.Engine.OnCrawlRequestCompleted(CrawlRequest crawlRequest) in C:\Inetpub\wwwroot\Stateside-WebCrawler-v2.5\Arachnode\SiteCrawler\Core\Engine.cs:line 645
at Arachnode.SiteCrawler.Components.Crawl.ProcessCrawlRequest(CrawlRequest crawlRequest, Boolean obeyCrawlRules, Boolean executeCrawlActions) in C:\Inetpub\wwwroot\Stateside-WebCrawler-v2.5\Arachnode\SiteCrawler\Components\Crawl.cs:line 375
at Arachnode.SiteCrawler.Rules.RobotsDotText.IsDisallowed(CrawlRequest crawlRequest, ArachnodeDAO arachnodeDAO) in C:\Inetpub\wwwroot\Stateside-WebCrawler-v2.5\Arachnode\SiteCrawler\Rules\RobotsDotText.cs:line 79
at Arachnode.SiteCrawler.Managers.RuleManager.IsDisallowed(CrawlRequest crawlRequest, CrawlRuleType crawlRuleType, ArachnodeDAO arachnodeDAO) in C:\Inetpub\wwwroot\Stateside-WebCrawler-v2.5\Arachnode\SiteCrawler\Managers\RuleManager.cs:line 195
It also has one seemingly unrelated entry:
"Cannot insert the value NULL into column 'Reason', table 'arachnode.net.dbo.DisallowedAbsoluteUris'; column does not allow nulls. INSERT fails. Cannot insert the value NULL into column 'DisallowedAbsoluteUriID', table 'arachnode.net.dbo.DisallowedAbsoluteUris_Schemes_Discoveries'; column does not allow nulls. INSERT fails. Cannot insert the value NULL into column 'DisallowedAbsoluteUriID', table 'arachnode.net.dbo.DisallowedAbsoluteUris_Hosts_Discoveries'; column does not allow nulls. INSERT fails. The statement has been terminated. The statement has been terminated. The statement has been terminated."
at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection)
at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection)
at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj)
at System.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
at System.Data.SqlClient.SqlDataReader.ConsumeMetaData()
at System.Data.SqlClient.SqlDataReader.get_MetaData()
at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)
at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async)
at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method, DbAsyncResult result)
at System.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method)
at System.Data.SqlClient.SqlCommand.ExecuteScalar()
at Arachnode.DataSource.ArachnodeDataSetTableAdapters.QueriesTableAdapter.InsertDisallowedAbsoluteUri(Nullable`1 ContentTypeID, Nullable`1 DiscoveryTypeID, String WebPageAbsoluteUri, String DisallowedAbsoluteUriAbsoluteUri, String Reason, Nullable`1 ClassifyAbsoluteUri, Nullable`1& DisallowedAbsoluteUriID) in C:\Inetpub\wwwroot\Stateside-WebCrawler-v2.5\Arachnode\DataSource\ArachnodeDataSet.Designer.cs:line 14383
at Arachnode.DataAccess.ArachnodeDAO.InsertDisallowedAbsoluteUri(Int32 contentTypeID, Int32 discoveryTypeID, String webPageAbsoluteUri, String disallowedAbsoluteUriAbsoluteUri, String reason) in C:\Inetpub\wwwroot\Stateside-WebCrawler-v2.5\Arachnode\DataAccess\ArachnodeDAO.cs:line 926
Thanks for the detailed information... I will check it out and get it fixed.
Should be fixed now.
Take a look at RobotsDotText.cs from the trunk. I believe this was fixed in the trunk.
The trunk is stable. It would be worthwhile to install from the trunk as 2.5 is getting a bit stale in comparison. I will very likely make a 2.6 tag this weekend.
Thank you for the detailed information.
And, welcome back Stateside.
Ok, I updated to the latest version of Arachnode from the trunk. The previously described issue is resolved in this version, as you said.

robots.txt is still not being handled properly, though. I walked through the code in RobotsDotTextManager.cs and discovered the cause: blank lines are treated as the end of a group. I don't see that behavior anywhere in the specifications for robots.txt files, and it doesn't match how the files are commonly written in the real world.

For example, nbc.com, which is in your test cases, works fine because it has no blank lines between its user-agent line and the group members. However, washingtonpost.com makes liberal use of blank lines to make the file easier for humans to read. That is appropriate based on the specifications I've read, but Arachnode's code mishandles the groups as a result.

As I understand it, each group begins with one or more user-agent lines followed by group members (disallows, etc.). A group should be considered started by the first user-agent line and ended once group members have been found followed by another user-agent line; that user-agent line begins a new group. There may be better logic for handling this, but blank lines should not factor into it at all; they should simply be ignored.
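To make the grouping rule concrete, here is a minimal sketch (in Python, not the actual Arachnode/RobotsDotTextManager code) of a parser that follows it. The function name and the (agents, rules) representation are my own illustration, not anything from the project:

```python
# Hypothetical sketch: parse robots.txt into user-agent groups, ignoring
# blank lines entirely. A new group starts only when a User-agent line
# appears after one or more group-member lines.

def parse_robots(text):
    """Return a list of (agents, rules) tuples."""
    groups = []
    agents, rules = [], []
    for raw in text.splitlines():
        line = raw.split('#', 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue  # blank lines never terminate a group
        field, _, value = line.partition(':')
        field, value = field.strip().lower(), value.strip()
        if field == 'user-agent':
            if rules:  # a User-agent after group members begins a new group
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field in ('allow', 'disallow'):
            rules.append((field, value))
    if agents or rules:
        groups.append((agents, rules))
    return groups
```

With this approach, a file in the washingtonpost.com style, with blank lines between the user-agent line and each disallow, still parses as a single group.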
I can probably handle resolving this in my local code, but it should be resolved in the trunk as well. Regardless, I'll be interested in hearing your solution.
I modified the code to ignore empty lines while still handling groups, and after my modifications it appears to process robots.txt files correctly. I ran into a problem with how it compares the disallowed URLs, though: it converts the URLs from the robots.txt file to lowercase, but it does not do the same to the URL it compares them against.
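Whatever normalization is chosen, the point is that it has to be applied to both sides of the comparison. Here is a minimal sketch (again Python, not the Arachnode code; the function name and rule format are my own) of a consistent check. Note that strictly speaking robots.txt paths are case-sensitive, so lowercasing neither side is arguably more correct, but lowercasing both at least makes the comparison self-consistent:

```python
# Hypothetical sketch: apply the SAME case normalization to the request
# path and to the Disallow rules before comparing, so a rule like
# "/private/" also matches "/Private/page.html" under lowercase folding.
from urllib.parse import urlparse

def is_disallowed(url, disallow_rules):
    path = urlparse(url).path or '/'
    path = path.lower()  # same normalization the rules receive below
    return any(path.startswith(rule.lower()) for rule in disallow_rules)
```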
I'm signing off for the week. I'll be back in Monday morning.