arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE and MongoDB/RavenDB/Hadoop

Can't turn off robots.txt

Answered (Verified): 2 verified answers | 3 Replies | 2 Followers

David Rodecker posted on Sun, Sep 26 2010 1:59 PM

I am using version 2.5.3916.23112, and I have set the RobotsDotText rule to 0 (not enabled), but after the crawl my DisallowedAbsoluteUris table still says "disallowed by robots.txt". I also tried setting it to 1, and that doesn't work either, so I am unable to turn it off whether or not it is enabled in the database. Is this a bug, or do I need to turn off robots.txt somewhere other than "[arachnode.net].[cfg].[CrawlRules]"?

Thanks in advance
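
Roughly, the change described above would be something like the following T-SQL. The table name comes from the post; the IsEnabled column and the RobotsDotText type name are assumptions, since the exact cfg.CrawlRules schema isn't quoted in this thread:

    -- Disable the robots.txt crawl rule.
    -- Table name is from the post above; the IsEnabled column and the
    -- 'RobotsDotText' type-name filter are assumptions about the schema.
    UPDATE [arachnode.net].[cfg].[CrawlRules]
    SET IsEnabled = 0
    WHERE TypeName LIKE '%RobotsDotText%';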

Verified Answer

Verified by arachnode.net

You still have entries for those AbsoluteUris in your DisallowedAbsoluteUris table.
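
A minimal sketch of that check and cleanup, assuming the table sits in the default dbo schema (only the table name is given in this thread):

    -- Inspect what the crawler previously recorded (dbo schema assumed)
    SELECT TOP 10 *
    FROM [arachnode.net].[dbo].[DisallowedAbsoluteUris];

    -- Remove the stale rows so those AbsoluteUris can be crawled again
    DELETE FROM [arachnode.net].[dbo].[DisallowedAbsoluteUris];

With the table cleared and the rule disabled, a recrawl should no longer mark those URIs as disallowed by robots.txt.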

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Verified by arachnode.net

First, let's verify that the Crawler.cs code thinks your robots.txt rule is disabled.

If it isn't, it is either enabled in code:

...or in the DB.
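
As a sketch, the DB-side check might be the following; the TypeName and IsEnabled column names are assumptions:

    -- Check the rule's stored state (column names assumed)
    SELECT TypeName, IsEnabled
    FROM [arachnode.net].[cfg].[CrawlRules]
    WHERE TypeName LIKE '%RobotsDotText%';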

If nothing looks strange, see if you can get your rules output to look like mine and give me a starting CrawlRequest so I can test, please.

Thanks!
Mike


