arachnode.net
An Open Source C# web crawler with Lucene.NET search, using SQL Server 2008/2012/2014/2016/CE or MongoDB/RavenDB/Hadoop for storage

Completely Open Source @ GitHub


Prohibited by Robots.txt

Answered (Verified) | 1 Reply | 2 Followers

w3hunter (Top 100 Contributor, 4 Posts) posted on Tue, Sep 1 2009 1:24 PM

Hi,

The project is working fine. I am trying to crawl a few sites.

One site I am not able to crawl: the crawler reports "Prohibited by robots.txt".

The URIs show up in the DisallowedAbsoluteUris table with that reason.

Is there any way to overcome this problem?

I know this may be a silly question, but I want to confirm whether there is a way around it.

And if that is not possible, is there a way to get all the URLs for a particular domain from another site?

For example, I want to crawl http://abc.com but am blocked by its robots.txt.

Another domain, http://test.com, links to http://abc.com, and I can crawl http://test.com.

However, http://test.com also contains other links that I don't need.
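The fallback described above — crawling http://test.com but keeping only the links that point at http://abc.com — can be sketched with a small host filter. This is a hedged illustration, not arachnode.net code; the helper names are mine:

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def links_to_domain(html, domain):
    """Return only the absolute links on a page that point at `domain`."""
    parser = LinkExtractor()
    parser.feed(html)
    kept = []
    for url in parser.links:
        host = urlparse(url).netloc.lower()
        # Accept both the bare domain and its www. variant.
        if host == domain or host == "www." + domain:
            kept.append(url)
    return kept
```

Fed the HTML of a page on http://test.com, this returns only the outbound links to the target domain, discarding the rest.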

Please help me out.

Answered (Verified) Verified Answer

Top 10 Contributor | 1,905 Posts | Verified by w3hunter

Check the cfg.CrawlRules table and disable the robotsdottext rule.
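A hedged sketch of that change — the table name (cfg.CrawlRules) and rule name come from the answer above, but the column names below are assumptions; check the actual schema of cfg.CrawlRules in your installation before running anything:

```sql
-- Disable the robots.txt rule so crawled URIs are no longer rejected by it.
-- 'IsDisabled' and 'TypeName' are assumed column names; adjust them to
-- match the real cfg.CrawlRules schema in your database.
UPDATE cfg.CrawlRules
SET IsDisabled = 1
WHERE TypeName LIKE '%RobotsDotText%';
```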

-Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet


copyright 2004-2017, arachnode.net LLC