arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE or MongoDB/RavenDB/Hadoop

Invalid URI

Not Answered | 6 Replies | 2 Followers

Top 10 Contributor
229 Posts
megetron posted on Sun, Apr 11 2010 11:01 AM

Hi,

I was trying to crawl the URL "http://www.".

Don't ask why; it's a long story. I received an exception, and I wonder whether there is a chance that AN will pick up such URIs from the net. If it does, then this is a bug; if not, you can ignore this.
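
For reference, a minimal sketch (illustrative only, not AN code) of the kind of guard that rejects "http://www." before it ever becomes a CrawlRequest. Whether Uri.TryCreate even accepts "http://www." varies by runtime, so the host check below rejects it either way:

    using System;

    class UriGuard
    {
        // Illustrative guard: returns true only for URIs whose host has a
        // registrable domain. "www." trims to "www", which has no interior
        // dot, so the reported offender is rejected however Uri parses it.
        // (Note this would also reject dotless hosts like "localhost".)
        public static bool IsCrawlable(string absoluteUri)
        {
            Uri uri;
            if (!Uri.TryCreate(absoluteUri, UriKind.Absolute, out uri))
            {
                return false;
            }

            string host = uri.Host.TrimEnd('.');

            return host.Contains(".");
        }

        static void Main()
        {
            Console.WriteLine(IsCrawlable("http://www."));           // False
            Console.WriteLine(IsCrawlable("http://arachnode.net/")); // True
        }
    }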

 

All Replies

Top 10 Contributor
1,905 Posts

Hmm... odd that you get the exception there...

I get it at the console. Did you feed the crawler the invalid Uri from the DB?

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table (a query sketch follows below).
  3. Include screenshots.

Skype: arachnodedotnet
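
A quick way to pull those rows (a hedged sketch: the local server, the default "arachnode.net" database name, and the dbo.Exceptions table are assumptions from the 4.x defaults, so adjust to your install):

    using System;
    using System.Data.SqlClient;

    class CheckExceptions
    {
        static void Main()
        {
            // Assumes a local default instance and the default database
            // name -- adjust both to match your install.
            const string connectionString =
                "Server=.;Database=arachnode.net;Integrated Security=SSPI;";

            using (var connection = new SqlConnection(connectionString))
            // ORDER BY ordinal avoids assuming column names in your schema.
            using (var command = new SqlCommand(
                "SELECT TOP 10 * FROM dbo.Exceptions ORDER BY 1 DESC",
                connection))
            {
                connection.Open();

                using (var reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        for (int i = 0; i < reader.FieldCount; i++)
                        {
                            Console.Write(reader[i] + "\t");
                        }

                        Console.WriteLine();
                    }
                }
            }
        }
    }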

Top 10 Contributor
229 Posts
megetron replied on Sun, Apr 11 2010 11:06 PM

Yes, and added this as a new CrawlRequest.  :\

Top 10 Contributor
1,905 Posts

Hmm... I did, but for me the exception is handled properly.


Top 10 Contributor
229 Posts

Maybe this has something to do with the fact that I am running the release version (4) in debug mode?

If it happens here, it must happen on other machines. :)

Top 10 Contributor
1,905 Posts

Looking at your first screenshot a bit closer, I see the error occurring in the internal overload, which means your exception IS coming from an AbsoluteUri in the CrawlRequests table.  There ARE CHECK CONSTRAINTS on the table, but your condition isn't caught.  I think this is OK, though.  I encourage users NOT to use the CrawlRequests table directly, as this table is really for the Cache to use for storage, OR for CrawlRequests submitted through the ArachnodeDAO.

AN will catch the error and report it to the user if this AbsoluteUri is supplied through code, but apparently won't if you directly manipulate the CrawlRequests table... hmm...
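
For anyone following along, the difference between the two paths looks roughly like this (a sketch from memory against the 4.x API: the Crawler, CrawlMode, CrawlRequest, and Discovery names and signatures are assumptions and differ between AN versions):

    using System;
    using Arachnode.SiteCrawler;       // namespace names are assumptions
    using Arachnode.SiteCrawler.Value; // based on the 4.x source layout

    class SubmitThroughCode
    {
        static void Main()
        {
            try
            {
                // Submitted through code, the AbsoluteUri is validated up
                // front and a bad value surfaces here, where it can be
                // reported to the user...
                Crawler crawler =
                    new Crawler(CrawlMode.BreadthFirstByPriority, false);

                crawler.Crawl(new CrawlRequest(new Discovery("http://www."), 1));
            }
            catch (UriFormatException uriFormatException)
            {
                // ...whereas a row inserted directly into the CrawlRequests
                // table bypasses this validation and only fails once the
                // engine dequeues it.
                Console.WriteLine(uriFormatException.Message);
            }
        }
    }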

It's probably OK to do nothing on this one.

Veto?  Override?  Do you think I should modify the DB CHECK CONSTRAINTS to disallow this?
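
If we did tighten it, one option would be something like the following (a sketch only: the constraint name, the AbsoluteUri1 column, and the database name are assumptions against the 4.x schema, and the pattern would also reject dotless hosts such as http://localhost):

    using System.Data.SqlClient;

    class TightenConstraint
    {
        static void Main()
        {
            // The LIKE pattern requires at least one character after a dot
            // in the authority, so 'http://www.' is rejected while
            // 'http://arachnode.net' passes. Verify the column and
            // constraint names against your schema before running.
            const string sql = @"
                ALTER TABLE dbo.CrawlRequests WITH CHECK
                ADD CONSTRAINT CK_CrawlRequests_AbsoluteUri1_HasDomain
                CHECK (AbsoluteUri1 LIKE '_%://_%._%');";

            using (var connection = new SqlConnection(
                "Server=.;Database=arachnode.net;Integrated Security=SSPI;"))
            using (var command = new SqlCommand(sql, connection))
            {
                connection.Open();
                command.ExecuteNonQuery();
            }
        }
    }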

- Mike


Top 10 Contributor
229 Posts

This is an old one. :)

But as far as I can recall... I didn't populate the URI in the CrawlRequests table manually. I used the code to crawl this URI... :\
