arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop

Exception: Expected a File or an Image but discovered a WebPage.

Top 10 Contributor
229 Posts
megetron posted on Mon, Aug 17 2009 1:31 PM

Hello,

Why does it expect a file/image when this is a web page? I get this error far too often...

AbsoluteUri1: http://10net.co.il/108752/%D7%A6%D7%A4%D7%99%D7%99%D7%94-%D7%99%D7%A9%D7%99%D7%A8%D7%94-%D7%91%D7%A1%D7%A8%D7%98%D7%99-%D7%A7%D7%95%D7%9E%D7%93%D7%99%D7%94

AbsoluteUri2: http://10net.co.il/site/detail/detail/detailDetail.asp?detail_id=1286833&iPageNumCat0=2&seaWordCat=

HelpLink: NULL

Message: Expected a File or an Image but discovered a WebPage.

Source: Arachnode.SiteCrawler

StackTrace: at Arachnode.SiteCrawler.Components.Crawl.ProcessCrawlRequest(CrawlRequest crawlRequest, Boolean obeyCrawlRules, Boolean executeCrawlActions)

Answered (Verified) Verified Answer

Top 10 Contributor
1,905 Posts
Verified by megetron

This exception has been completely removed from the upcoming Version 1.3 release.

This basically was an attempt at catching WebPages that had, say, valid image tags returning scripts...

The new 'DataManager' and the new PreGet CrawlRule type and the DataType.cs CrawlRule have fixed this annoyance.  :)
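
The kind of check described above can be sketched roughly as follows. This is a minimal illustration in Python, not arachnode.net's DataManager or DataType.cs code; the function names, the three labels, and the mapping from Content-Type to a label are all assumptions made for the example.

```python
def classify_content_type(content_type):
    """Map a Content-Type header value to 'WebPage', 'Image', or 'File'.

    The labels and the mapping are assumptions for this sketch.
    """
    # Drop parameters such as '; charset=utf-8' and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in ("text/html", "application/xhtml+xml"):
        return "WebPage"
    if media_type.startswith("image/"):
        return "Image"
    return "File"


def matches_expectation(expected, content_type):
    """Return True when the response type is what the linking tag implied.

    expected is the label inferred from the tag that referenced the URL,
    for example 'Image' for an <img src="..."> attribute.
    """
    return classify_content_type(content_type) == expected
```

For an image tag whose source actually returns HTML, `matches_expectation("Image", "text/html; charset=utf-8")` is False, which is the situation the old exception was reporting.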

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies

Top 10 Contributor
1,905 Posts

There are a good number of sites that list HyperLinks that should be images but return a WebPage instead.

Also, this classification isn't 100% accurate.  Needs a bit of work.

I found the bug in this piece of code.

This should be a fun one to fix properly.  :)

Thanks for all of your testing!

-Mike

Top 10 Contributor
Male
101 Posts
Kevin replied on Fri, Sep 4 2009 2:30 PM

I seem to get this error consistently when hitting http://nytimes.com/

What's a good way to see exactly what's coming back from that URL, so I can research why it's expecting a file or image?

I almost wonder whether the site's default page is doing something tricky.

I know I can debug it, but what's a good tool that shows everything coming back?  Maybe Firebug?

Thx
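
One way to see what a URL returns, independent of any browser add-on, is to request it from a small script and print the status, the final URL after redirects, the headers, and the first bytes of the body. The following is a minimal sketch in Python using only the standard library; the helper names are made up for this example, and the User-Agent value is an assumption (some sites respond differently depending on it).

```python
import urllib.request


def summarize_response(status, final_url, headers, body, preview_bytes=200):
    """Build a readable summary of an HTTP response.

    headers is a list of (name, value) pairs; body is bytes.
    """
    lines = [f"Status: {status}", f"Final URL: {final_url}"]
    lines += [f"{name}: {value}" for name, value in headers]
    lines.append(f"Body starts with: {body[:preview_bytes]!r}")
    return "\n".join(lines)


def fetch_summary(url, user_agent="Mozilla/5.0 (example)"):
    """Request url (redirects are followed) and summarize what came back.

    4xx/5xx responses raise urllib.error.HTTPError and are not handled here.
    """
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=30) as response:
        return summarize_response(
            response.status,
            response.geturl(),
            list(response.headers.items()),
            response.read(),
        )
```

Calling `print(fetch_summary("http://nytimes.com/"))` would show the Content-Type the server reports and whether the request was redirected, which are the two things most likely to explain a type mismatch.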

 

Top 10 Contributor
229 Posts

Glad to hear this. The solution sounds good. This exception floods the Exceptions table to the point that you have to filter these errors out whenever you query it.

Thank you for the fix.
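
The filtering mentioned above amounts to excluding rows by their Message value. Here is a rough sketch in Python against an in-memory SQLite table; the real Exceptions table lives in SQL Server, so the schema below is an assumption that reuses only column names shown in the first post, the function names are hypothetical, and the second sample row is invented demo data.

```python
import sqlite3

NOISY_MESSAGE = "Expected a File or an Image but discovered a WebPage."


def other_exceptions(connection, noisy_message=NOISY_MESSAGE):
    """Return (AbsoluteUri1, Message) rows, skipping the noisy message."""
    cursor = connection.execute(
        "SELECT AbsoluteUri1, Message FROM Exceptions WHERE Message <> ?",
        (noisy_message,),
    )
    return cursor.fetchall()


def build_demo_table():
    """Create an in-memory stand-in for the Exceptions table (assumed schema)."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE Exceptions (AbsoluteUri1 TEXT, Message TEXT)"
    )
    connection.executemany(
        "INSERT INTO Exceptions VALUES (?, ?)",
        [
            ("http://example.com/a", NOISY_MESSAGE),
            # Invented sample row, only here to show what survives the filter.
            ("http://example.com/b", "The operation has timed out."),
        ],
    )
    return connection
```

The same `WHERE Message <> ...` condition should carry over to a query run directly against the SQL Server database.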

copyright 2004-2017, arachnode.net LLC