arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL 2005/2008/CE

Encoding issue

Answered (Verified) This post has 1 verified answer | 15 Replies | 2 Followers

Top 10 Contributor
229 Posts
megetron posted on Mon, Aug 3 2009 12:22 AM

I am crawling a website with Hebrew text, and the pages saved in the web pages folder come out as gibberish:

 

Why is that?

All Replies

Top 10 Contributor
1,692 Posts

And when you view the webpage you see the characters?

What is the AbsoluteUri of the WebPage?

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
229 Posts

Yes, I see the characters. http://www.XXX.tv/

It does not really matter which site I crawl: whenever the characters are Hebrew, the crawled content loses its character encoding.

Top 10 Contributor
1,692 Posts

OK.  I'll check it out - should be an easy fix.

Top 10 Contributor
229 Posts

arachnode.net:

OK.  I'll check it out - should be an easy fix.

When you do, please post. I will test it and report back.

 

Top 10 Contributor
229 Posts

Hi Mike, I created a fix for this issue. Can you please update?

In CrawlRequestManager, change the first line to the second:

//crawlRequest.DecodedHtml = HttpUtility.HtmlDecode(Encoding.UTF8.GetString(crawlRequest.Data));
crawlRequest.DecodedHtml = HttpUtility.HtmlDecode(Encoding.Default.GetString(crawlRequest.Data));

Thanks.

Top 10 Contributor
1,692 Posts
Verified by megetron

I think you are right - I'll check it out tonight and update the trunk.

Thanks megetron.

-Mike

OK - fixed and checked in.

The problem was with how the file was being saved.  I was using a FileStream but needed to use a StreamWriter and specify UTF8 as the encoding.

if (saveWebPageToDisk)
{
    managedWebPage.DiscoveryPath = DiscoveryManager.GetDiscoveryPath(ApplicationSettings.DownloadedWebPagesDirectory, absoluteUri);

    managedWebPage.StreamWriter = new StreamWriter(managedWebPage.DiscoveryPath, false, Encoding.UTF8);

    managedWebPage.StreamWriter.Write(source2);
}
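
For anyone who wants to try the idea outside the crawler, here is a minimal standalone sketch of the same technique: writing text through a StreamWriter with UTF-8 specified and reading it back. The class and method names (Utf8SaveSketch, Save, Load) are made up for illustration and are not part of arachnode.net.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper, not arachnode.net code.
public static class Utf8SaveSketch
{
    // Writes text to a file as UTF-8, using the same
    // StreamWriter(path, append, encoding) overload as the fix above.
    public static void Save(string path, string text)
    {
        // 'false' means overwrite the file rather than append to it.
        using (var writer = new StreamWriter(path, false, Encoding.UTF8))
        {
            writer.Write(text);
        }
    }

    // Reads the file back, decoding it as UTF-8.
    public static string Load(string path)
    {
        return File.ReadAllText(path, Encoding.UTF8);
    }
}
```

Saving a Hebrew string with Save and reading it back with Load should return the same string, because the text is encoded and decoded with the same encoding.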

Top 10 Contributor
229 Posts

Thank you Mike.

I tested it on several sites and it is the correct solution. I will let you know if I have any encoding issues.

Top 10 Contributor
1,692 Posts

Great!  Thanks for testing!

-Mike

Top 10 Contributor
229 Posts

The fix saves the data to the folder correctly.

But you still have to add the piece of code I sent to the trunk, because when you debug inside the plugin, crawlRequest.DecodedHtml still contains the incorrect characters.

The Default value in the code I sent seems to pick up the charset of the page and use it, but of course if you can find a fix that still keeps UTF-8, that would be better.
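
To illustrate why the encoding passed to GetString matters, here is a small sketch that decodes the same bytes with two different encodings. It is not arachnode.net code, and the names (MismatchSketch, RoundTrip) are invented for the example. As general .NET background rather than anything established in this thread: on the .NET Framework, Encoding.Default is the machine's ANSI code page, so a fix based on it may behave differently from one machine to another.

```csharp
using System.Text;

// Hypothetical helper, not arachnode.net code.
public static class MismatchSketch
{
    // Encodes text as UTF-8 bytes, then decodes those bytes with the
    // given encoding. If the decoding encoding does not match the one
    // used to produce the bytes, the result is garbled.
    public static string RoundTrip(string text, Encoding decodeWith)
    {
        byte[] data = Encoding.UTF8.GetBytes(text);
        return decodeWith.GetString(data);
    }
}
```

Calling RoundTrip("שלום", Encoding.UTF8) gives back the original word, while RoundTrip("שלום", Encoding.GetEncoding("iso-8859-1")) gives eight unrelated characters, one per UTF-8 byte.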

Top 10 Contributor
1,692 Posts

Really?  I need to change UTF8 to Default for it to work for you?

Is this the location?

Top 10 Contributor
229 Posts

Yes. That's the location.

The image you just posted shows that the encoding is incorrect. The characters are all upside down.

Top 10 Contributor
1,692 Posts

megetron:

Yes. That's the location.

The image you just posted shows that the encoding is incorrect. The characters are all upside down.

So, when I change the encoding to 'Default' this is what I get...

Top 10 Contributor
1,692 Posts

megetron:

Yes. That's the location.

The image you just posted shows that the encoding is incorrect. The characters are all upside down.

The fonts are a little small for me to read, but the HTML view in the QuickWatch window shows a WebPage that looks just like the original WebPage as viewed through IE7.

Top 10 Contributor
1,692 Posts

I checked in some code which is a HACK while I research a proper solution to detect encoding.  Let me know what you find.
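
As one possible direction for detecting the encoding, here is a rough sketch that picks the Encoding from the charset parameter of a Content-Type header and falls back to UTF-8. This is an assumption about how detection could work, not the code that was checked in. The names (CharsetSketch, FromContentType, Decode) are hypothetical, and a fuller solution would probably also look at the charset declared in the page's meta tag.

```csharp
using System;
using System.Net.Mime;
using System.Text;

// Hypothetical helper, not arachnode.net code.
public static class CharsetSketch
{
    // Picks an Encoding from a Content-Type header value such as
    // "text/html; charset=utf-8". Falls back to UTF-8 when the header is
    // missing, malformed, has no charset, or names a charset that this
    // runtime does not support.
    public static Encoding FromContentType(string contentType)
    {
        if (string.IsNullOrWhiteSpace(contentType))
        {
            return Encoding.UTF8;
        }

        try
        {
            string charset = new ContentType(contentType).CharSet;
            return string.IsNullOrEmpty(charset)
                ? Encoding.UTF8
                : Encoding.GetEncoding(charset);
        }
        catch (FormatException)
        {
            return Encoding.UTF8; // header could not be parsed
        }
        catch (ArgumentException)
        {
            return Encoding.UTF8; // unknown or unsupported charset name
        }
    }

    // Decodes the downloaded bytes using the encoding chosen above.
    public static string Decode(byte[] data, string contentType)
    {
        return FromContentType(contentType).GetString(data);
    }
}
```

The UTF-8 fallback is a design choice for the sketch only. Legacy code pages such as windows-1255 may need extra setup on newer .NET runtimes, in which case this sketch would quietly fall back to UTF-8.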


copyright 2004-2013, arachnode.net LLC