I am crawling a website with Hebrew text, and the pages saved in the web pages folder come out as gibberish characters.
Why is that?
I think you are right - I'll check it out tonight and update the trunk.
Thanks megetron.
-Mike
OK - fixed and checked in.
The problem was with how the file was being saved. I was using a FileStream but needed to use a StreamWriter and specify UTF8 as the encoding.
if (saveWebPageToDisk)
{
    managedWebPage.DiscoveryPath = DiscoveryManager.GetDiscoveryPath(ApplicationSettings.DownloadedWebPagesDirectory, absoluteUri);
    managedWebPage.StreamWriter = new StreamWriter(managedWebPage.DiscoveryPath, false, Encoding.UTF8);
    managedWebPage.StreamWriter.Write(source2);
}
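For anyone who wants to try the idea outside the crawler, here is a minimal, self-contained sketch of the same approach: write the text with a StreamWriter that is told to use UTF-8, then read it back with the same encoding. The class name, method name, and temp file name are made up for illustration and are not part of arachnode.net.

```csharp
using System;
using System.IO;
using System.Text;

public static class Utf8SaveSketch
{
    // Hypothetical helper: writes the text to disk as UTF-8 and
    // reads it back using the same encoding.
    public static string SaveAndReload(string path, string html)
    {
        // 'false' means overwrite rather than append.
        using (var writer = new StreamWriter(path, false, Encoding.UTF8))
        {
            writer.Write(html);
        }

        return File.ReadAllText(path, Encoding.UTF8);
    }

    public static void Main()
    {
        string path = Path.Combine(Path.GetTempPath(), "hebrew-sample.html");
        string html = "<p>שלום</p>";

        // The reloaded text should match what was written.
        Console.WriteLine(SaveAndReload(path, html) == html);
    }
}
```

The point is only that the encoding is named explicitly when writing and again when reading; it says nothing about how the bytes were decoded before they reached this step.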
And when you view the webpage you see the characters?
What is the AbsoluteUri of the WebPage?
Yes. I see the characters. http://www.XXX.tv/
It doesn't really matter which site I crawl: whenever the characters are Hebrew, the crawled content loses its character encoding.
OK. I'll check it out - should be an easy fix.
When you do, please post. I will test it and give feedback.
Hi Mike, I created a fix for this issue. Can you please update?
Change from the first line to the second line in CrawlRequestManager:
//crawlRequest.DecodedHtml = HttpUtility.HtmlDecode(Encoding.UTF8.GetString(crawlRequest.Data));
crawlRequest.DecodedHtml = HttpUtility.HtmlDecode(Encoding.Default.GetString(crawlRequest.Data));
Thanks.
Thank you Mike.
I tested it on several sites and it is the correct solution. I will let you know if I have any encoding issues.
Great! Thanks for testing!
The fix saves the data to the folder correctly.
But you still have to add the piece of code I sent to the trunk, because when you debug inside the plugin, crawlRequest.DecodedHtml still contains the incorrect characters.
The Default value in the code I sent figures out the charset of the page and saves it, but of course if you can find the place in the code that still applies the UTF encoding, that would be better.
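A note on why switching the encoding changes the result: on the .NET Framework, Encoding.Default is the operating system's ANSI code page, so it likely works here only when the page happens to be served in that same code page; it does not inspect the page. The sketch below shows the general effect of decoding bytes with the wrong encoding. It uses UTF-8 and ISO-8859-1 purely as stand-ins, and the class and method names are made up for illustration.

```csharp
using System;
using System.Text;

public static class DecodeMismatchSketch
{
    // Hypothetical helper: decodes raw bytes with the given encoding.
    public static string Decode(byte[] data, Encoding encoding)
    {
        return encoding.GetString(data);
    }

    public static void Main()
    {
        string original = "שלום";

        // Pretend these are the bytes the server sent (UTF-8 here).
        byte[] data = Encoding.UTF8.GetBytes(original);

        // Stand-in for "some other encoding than the one the page uses".
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        // Matching encoding: the text survives.
        Console.WriteLine(Decode(data, Encoding.UTF8) == original);

        // Mismatched encoding: the text comes back as different characters.
        Console.WriteLine(Decode(data, latin1) == original);
    }
}
```

The same mismatch happens in the other direction, which would explain why UTF8 breaks pages served in a Hebrew code page while Default appears to fix them on a machine configured for Hebrew.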
Really? I need to change UTF8 to Default for it to work for you?
Is this the location?
Yes. That's the location.
The image you just posted shows incorrect encoding; the characters are all upside down.
So, when I change the encoding to 'Default' this is what I get...
The fonts are a little small for me to view, but looking at the HTML view through the QuickWatch window returns a WebPage that looks just like the original WebPage as viewed through IE7.
I checked in some code which is a HACK while I research a proper solution to detect encoding. Let me know what you find.
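One possible direction for detecting the encoding, offered only as a sketch and not as what was checked in: read the charset from the response's Content-Type header and fall back to a fixed encoding when it is missing or unknown. The class name, method name, and fallback choice are assumptions for illustration. Pages that declare their charset only in a meta tag would need extra handling that is not shown here.

```csharp
using System;
using System.Text;

public static class CharsetSketch
{
    // Hypothetical helper: picks an Encoding from a Content-Type header value
    // such as "text/html; charset=windows-1255". Returns 'fallback' when no
    // charset is present or the name is not recognized by the runtime.
    public static Encoding FromContentType(string contentType, Encoding fallback)
    {
        if (string.IsNullOrEmpty(contentType))
        {
            return fallback;
        }

        foreach (string part in contentType.Split(';'))
        {
            string trimmed = part.Trim();

            if (trimmed.StartsWith("charset=", StringComparison.OrdinalIgnoreCase))
            {
                string name = trimmed.Substring("charset=".Length).Trim().Trim('"', '\'');

                try
                {
                    return Encoding.GetEncoding(name);
                }
                catch (ArgumentException)
                {
                    // Unknown or unsupported charset name.
                    return fallback;
                }
            }
        }

        return fallback;
    }
}
```

With something like this, the decode step could become Encoding chosen = CharsetSketch.FromContentType(headerValue, Encoding.UTF8), followed by chosen.GetString(crawlRequest.Data), where headerValue is a placeholder for wherever the crawler keeps the response header. Note that on newer .NET runtimes, code pages such as windows-1255 are only available after registering the code pages encoding provider; otherwise the sketch falls back.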