Hi, I'm very happy I found this source code, thanks for sharing. Right now I am only crawling Danish sites, but I have a problem with the Danish letters æøå. Is there a way I can fix it? The pages where the letters are wrong have one of these tags in the header:
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
or
<meta http-equiv="content-type" content="text/html;charset=iso-8859-1">
or nothing at all.
So I think it goes wrong when the pages are downloaded.
Thanks!
OK - fixed and checked in.
The problem was with how the file was being saved. I was using a FileStream but needed to use a StreamWriter and specify UTF8 as the encoding.
if (saveWebPageToDisk)
{
    managedWebPage.DiscoveryPath = DiscoveryManager.GetDiscoveryPath(ApplicationSettings.DownloadedWebPagesDirectory, absoluteUri);
    managedWebPage.StreamWriter = new StreamWriter(managedWebPage.DiscoveryPath, false, Encoding.UTF8);
    managedWebPage.StreamWriter.Write(source2);
}
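For anyone following along, here is a minimal, self-contained sketch of the same idea outside the crawler: write the text through a StreamWriter with an explicit UTF-8 encoding, then read it back with the same encoding. The class name Utf8SaveSketch and the method SaveAndReload are made up for illustration and are not part of arachnode.net.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper for illustration only; not part of arachnode.net.
public static class Utf8SaveSketch
{
    // Writes 'text' to 'path' as UTF-8 and reads it back with the same encoding.
    public static string SaveAndReload(string path, string text)
    {
        // 'false' overwrites an existing file instead of appending to it.
        using (var writer = new StreamWriter(path, false, Encoding.UTF8))
        {
            writer.Write(text);
        }

        // Reading with the same encoding returns the original characters.
        return File.ReadAllText(path, Encoding.UTF8);
    }
}
```

Note that this only helps if the string is already correct when it reaches the writer. If the bytes were decoded with the wrong encoding earlier, the file will faithfully contain the wrong characters.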
I fixed the rest of the encoding issues. You can now view non-English summaries properly as well as view the Cached versions of the pages properly through search.
Would you supply a site for me to crawl and test against?
Where exactly do you see the incorrect character(s)? What should it be?
Thanks. As you can see, I get �.
Here are some sites.
On the first site it is ø and å:
http://mc-select.dk/default.asp?TemplateID=279
mc-select MC Hjelme MCSelect Grips | Solbriller | Transport tilbeh�r | St�vler | Garage tilbeh�r | L�se | F�rer Tilbeh�r | Tasker | Tasker ...
It should be:
mc-select MC Hjelme MCSelect Grips | Solbriller | Transport tilbehør | Støvler | Garage tilbehør | Låse | Fører Tilbehør | Tasker | Tasker ...
On the second site it is æ and å:
http://hvetbo-mc.dk/print.php?type=N&item_id=7
Hvetbo MC Hvetbo MC Layout p� hjemmesiden �ndret Skrevet af Martin d. June 16 2009 09:34:38 Da der var flere brugere som havde fejl p� siden med det andet ... http://hvetbo-mc.dk/print.php?type=N&item_id=7 - Cached (Score:0,9416782 Strength:0 = Total:0) - Explain
It should be:
I have my æøå showing correctly now.
I will clean up the code I wrote and post it in the next few days.
That would be great! Thanks!
Another user has detected an encoding issue - so, hopefully we can solve this issue soon.
Thanks again,
Mike
Change from the first line to the second line in CrawlRequestManager:
crawlRequest.DecodedHtml = HttpUtility.HtmlDecode(Encoding.UTF8.GetString(crawlRequest.Data));
crawlRequest.DecodedHtml = HttpUtility.HtmlDecode(Encoding.Default.GetString(crawlRequest.Data));
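To show why the choice of decoder matters here, the sketch below takes the word "tilbehør" as ISO-8859-1 bytes and decodes it two ways. Decoding as UTF-8 produces the replacement character (�), because the single byte for ø (0xF8) is not valid UTF-8; decoding as ISO-8859-1 gives the original word back. The class DecodeMismatchSketch and its methods are hypothetical names for illustration only.

```csharp
using System;
using System.Text;

// Hypothetical demonstration class; not part of arachnode.net.
public static class DecodeMismatchSketch
{
    // "tilbehør" encoded as ISO-8859-1: one byte per character, ø = 0xF8.
    public static byte[] SampleLatin1Bytes()
    {
        return Encoding.GetEncoding("iso-8859-1").GetBytes("tilbeh\u00F8r");
    }

    // Wrong decoder for these bytes: 0xF8 is invalid in UTF-8 and becomes U+FFFD.
    public static string DecodeAsUtf8(byte[] data)
    {
        return Encoding.UTF8.GetString(data);
    }

    // Matching decoder: every byte maps back to the original character.
    public static string DecodeAsLatin1(byte[] data)
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(data);
    }
}
```

One caveat about Encoding.Default: on the .NET Framework it is the machine's ANSI code page (often windows-1252 on Western European systems), so it only helps when the page happens to use that same code page. On .NET Core and later, Encoding.Default is UTF-8, so the change would have no effect there.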
Sorry for not posting the code; it did not work 100%. The problem with my code is that it did something to the score, and I was still trying to fix that, but now I will try your update with the StreamWriter.
I will post back with the result.
thanks
No worries. Let me know what you find.
I just tried the latest version and the problems are the same.
In my code I try to find the charset in the meta tag; if there is none, I try to get it from the HTTP response. If I find iso-8859-1, I convert the web page (crawlRequest.Data) to UTF-8. This works, but as I said before, it then causes problems with the score. So I tried converting only the data that goes into crawlRequest.DecodedHtml, not crawlRequest.Data. That did not work; æøå were wrong again.
Hmm.
So the version below gives correct æøå but the wrong score. Another thing: I don't know what to do if I don't find the charset.
private static void ProcessWebPage(CrawlRequest crawlRequest, WebPageManager webPageManager, ArachnodeDAO arachnodeDAO)
{
    ConsoleManager.OutputWebPageDiscovered(crawlRequest.Crawl.CrawlInfo.ThreadNumber, crawlRequest);
    Counters.GetInstance().WebPagesDiscovered(1);

    //crawlRequest.DecodedHtml = HttpUtility.HtmlDecode(Encoding.UTF8.GetString(crawlRequest.Data));
    //crawlRequest.DecodedHtml = HttpUtility.HtmlDecode(Encoding.Default.GetString(crawlRequest.Data));

    //======================
    //Convert crawlRequest.Data document to string
    System.Text.Encoding encode = System.Text.Encoding.GetEncoding(System.Text.Encoding.Default.CodePage);
    string TargetStr = encode.GetString(crawlRequest.Data);
    string Charset = "";

    //search in the document meta tags for charset
    string pattern = "(<meta[^>]*charset[ \t]*=[ \t\"]*)([^<> \r\n\"]*)";
    MatchCollection mc = Regex.Matches(TargetStr, pattern, RegexOptions.Multiline | RegexOptions.IgnoreCase);
    if (mc.Count > 0)
    {
        Match TheMatch = mc[0];
        GroupCollection GroupCol = TheMatch.Groups;
        Group TheThirdGroup = GroupCol[2];
        Charset = TheThirdGroup.ToString();
    }

    //if no charset found then the default charset
    if (Charset == "")
    {
        System.Net.HttpWebResponse WResponseCharset = (System.Net.HttpWebResponse)crawlRequest.WebClient.WebResponse;
        Charset = WResponseCharset.CharacterSet;
    }

    // Convert to Utf-8
    if (Charset != "")
    {
        UTF8Encoding utf8 = new UTF8Encoding();
        Encoding enc = System.Text.Encoding.GetEncoding(Charset);
        crawlRequest.Data = Encoding.Convert(enc, utf8, crawlRequest.Data);
        crawlRequest.DecodedHtml = HttpUtility.HtmlDecode(Encoding.UTF8.GetString(crawlRequest.Data));
    }
    else
    {
        crawlRequest.DecodedHtml = HttpUtility.HtmlDecode(Encoding.UTF8.GetString(crawlRequest.Data));
    }
    //======================

    webPageManager.ManageWebPage(crawlRequest);

    /**/

    //Email Addresses
    ProcessEmailAddresses(crawlRequest, arachnodeDAO);

    /**/

    //HyperLinks
    ProcessHyperLinks(crawlRequest, arachnodeDAO);

    /**/

    //Files and Images
    ProcessFilesAndImages(crawlRequest, arachnodeDAO);
}
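On the open question of what to do when no charset is found, one common heuristic (offered here as a sketch, not as what arachnode.net does) is: use the meta tag charset if present, otherwise the charset from the HTTP header, otherwise try a strict UTF-8 decode and fall back to ISO-8859-1 if the bytes are not valid UTF-8. The class CharsetFallbackSketch, the method Decode, and the headerCharset parameter are all hypothetical names. The method returns a decoded string and leaves the raw byte array untouched.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of arachnode.net.
public static class CharsetFallbackSketch
{
    // 'headerCharset' is assumed to come from the HTTP response; it may be null or empty.
    public static string Decode(byte[] data, string headerCharset)
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        // 1. Scan for a charset in a meta tag. ISO-8859-1 maps every byte to a
        //    character, so it is safe for this ASCII-only search.
        string probe = latin1.GetString(data);
        Match m = Regex.Match(probe,
            "<meta[^>]*charset\\s*=\\s*[\"']?([^\\s\"'<>/;]+)",
            RegexOptions.IgnoreCase);
        string charset = m.Success ? m.Groups[1].Value : headerCharset;

        // 2. Use the declared charset if the runtime recognizes the name.
        if (!string.IsNullOrEmpty(charset))
        {
            try
            {
                return Encoding.GetEncoding(charset).GetString(data);
            }
            catch (ArgumentException) { /* unknown name: fall through */ }
            catch (NotSupportedException) { /* unsupported on this platform: fall through */ }
        }

        // 3. No usable charset: try strict UTF-8, which throws on invalid bytes.
        try
        {
            return new UTF8Encoding(false, true).GetString(data);
        }
        catch (DecoderFallbackException)
        {
            // 4. Not valid UTF-8: assume a single-byte Western encoding.
            return latin1.GetString(data);
        }
    }
}
```

The last step is a guess by design: ISO-8859-1 is a reasonable default for Danish pages, but a crawler covering other languages would need a different fallback or a real detection library.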
It's odd - the code was working properly for a minute and then stopped. Hmm...
Well, wait...
The crawlRequest.Data works for http://www.myvod.tv but not for http://www.movin.co.il ... frustrating... hmm... I'll look at your site now...
I checked in some code which is a HACK while I research a proper solution to detect encoding. Let me know what you find.
(Still researching...)
-Mike
m_response is of type HttpWebResponse.
Still working on this... there is some overlap with LastModified, so I'm tackling both of those issues at once. Thanks for your patience.
I just checked in EncodingManager.cs.
See if that works for you - test it out and if it looks good we'll include it in the source.