arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL 2005/2008/CE

codepage and charset

Answered (Verified) This post has 2 verified answers | 19 Replies | 3 Followers

Top 50 Contributor
7 Posts
kbh2200 posted on Tue, Jul 28 2009 7:30 PM

Hi, I'm very happy I found this source code. Thanks for sharing it.
Right now I am only crawling Danish sites, but I have problems
with the Danish letters æøå.
Is there a way I can fix this? The pages where the letters come out wrong
have one of these tags in the header:

<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
or
<meta http-equiv="content-type" content="text/html;charset=iso-8859-1">
or
no charset tag at all

So I think something goes wrong when the pages are downloaded.

Answered (Verified) Verified Answer

Top 10 Contributor
1,694 Posts

Thanks!

OK - fixed and checked in.

The problem was with how the file was being saved.  I was using a FileStream but needed to use a StreamWriter and specify UTF8 as the encoding.

    if (saveWebPageToDisk)
    {
        managedWebPage.DiscoveryPath = DiscoveryManager.GetDiscoveryPath(ApplicationSettings.DownloadedWebPagesDirectory, absoluteUri);
        managedWebPage.StreamWriter = new StreamWriter(managedWebPage.DiscoveryPath, false, Encoding.UTF8);
        managedWebPage.StreamWriter.Write(source2);
    }
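
As a rough illustration of the StreamWriter point above: a writer constructed with Encoding.UTF8 encodes characters such as æøå as UTF-8 on the way to disk, so reading the file back as UTF-8 returns the same text. This is only a sketch. The class name, the method names and the file path are hypothetical and are not part of arachnode.net.

```csharp
using System.IO;
using System.Text;

public static class Utf8SaveSketch
{
    // Writes the text to the given path, overwriting any existing file, encoded as UTF-8.
    public static void Save(string path, string text)
    {
        using (var writer = new StreamWriter(path, false, Encoding.UTF8))
        {
            writer.Write(text);
        }
    }

    // Reads the whole file back, decoding it as UTF-8.
    public static string Load(string path)
    {
        return File.ReadAllText(path, Encoding.UTF8);
    }
}
```

Calling Utf8SaveSketch.Save(path, "Støvler") and then Utf8SaveSketch.Load(path) should give back the same string, provided the string was decoded correctly before it was saved.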

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
1,694 Posts

I fixed the rest of the encoding issues.  You can now view non-English summaries properly as well as view the Cached versions of the pages properly through search.


All Replies

Top 10 Contributor
1,694 Posts

Would you supply a site for me to crawl and test against?

Where exactly do you see the incorrect character(s)?  What should it be?


Top 50 Contributor
7 Posts

Thanks. As you can see, I get � in place of the Danish letters.

Here are some sites.

On the first site, the problem is with ø and å:

http://mc-select.dk/default.asp?TemplateID=279

mc-select MC Hjelme MCSelect Grips | Solbriller | Transport tilbeh�r | St�vler | Garage tilbeh�r | L�se | F�rer Tilbeh�r | Tasker | Tasker ...

It should be:

mc-select MC Hjelme MCSelect Grips | Solbriller | Transport tilbehør | Støvler | Garage tilbehør | Låse | Fører Tilbehør | Tasker | Tasker ...

On the second site, the problem is with æ and å:

http://hvetbo-mc.dk/print.php?type=N&item_id=7 

Hvetbo MC Hvetbo MC Layout p� hjemmesiden �ndret Skrevet af Martin d. June 16 2009 09:34:38 Da der var flere brugere som havde fejl p� siden med det andet ... http://hvetbo-mc.dk/print.php?type=N&item_id=7 - Cached (Score:0,9416782 Strength:0 = Total:0) - Explain

It should be:

Hvetbo MC Hvetbo MC Layout på hjemmesiden ændret Skrevet af Martin d. June 16 2009 09:34:38 Da der var flere brugere som havde fejl på siden med det andet ... http://hvetbo-mc.dk/print.php?type=N&item_id=7 - Cached (Score:0,9416782 Strength:0 = Total:0) - Explain

Top 50 Contributor
7 Posts

I have my æøå displaying correctly now.

I will clean up the code I wrote and post it in the next few days.

 

Top 10 Contributor
1,694 Posts

That would be great!  Thanks!

Another user has detected an encoding issue - so, hopefully we can solve this issue soon.

Thanks again,
Mike


Top 10 Contributor
229 Posts

In CrawlRequestManager, change the first line below to the second:

    crawlRequest.DecodedHtml = HttpUtility.HtmlDecode(Encoding.UTF8.GetString(crawlRequest.Data));

    crawlRequest.DecodedHtml = HttpUtility.HtmlDecode(Encoding.Default.GetString(crawlRequest.Data));
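
The two lines differ only in which encoding turns the downloaded bytes into a string. The sketch below shows why that matters for a page served as ISO-8859-1. The helper name is hypothetical, and the charset is passed in by name rather than taken from Encoding.Default, because on the .NET Framework Encoding.Default is the machine's ANSI code page and so depends on the server's regional settings.

```csharp
using System.Text;

public static class DecodeSketch
{
    // Decodes raw page bytes using the named charset, for example "utf-8" or "iso-8859-1".
    public static string Decode(byte[] data, string charsetName)
    {
        return Encoding.GetEncoding(charsetName).GetString(data);
    }
}
```

For the bytes 0x68 0xF8 0x72 ("hør" in ISO-8859-1), decoding with "iso-8859-1" yields the expected letters, while decoding with "utf-8" replaces the invalid byte with the U+FFFD replacement character, which is the � shown in the search results.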

Top 50 Contributor
7 Posts

Sorry for not posting the code. It did not work 100%: my code did something to the score, and I was still trying to fix that. Now I will try your update with the StreamWriter.

I will post back with the result.

Thanks

 

 

Top 10 Contributor
1,694 Posts

No worries.  Let me know what you find.


Top 50 Contributor
7 Posts

I just tried the latest version and the problems are the same.

In my code I try to find the charset in the meta tag. If there is none, I try to get it from the HttpWebResponse. If I find iso-8859-1, I convert the web page (crawlRequest.Data) to UTF-8. This works, but as I said before, it then causes problems with the score.

So I tried converting only the data that goes into crawlRequest.DecodedHtml, not crawlRequest.Data. That did not work: æøå were wrong again.

Hmm.

So the version below gives correct æøå but the wrong score. The other thing is that I don't know what to do if I can't find the charset.

 

    private static void ProcessWebPage(CrawlRequest crawlRequest, WebPageManager webPageManager, ArachnodeDAO arachnodeDAO)
    {
        ConsoleManager.OutputWebPageDiscovered(crawlRequest.Crawl.CrawlInfo.ThreadNumber, crawlRequest);
        Counters.GetInstance().WebPagesDiscovered(1);

        //crawlRequest.DecodedHtml = HttpUtility.HtmlDecode(Encoding.UTF8.GetString(crawlRequest.Data));
        //crawlRequest.DecodedHtml = HttpUtility.HtmlDecode(Encoding.Default.GetString(crawlRequest.Data));

        //======================
        //Convert crawlRequest.Data document to string
        System.Text.Encoding encode = System.Text.Encoding.GetEncoding(System.Text.Encoding.Default.CodePage);
        string TargetStr = encode.GetString(crawlRequest.Data);
        string Charset = "";

        //search in the document meta tags for charset
        string pattern = "(<meta[^>]*charset[ \t]*=[ \t\"]*)([^<> \r\n\"]*)";
        MatchCollection mc = Regex.Matches(TargetStr, pattern, RegexOptions.Multiline | RegexOptions.IgnoreCase);

        if (mc.Count > 0)
        {
            //the second capture group holds the charset name
            Charset = mc[0].Groups[2].ToString();
        }

        //if no charset found then the default charset
        if (Charset == "")
        {
            System.Net.HttpWebResponse WResponseCharset = (System.Net.HttpWebResponse)crawlRequest.WebClient.WebResponse;
            Charset = WResponseCharset.CharacterSet;
        }

        // Convert to Utf-8
        if (Charset != "")
        {
            UTF8Encoding utf8 = new UTF8Encoding();
            Encoding enc = System.Text.Encoding.GetEncoding(Charset);

            crawlRequest.Data = Encoding.Convert(enc, utf8, crawlRequest.Data);
            crawlRequest.DecodedHtml = HttpUtility.HtmlDecode(Encoding.UTF8.GetString(crawlRequest.Data));
        }
        else
        {
            crawlRequest.DecodedHtml = HttpUtility.HtmlDecode(Encoding.UTF8.GetString(crawlRequest.Data));
        }
        //======================

        webPageManager.ManageWebPage(crawlRequest);

        /**/

        //Email Addresses
        ProcessEmailAddresses(crawlRequest, arachnodeDAO);

        /**/

        //HyperLinks
        ProcessHyperLinks(crawlRequest, arachnodeDAO);

        /**/

        //Files and Images
        ProcessFilesAndImages(crawlRequest, arachnodeDAO);
    }
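
On the open question of what to do when no charset is found, one possible shape is to pick an Encoding in a fixed order (meta tag, then the HTTP header value, then a caller-supplied fallback) and to treat an unknown charset name the same as a missing one. The following is only a sketch under those assumptions: the class, the method names and the simplified regular expression are hypothetical and are not arachnode.net code. It decodes into a string and leaves the byte array untouched.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class CharsetSketch
{
    // Captures the value after "charset=" inside a <meta> tag (simplified pattern).
    private static readonly Regex MetaCharset = new Regex(
        "<meta[^>]*charset\\s*=\\s*[\"']?([^\"'\\s/>]+)",
        RegexOptions.IgnoreCase);

    // Picks an encoding: meta tag first, then the HTTP header charset, then the fallback.
    public static Encoding Detect(byte[] data, string headerCharset, Encoding fallback)
    {
        // ISO-8859-1 maps every byte to one char, so the ASCII markup can be
        // scanned before the real charset is known.
        string probe = Encoding.GetEncoding("iso-8859-1").GetString(data);
        Match m = MetaCharset.Match(probe);
        string name = m.Success ? m.Groups[1].Value : headerCharset;

        if (string.IsNullOrEmpty(name)) return fallback;

        try { return Encoding.GetEncoding(name); }
        catch (ArgumentException) { return fallback; } // unknown or misspelled charset name
    }

    // Decodes the page bytes with the detected encoding; data itself is not modified.
    public static string Decode(byte[] data, string headerCharset, Encoding fallback)
    {
        return Detect(data, headerCharset, fallback).GetString(data);
    }
}
```

A caller could pass the response's CharacterSet as headerCharset and, for a crawl of Danish sites, an ISO-8859-1 encoding as the fallback. Which fallback is right depends on the sites being crawled.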

 

 

  

 

Top 10 Contributor
1,694 Posts

It's odd - the code was working properly for a minute and then stopped.  Hmm...

Well, wait...

The crawlRequest.Data works for

http://www.myvod.tv

but not for

http://www.movin.co.il

... frustrating... hmm...

I'll look at your site now...


Top 10 Contributor
1,694 Posts

I checked in some code which is a HACK while I research a proper solution to detect encoding.  Let me know what you find.

(Still researching...)

-Mike


Top 10 Contributor
229 Posts
VB solution maybe? I hope that class can help us:
    Public Function GetContent() As String
        If m_content = Nothing Then
            Dim dataStream As Stream = m_response.GetResponseStream()
            Using reader As New StreamReader(dataStream, Text.Encoding.Default)

                m_content = reader.ReadToEnd()
            End Using
        End If

        Return m_content
    End Function

    Public Function GetDecodedContent() As String
        Dim raw As String = GetContent()
        Dim dest_enc As Encoding = GetEncoding()
        Dim output As String

        output = Encoding.Default.GetChars(Encoding.Convert(dest_enc, Encoding.Default, Encoding.Default.GetBytes(raw)))

        Return output
    End Function

    Public Function GetEncodingHeader() As Encoding
        Dim output As Encoding = Nothing
        If m_response.ContentEncoding <> "" Then
            Try
                output = Encoding.GetEncoding(m_response.ContentEncoding)
            Catch
                output = Nothing
            End Try
        End If

        Return output
    End Function

    Public Function GetEncoding() As Encoding
        Dim pattern As String = "<\s*(meta\s*http-equiv=\s*[""']?)?content-type\s*[:""']?\s*(content\s*=\s*)?[""']?\s*[\w/\\]*\s*;\s*charset\s*=\s*(?<encoding>[^""';>]*)\s*[""']?\s*/?\s*>"
        Dim m_RegEx As Regex
        Dim encMatch As Match
        Dim enc As Encoding = Encoding.GetEncoding("windows-1255")

        If m_content = Nothing Then
            Throw New Exception("First call GetContent()")
        End If

        If GetEncodingHeader() IsNot Nothing Then
            Return GetEncodingHeader()
        Else
            m_RegEx = New Regex(pattern, RegexOptions.IgnoreCase Or RegexOptions.Compiled)
            encMatch = m_RegEx.Match(m_content)

            If encMatch.Success Then
                Try
                    enc = Encoding.GetEncoding(encMatch.Groups("encoding").ToString.ToLower)
                Catch
                End Try
            End If
        End If

        Return enc
    End Function


m_response is of type HttpWebResponse.
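
One detail that may be worth checking in a class like this: on HttpWebResponse, ContentEncoding reflects the Content-Encoding header (for example gzip), whereas the charset travels in the Content-Type header. As a hedged C# sketch, the charset can be read out of a raw Content-Type value with System.Net.Mime.ContentType. The class and method names below are hypothetical.

```csharp
using System;
using System.Net.Mime;

public static class ContentTypeSketch
{
    // Returns the charset parameter of a Content-Type header value,
    // or null when the header is missing, has no charset, or cannot be parsed.
    public static string GetCharset(string contentTypeHeader)
    {
        if (string.IsNullOrEmpty(contentTypeHeader)) return null;

        try { return new ContentType(contentTypeHeader).CharSet; }
        catch (FormatException) { return null; } // malformed header value
    }
}
```

For example, ContentTypeSketch.GetCharset("text/html; charset=iso-8859-1") gives "iso-8859-1", and a header with no charset parameter gives null, which lets the caller fall through to the meta tag check.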

Top 10 Contributor
1,694 Posts

Still working on this... there is some overlap with LastModified, so I'm tackling both of those issues at once.  Thanks for your patience.


Top 10 Contributor
1,694 Posts

I just checked in EncodingManager.cs.

See if that works for you - test it out and if it looks good we'll include it in the source.



copyright 2004-2013, arachnode.net LLC