Replacing accented characters in a string

Hi Everyone, I have this problem.

I have to copy a string read from a database into a char buffer, replacing each accented character with the hex value of the corresponding character in the WinAnsiEncoding charset.

"è" --> 0xE8
"é" --> 0xE9
"à" --> 0xE0
"ù" --> 0xF9
"ò" --> 0xF2
"ì" --> 0xEC

This is the code:

void encodeToWinAnsiEncoding(std::string s, char* buf)
{
    std::string sTmp;
    for(int i=0; i <= s.length()-1; i++) {
        sTmp = subStr(s,i,1);
        if ( sTmp == "è" )
            buf[i] = 0xE8;
        else if ( sTmp == "é" )
            buf[i] = 0xE9;
        else if ( sTmp == "à" )
            buf[i] = 0xE0;
        else if ( sTmp == "ù" )
            buf[i] = 0xF9;
        else if ( sTmp == "ò" )
            buf[i] = 0xF2;
        else if ( sTmp == "ì" )
            buf[i] = 0xEC;
        else
            buf[i] = sTmp[0];
    }
    buf[s.length()] = 0x00;

    return;   
}


The problem is that the if () comparison for each character I have to replace doesn't work, because these characters are multibyte.

Can someone help?

Thank you in advance.
Is the original string UTF-8?

And are you coding for Windows? (your function name suggests so!)

Andy
I'm working with the latest version of Debian, and I'm using the WinAnsiEncoding encoding because I need the function to write a PDF with libharu, but I don't know whether that really matters.
Anyway, once this problem is solved I can encode in ISO8859-16 or another encoding.


Thank you
Cool

Windows-1252 is just a superset of ISO 8859-1. Can you use that rather than -16?

On Linux I've done that kind of conversion using libiconv
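For the libiconv route, a minimal sketch might look like the following. It assumes glibc's built-in iconv (on other systems you may need to link with -liconv, and the type of the input pointer can differ) and that the converter accepts the names "UTF-8" and "CP1252". The function name utf8ToCp1252 is made up for illustration.

```cpp
#include <iconv.h>
#include <cstddef>
#include <string>

// Hypothetical helper: convert UTF-8 text to Windows-1252 (CP1252).
// Returns false if the converter is unavailable or the conversion fails,
// for example on invalid input.
bool utf8ToCp1252(const std::string& in, std::string& out)
{
    iconv_t cd = iconv_open("CP1252", "UTF-8");   // arguments are (to, from)
    if (cd == (iconv_t)-1)
        return false;

    // Every UTF-8 character needs at least one byte, so the single-byte
    // output can never be longer than the input.
    out.assign(in.size(), '\0');

    char* src = const_cast<char*>(in.data());     // iconv does not modify the input
    std::size_t srcLeft = in.size();
    char* dst = &out[0];
    std::size_t dstLeft = out.size();

    std::size_t rc = iconv(cd, &src, &srcLeft, &dst, &dstLeft);
    iconv_close(cd);
    if (rc == (std::size_t)-1)
        return false;

    out.resize(out.size() - dstLeft);             // drop the unused tail
    return true;
}
```

The result can then be copied into the char buffer that is passed to libharu.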
Libharu can use these encodings:
"StandardEncoding",
"MacRomanEncoding",
"WinAnsiEncoding",
"ISO8859-2",
"ISO8859-3",
"ISO8859-4",
"ISO8859-5",
"ISO8859-9",
"ISO8859-10",
"ISO8859-13",
"ISO8859-14",
"ISO8859-15",
"ISO8859-16",
"CP1250",
"CP1251",
"CP1252",
"CP1254",
"CP1257",
"KOI8-R",
"Symbol-Set",
"ZapfDingbats-Set"

Because my first need is to replace only the characters èéàùòì, and because calling the function textOut with a buffer holding the corresponding values works, I thought the fastest solution was to copy the string into the buffer and do the replacement there. It's the first time I've worked with multibyte characters.
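If the database string is UTF-8 (an assumption, nobody confirmed it in this thread), each of these accented letters is stored as two bytes, for example "è" is 0xC3 0xA8. A rough sketch of the copy-and-replace idea that decodes those two-byte sequences by hand could look like this. The name utf8ToWinAnsi is made up for illustration, and only code points U+00A0 to U+00FF are handled, where Unicode, ISO 8859-1 and Windows-1252 agree. Anything else becomes '?'.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: turn UTF-8 text into single-byte WinAnsi codes.
// ASCII is copied as-is, U+00A0..U+00FF become one byte each, and every
// other character is replaced by '?'.
std::string utf8ToWinAnsi(const std::string& s)
{
    std::string out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80) {                       // plain ASCII: one byte
            out += s[i];
            ++i;
            continue;
        }
        unsigned char next =
            (i + 1 < s.size()) ? static_cast<unsigned char>(s[i + 1]) : 0;
        if ((c == 0xC2 || c == 0xC3) && (next & 0xC0) == 0x80) {
            // Two-byte sequence 110xxxxx 10yyyyyy -> code point U+0080..U+00FF
            unsigned int cp = ((c & 0x1Fu) << 6) | (next & 0x3Fu);
            out += (cp >= 0xA0) ? static_cast<char>(cp) : '?';
            i += 2;
            continue;
        }
        // Unsupported or malformed: emit one '?' and skip the whole sequence.
        out += '?';
        ++i;
        while (i < s.size() &&
               (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80)
            ++i;
    }
    return out;
}
```

With this, "è" (0xC3 0xA8) comes out as the single byte 0xE8, which matches the table in the first post.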
This may or may not help, but have you tried using single quotes instead of double quotes? The specific characters you are using aren't really multi-byte characters.

void encodeToWinAnsiEncoding(std::string s, char* buf)
{
    std::string sTmp;
    for(int i=0; i <= s.length()-1; i++) {
        sTmp = subStr(s,i,1);
        if ( sTmp == 'è' )
            buf[i] = 'e';
        else if ( sTmp == 'é' )
            buf[i] = 'e';
        else if ( sTmp == 'à' )
            buf[i] = 'a';
        else if ( sTmp == 'ù' )
            buf[i] = 'u';
        else if ( sTmp == 'ò' )
            buf[i] = 'o';
        else if ( sTmp == 'ì' )
            buf[i] = 'i';
        else
            buf[i] = sTmp[0];
    }
    buf[s.length()] = '\0';

    return;   
}
You could consider decomposed Unicode (the opposite of precomposed). You get the base character and the accent as separate code points that you can replace independently.

ICU supports precomposed/decomposed conversion.

http://en.wikipedia.org/wiki/Precomposed_character
http://site.icu-project.org/
to Stewbond:

I just tested it and it doesn't work.

I tested this and it doesn't work either:

#include <iostream>
#include <string>
#include <stdio.h>

using namespace std;

int main()
{
    string E0 = "è";
    string E1;
    E1 = E0.substr(0,1);
    string E2 = "è";
    if (E1 == E2)
        cout << "1. passed" << '\n';
    else
        cout << "1. not passed" << '\n';

    if (E0 == E2)
        cout << "2. passed" << '\n';
    else
        cout << "2. not passed" << '\n';
    cout << "E0: " << E0 << '\n';
    cout << "E1: " << E1 << '\n';
  return(0);
}

And here is the output:
1. not passed
2. passed
E0: è
E1: 


It seems that when you copy the value of E0 into E1 with substr(), the value disappears.
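A likely explanation, assuming the source file and the terminal both use UTF-8 (the usual setup on Debian): "è" is stored as the two bytes 0xC3 0xA8, so substr(0,1) keeps only the first byte, which is not a complete character and prints as nothing. The small helper below (hexBytes is a made-up name) dumps the bytes of a string so this can be checked.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: return the bytes of s as lowercase hex, space separated.
std::string hexBytes(const std::string& s)
{
    std::string out;
    char tmp[4];
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned int byte = static_cast<unsigned char>(s[i]);
        std::snprintf(tmp, sizeof tmp, "%02x", byte);
        if (i != 0)
            out += ' ';
        out += tmp;
    }
    return out;
}
```

For the UTF-8 string "\xC3\xA8" (that is, "è"), hexBytes gives "c3 a8", while hexBytes of its substr(0,1) gives just "c3". Taking substr(0,2) keeps the whole character.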
to kbw:

Thank you, I'll give it a try, but it seems a little complex for what I need right now, which is just replacing the characters èéàùòì.
Solved. The problem was that I was working with string instead of wstring.
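The final code was not posted, so the following is only a guess at what a wstring-based version might look like. It assumes the text has already been converted to a std::wstring holding one Unicode code point per element, as wchar_t does on Linux. The name wideToWinAnsi is made up for illustration. Code points U+00A0 to U+00FF are identical in Unicode and Windows-1252, so they can be copied as single bytes, and anything else outside ASCII becomes '?'.

```cpp
#include <string>

// Hypothetical helper: map each wide character to its WinAnsi byte.
// ASCII and U+00A0..U+00FF are copied directly; other characters
// (for example the euro sign, U+20AC) are replaced by '?'.
std::string wideToWinAnsi(const std::wstring& ws)
{
    std::string out;
    out.reserve(ws.size());
    for (wchar_t wc : ws) {
        unsigned long cp = static_cast<unsigned long>(wc);
        if (cp < 0x80 || (cp >= 0xA0 && cp <= 0xFF))
            out += static_cast<char>(cp);
        else
            out += '?';
    }
    return out;
}
```

Because every element of the wstring is a whole character, comparisons such as wc == L'\u00E8' work, and indexing no longer splits a character in half.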