How can I convert ISO 8859-2 and ISO 8859-9 text to UTF in C++?
Is there something in C++ similar to what Java offers?
In Java we can simply create a new String(oldstring, "8859-2");
Is it that simple in C++?
More Info:
I am getting the name of a service provider in HEXA format in an XML file.
After parsing the XML file I am able to extract the HEXA string.
But in the C++ application, I need to display the characters and not the HEXA. These characters are encoded in ISO 8859-2 and ISO 8859-9...
Well, I have a couple of functions here, one converts a wchar_t array to a char array encoded as UTF-8, and the other a char array encoded as ISO-8859-1 to a wchar_t array. I suppose you could modify the second one to read the proper encoding.
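The functions themselves are not reproduced in this thread, so here is only a rough sketch of what the UTF-8 half might look like. The name toUtf8 and the use of std::vector are my own assumptions, not the original code, and it only handles code points up to U+FFFF, which is all that ISO 8859-* text can produce.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not the function mentioned above): encode Unicode
// code points as UTF-8. Assumes every input is <= U+FFFF and is not a
// surrogate; larger values would need a 4-byte branch.
std::string toUtf8(const std::vector<std::uint32_t>& codePoints) {
    std::string out;
    for (std::uint32_t cp : codePoints) {
        if (cp < 0x80) {
            // 1 byte: 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

With this, U+0171 (the 'ű' discussed in this thread) becomes the two bytes 0xC5 0xB1.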
Thanks for the function.
But I think I couldn't explain the problem clearly.
Let's take a character.
For example, take ‘ű’ (‘0xFB’ in ISO/IEC 8859-2, i.e. ‘\u0171’). If I do the normal HEXA to ASCII conversion I do get a character, but it is ‘û’ (‘0xFB’ in ISO/IEC 8859-1, i.e. ‘\u00FB’) instead.
So, basically, there is no conversion needed, just a format specifier kind of thing which converts the HEXA to the corresponding characters as per the encoding table.
In JAVA, we have something like,
new String("ABéû"); // "ABéû" (because default encoding on our system is ISO/IEC 8859-1)
new String("ABéû", "8859-1"); // "ABéû"
new String("ABéû", "8859-2"); // "ABéű" (see the difference with lines above, for the 'u' char)
There's no such thing as hex to ASCII conversion. ASCII is not a numeral system, but a code page. A char doesn't hold information on the code page its value is in. If you assign 0xFB to a char and read that as ASCII, then of course you'll get garbage, because that's not ASCII. Then the problem is not in how you converted the value from hex to native, but in how you're interpreting the data. Since there's no reliable way to change the code page the system assumes textual data is in, you'll have to change the code page of the data. Unlike changing encodings, which is mostly an arithmetical operation, there's probably no generic formula to apply that will change a value from one code page to another. This means that you need a conversion table.
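To make the "conversion table" idea concrete, here is a small sketch for ISO 8859-9, which is the easier of the two: it matches ISO 8859-1 (where the byte value equals the Unicode code point) except for six Turkish letters, so the table collapses to a short switch. The function names are made up for this example.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Map one ISO 8859-9 byte to its Unicode code point.
// Only the six positions that differ from ISO 8859-1 need an entry.
std::uint32_t iso88599ToCodePoint(unsigned char byte) {
    switch (byte) {
        case 0xD0: return 0x011E; // Ğ
        case 0xDD: return 0x0130; // İ
        case 0xDE: return 0x015E; // Ş
        case 0xF0: return 0x011F; // ğ
        case 0xFD: return 0x0131; // ı
        case 0xFE: return 0x015F; // ş
        default:   return byte;   // same as ISO 8859-1
    }
}

// Decode a whole byte string (hypothetical convenience wrapper).
std::vector<std::uint32_t> decodeIso88599(const std::string& bytes) {
    std::vector<std::uint32_t> out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes)
        out.push_back(iso88599ToCodePoint(b));
    return out;
}
```

ISO 8859-2 differs from ISO 8859-1 in far more positions, so there a full lookup array is the more practical form.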
Yep. Producing conversion tables is a lot of work. I once had to produce one for Shift JIS to Unicode and it was a nightmare.
You can also try to find one yourself
Pardon my understanding capabilities, but I didn't understand completely.
I tried your previous function on an input string which was
10000243696EE96D61, where
100002 -> indicates that the encoding format is ISO 8859-2
I remove the encoding part and convert the remaining digits to form an array which pairs two digits, such as 43, 69, 6E ...
which represent Cinéma (observe é)
Now, when I convert this to wchar_t I get 0x0043, 0x0069, 0x006E, 0x00E9, 0x006D & 0x0061
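For reference, the prefix-stripping and digit-pairing step described above could be sketched like this. The function names and the fixed 6-digit prefix length are assumptions based only on the single sample string in this post.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Convert one hex digit to its numeric value; throws on anything else.
int hexDigitValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    throw std::invalid_argument("not a hex digit");
}

// Drop the leading encoding marker (assumed to be 6 hex digits, as in
// "100002") and turn each remaining pair of digits into one raw byte.
std::string hexPayloadToBytes(const std::string& hex, std::size_t prefixLen = 6) {
    if (hex.size() < prefixLen || (hex.size() - prefixLen) % 2 != 0)
        throw std::invalid_argument("unexpected hex string length");
    std::string bytes;
    for (std::size_t i = prefixLen; i < hex.size(); i += 2)
        bytes += static_cast<char>(hexDigitValue(hex[i]) * 16 +
                                   hexDigitValue(hex[i + 1]));
    return bytes;
}
```

The result is still just raw bytes in the source encoding; nothing has been converted to Unicode at this point.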
Additional info:
If I change the regional language setting (for non-Unicode programs) to Hungarian and try to display the string (without converting to anything) I get the correct display! When I change the setting back to English, I do not see the correct string.
The function I gave you has almost everything you need to convert from ISO-8859-* to wchar_t, you just need to prepare the conversion tables I told you about.
The proper modification would look like this:
wchar_t *ISO88592_to_WChar(const char *buffer, long initialSize, long *finalSize){
    wchar_t *res = new wchar_t[initialSize];
    *finalSize = initialSize;
    //Widen each byte to 0..255 (the mask undoes sign extension of char):
    for (long a = 0; a < initialSize; a++)
        res[a] = ((wchar_t)buffer[a]) & 0xFF;
    //New code:
    for (long a = 0; a < initialSize; a++)
        //This line handles the conversion all by itself:
        res[a] = ISO88592_to_Unicode[res[a]];
    return res;
}
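The function assumes a 256-entry table named ISO88592_to_Unicode already exists. A sketch of how it could be built is below. Bytes under 0xA0 map to themselves, and only the upper 96 entries need listing. I typed these values from memory of the published ISO 8859-2 mapping, so check them against the Unicode Consortium's 8859-2 mapping file before relying on them. The helper names kIso88592High and makeIso88592Table are made up.

```cpp
#include <array>

// ISO 8859-2 bytes 0xA0..0xFF as Unicode code points, 8 entries per row.
// Values are an assumption to be verified against the official mapping file.
static const wchar_t kIso88592High[96] = {
    0x00A0, 0x0104, 0x02D8, 0x0141, 0x00A4, 0x013D, 0x015A, 0x00A7, // 0xA0
    0x00A8, 0x0160, 0x015E, 0x0164, 0x0179, 0x00AD, 0x017D, 0x017B, // 0xA8
    0x00B0, 0x0105, 0x02DB, 0x0142, 0x00B4, 0x013E, 0x015B, 0x02C7, // 0xB0
    0x00B8, 0x0161, 0x015F, 0x0165, 0x017A, 0x02DD, 0x017E, 0x017C, // 0xB8
    0x0154, 0x00C1, 0x00C2, 0x0102, 0x00C4, 0x0139, 0x0106, 0x00C7, // 0xC0
    0x010C, 0x00C9, 0x0118, 0x00CB, 0x011A, 0x00CD, 0x00CE, 0x010E, // 0xC8
    0x0110, 0x0143, 0x0147, 0x00D3, 0x00D4, 0x0150, 0x00D6, 0x00D7, // 0xD0
    0x0158, 0x016E, 0x00DA, 0x0170, 0x00DC, 0x00DD, 0x0162, 0x00DF, // 0xD8
    0x0155, 0x00E1, 0x00E2, 0x0103, 0x00E4, 0x013A, 0x0107, 0x00E7, // 0xE0
    0x010D, 0x00E9, 0x0119, 0x00EB, 0x011B, 0x00ED, 0x00EE, 0x010F, // 0xE8
    0x0111, 0x0144, 0x0148, 0x00F3, 0x00F4, 0x0151, 0x00F6, 0x00F7, // 0xF0
    0x0159, 0x016F, 0x00FA, 0x0171, 0x00FC, 0x00FD, 0x0163, 0x02D9  // 0xF8
};

// Build the full table: identity below 0xA0, listed values from 0xA0 up.
std::array<wchar_t, 256> makeIso88592Table() {
    std::array<wchar_t, 256> table{};
    for (int i = 0; i < 0xA0; ++i)
        table[i] = static_cast<wchar_t>(i);
    for (int i = 0xA0; i < 0x100; ++i)
        table[i] = kIso88592High[i - 0xA0];
    return table;
}

// Same name the conversion function indexes into.
const std::array<wchar_t, 256> ISO88592_to_Unicode = makeIso88592Table();
```

With this table, byte 0xFB comes out as 0x0171 ('ű') instead of 0x00FB ('û'), while 0xE9 stays 0x00E9 because 'é' sits at the same position in both encodings.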
What I discovered in the last 2 days is that in C++ Builder, the column control has something called font->Charset.
We can set the charset to EASTEUROPE_CHARSET for ISO 8859-2 and TURKISH_CHARSET for ISO 8859-9.
I have tried this and the results are as expected.
We just have to do one single operation: remove the encoding format from the HEXA string, then pair each 2 adjacent digits to form a char array. After that, voila, it's done.