How can I convert ISO 8859-2 to UTF?

Jan 23, 2009 at 3:53am
How can I convert ISO 8859-2 and ISO 8859-9 text to UTF in C++?
Is there something in C++ similar to what Java offers?
In Java we can simply create a new String(oldstring, "8859-2");
Is it that simple in C++?

More Info:
I am getting the name of a service provider as a hex string in an XML file.
After parsing the XML file I am able to extract the hex string.
But in the C++ application I need to display the characters and not the hex digits. These characters are encoded in ISO 8859-2 and ISO 8859-9...
Jan 23, 2009 at 4:57am
Well, I have a couple of functions here, one converts a wchar_t array to a char array encoded as UTF-8, and the other a char array encoded as ISO-8859-1 to a wchar_t array. I suppose you could modify the second one to read the proper encoding.
wchar_t *ISO88591_to_WChar(const char *buffer,long initialSize,long *finalSize);
char *WChar_to_UTF8(const wchar_t *buffer,long initialSize,long *finalSize);


wchar_t *ISO88591_to_WChar(const char *buffer,long initialSize,long *finalSize){
	wchar_t *res=new wchar_t[initialSize];
	*finalSize=initialSize;
	for (long a=0;a<initialSize;a++)
		res[a]=((wchar_t)buffer[a])&0xFF;
	return res;
}

//Support function. Doesn't need to be called.
long getUTF8size(const wchar_t *buffer,long size){
	long res=0;
	for (long a=0;a<size;a++){
		if (buffer[a]<0x80)
			res++;
		else if (buffer[a]<0x800)
			res+=2;
		else
			res+=3;
	}
	res+=3;
	return res;
}

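char *WChar_to_UTF8(const wchar_t *buffer,long initialSize,long *finalSize){
	//getUTF8size() reserves 3 spare bytes, so there is room for the terminator.
	long fSize=getUTF8size(buffer,initialSize);
	char *res=new char[fSize];
	long b=0;
	for (long a=0;a<initialSize;a++,b++){
		wchar_t character=buffer[a];
		if (character<0x80)
			res[b]=(char)character;
		else if (character<0x800){
			res[b++]=(char)((character>>6)|192);
			res[b]=(char)((character&63)|128);
		}else{
			//Code points above U+FFFF are not handled here.
			res[b++]=(char)((character>>12)|224);
			res[b++]=(char)(((character&4095)>>6)|128);
			res[b]=(char)((character&63)|128);
		}
	}
	res[b]=0;
	//Report the bytes actually written, not the padded allocation size.
	*finalSize=b;
	return res;
}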
Last edited on Jan 23, 2009 at 4:58am
Jan 23, 2009 at 5:38am
Thanks for the function.
But I think I couldn't explain the problem clearly.

Let's take a character.

For example, take ‘ű’ (0xFB in ISO/IEC 8859-2, i.e. ‘\u0171’). If I do the normal hex-to-ASCII conversion I get a character, but it has been replaced by ‘û’ (0xFB in ISO/IEC 8859-1, i.e. ‘\u00FB’).

So basically no conversion is needed, just some kind of format specifier that converts the hex values to the corresponding characters according to the encoding table.

In Java, we have something like this:

byte[] raw = {0x41, 0x42, (byte) 0xE9, (byte) 0xFB};

new String(raw); // "ABéû" (because the default encoding on our system is ISO/IEC 8859-1)

new String(raw, "ISO-8859-1"); // "ABéû"

new String(raw, "ISO-8859-2"); // "ABéű" (see the difference with the lines above, for the 'u' char)

Jan 23, 2009 at 6:16am
No, you explained it perfectly.

There's no such thing as hex to ASCII conversion. ASCII is not a numeral system, but a code page. A char doesn't hold information on the code page its value is in. If you assign 0xFB to a char and read that as ASCII, then of course you'll get garbage, because that's not ASCII. Then the problem is not in how you converted the value from hex to native, but in how you're interpreting the data. Since there's no reliable way to change the code page the system assumes textual data is in, you'll have to change the code page of the data. Unlike changing encodings, which is mostly an arithmetical operation, there's probably no generic formula to apply that will change a value from one code page to another. This means that you need a conversion table.
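To make the point concrete, here is a minimal sketch that decodes the same byte, 0xFB, under two code pages. The function names are hypothetical, and the ISO-8859-2 lookup deliberately covers only two bytes rather than the full table, so treat it as an illustration and not as a usable decoder.

```cpp
// ISO-8859-1 maps every byte to the Unicode code point with the same value.
unsigned decode_latin1(unsigned char byte) {
    return byte;
}

// Partial ISO-8859-2 decoder (hypothetical helper, illustration only).
unsigned decode_latin2_partial(unsigned char byte) {
    switch (byte) {
        case 0xE6: return 0x0107; // c with acute
        case 0xFB: return 0x0171; // u with double acute
        default:   return byte;   // stand-in; NOT correct for every byte
    }
}
```

Calling both functions with 0xFB yields two different code points, which is why the byte alone cannot tell you which character was meant.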

You should be able to process these without much difficulty:
http://www.haible.de/bruno/charsets/conversion-tables/ISO-8859-2.tar.bz2
http://www.haible.de/bruno/charsets/conversion-tables/ISO-8859-9.tar.bz2
For the first tarball, unicode.org-mappings/ISO8859/8859-2.TXT
For the second, unicode.org-mappings/ISO8859/8859-9.TXT
Those two files have the conversion tables you need (ISO-8859-2 -> Unicode and ISO-8859-9 -> Unicode).
Your processed tables should look like this:
wchar_t ISO88592_to_Unicode[0x100]={
	0x00, //ISO-8859-2+00 -> U+0000
	0x01, //ISO-8859-2+01 -> U+0001
	//...
	0x0107, //ISO-8859-2+E6 -> U+0107
	//...
};

wchar_t ISO88599_to_Unicode[0x100]={
	//...
};

Yep. Producing conversion tables is a lot of work. I once had to produce one for Shift JIS to Unicode and it was a nightmare.
You can also try to find a ready-made one yourself.
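Rather than typing the table by hand, the mapping files can probably be read at startup. The sketch below assumes the usual unicode.org layout, where each data line holds the byte as 0xXX, then the code point as 0xXXXX, then a '#' comment, and where lines starting with '#' are comments. The function name load_mapping and the choice of U+FFFD for unmapped bytes are my own assumptions.

```cpp
#include <array>
#include <cstdlib>
#include <istream>
#include <sstream>
#include <string>

// Fills a 256-entry table from a unicode.org-style mapping stream.
// Bytes the stream does not define are left as U+FFFD (replacement character).
std::array<wchar_t, 256> load_mapping(std::istream &in) {
    std::array<wchar_t, 256> table;
    table.fill(0xFFFD);
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == '#')
            continue; // blank line or comment line
        std::istringstream fields(line);
        std::string byteField, codeField;
        if (!(fields >> byteField >> codeField))
            continue; // fewer than two columns
        if (codeField[0] == '#')
            continue; // byte listed without a code point
        // strtoul with base 16 accepts the leading "0x".
        unsigned long byteValue = std::strtoul(byteField.c_str(), nullptr, 16);
        unsigned long codePoint = std::strtoul(codeField.c_str(), nullptr, 16);
        if (byteValue < 256)
            table[byteValue] = static_cast<wchar_t>(codePoint);
    }
    return table;
}
```

Opening 8859-2.TXT with std::ifstream and passing it to load_mapping should give a table equivalent to ISO88592_to_Unicode above.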
Jan 23, 2009 at 7:02am
Pardon me, but I didn't completely understand.
I tried your previous function on an input string, which was

10000243696EE96D61, where



100002 -> indicates that the encoding is ISO 8859-2

I remove the encoding prefix and pair up the remaining digits to form an array: 43, 69, 6E ...
which represents Cinéma (note the é)

Now, when I convert this to wchar_t I get 0x0043, 0x0069, 0x006E, 0x00E9, 0x006D and 0x0061

After this I try to convert the wchar_t array to UTF-8 and I get the following:
CinÃ©ma. But I do not expect this.
I expect to display Cinéma.

Additional info:
If I change the regional language setting (for non-Unicode programs) to Hungarian and try to display the string (without converting it to anything) I get the correct display! When I change the setting back to English, I do not see the correct string.
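The prefix stripping and digit pairing described in this post can be sketched as follows. The function names are hypothetical, and the prefix length is passed in by the caller (6 digits for the 100002 marker in the sample), since the thread does not say whether the marker always has that length.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Value of one hexadecimal digit; throws on anything else.
static int hex_digit(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    throw std::invalid_argument("not a hexadecimal digit");
}

// Skips prefixDigits leading digits, then turns each remaining pair into one byte.
std::string hex_to_bytes(const std::string &hex, std::size_t prefixDigits) {
    if (hex.size() < prefixDigits || (hex.size() - prefixDigits) % 2 != 0)
        throw std::invalid_argument("unexpected hex string length");
    std::string bytes;
    for (std::size_t i = prefixDigits; i < hex.size(); i += 2)
        bytes.push_back(static_cast<char>(hex_digit(hex[i]) * 16 + hex_digit(hex[i + 1])));
    return bytes;
}
```

For the sample input, hex_to_bytes("10000243696EE96D61", 6) returns the six raw bytes 43 69 6E E9 6D 61, which still have to be interpreted in the right code page.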
Jan 23, 2009 at 9:47pm
After this I try to convert the wchar_t to UTF and i get the following,
CinÃ©ma. But I do not expect this.
I expect to display Cinéma.

Yes. Like I said, your display system (whether it's the console or a debugger or whatever) is assuming that the string's code page is ISO-8859-1, which is wrong. "CinÃ©ma" is what the correct UTF-8 representation of the Unicode string "Cinéma" looks like when its bytes are shown as ISO-8859-1. 'é' has the same code point (0xE9) in both ISO-8859-2 and Unicode, which is why the wchar_t values looked unchanged.
As for the Hungarian setting, it most likely switches the system's non-Unicode code page to Windows-1250, which agrees with ISO-8859-2 for these characters, so the unconverted bytes display correctly.
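To see where the two-character garbage comes from, here is a small sketch of the UTF-8 encoding step for a single code point. The function name encode_utf8 is hypothetical, and like WChar_to_UTF8 it only handles code points below U+10000.

```cpp
#include <string>

// Encodes one code point below U+10000 as UTF-8.
std::string encode_utf8(unsigned codePoint) {
    std::string out;
    if (codePoint < 0x80) {
        // 1 byte: 0xxxxxxx
        out.push_back(static_cast<char>(codePoint));
    } else if (codePoint < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (codePoint >> 6)));
        out.push_back(static_cast<char>(0x80 | (codePoint & 0x3F)));
    } else {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xE0 | (codePoint >> 12)));
        out.push_back(static_cast<char>(0x80 | ((codePoint >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (codePoint & 0x3F)));
    }
    return out;
}
```

encode_utf8(0x00E9) produces the two bytes 0xC3 0xA9. A display that assumes ISO-8859-1 shows those as 'Ã' and '©', so one correctly encoded 'é' appears as two wrong characters.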

The function I gave you has almost everything you need to convert from ISO-8859-* to wchar_t, you just need to prepare the conversion tables I told you about.
The proper modification would look like this:
wchar_t *ISO88592_to_WChar(const char *buffer,long initialSize,long *finalSize){
	wchar_t *res=new wchar_t[initialSize];
	*finalSize=initialSize;
	for (long a=0;a<initialSize;a++)
		res[a]=((wchar_t)buffer[a])&0xFF;
	//New code:
	for (long a=0;a<initialSize;a++)
		//This line handles the conversion all by itself:
		res[a]=ISO88592_to_Unicode[res[a]];
	return res;
}
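For ISO-8859-9 the table is short enough to build in code, because that code page matches ISO-8859-1 (and therefore the first 256 Unicode code points) everywhere except six Turkish letters. A minimal sketch, with a hypothetical function name and std::array used in place of the raw array above:

```cpp
#include <array>

// Builds the ISO-8859-9 -> Unicode table: identity mapping plus six overrides.
std::array<wchar_t, 256> make_iso88599_table() {
    std::array<wchar_t, 256> table;
    for (int i = 0; i < 256; i++)
        table[i] = static_cast<wchar_t>(i);
    table[0xD0] = 0x011E; // G with breve
    table[0xDD] = 0x0130; // I with dot above
    table[0xDE] = 0x015E; // S with cedilla
    table[0xF0] = 0x011F; // g with breve
    table[0xFD] = 0x0131; // dotless i
    table[0xFE] = 0x015F; // s with cedilla
    return table;
}
```

The same shortcut does not work for ISO-8859-2, which differs from ISO-8859-1 in far more positions, so the full mapping file is the safer source there.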
Last edited on Jan 23, 2009 at 9:47pm
Jan 29, 2009 at 12:33pm
Thanks for all the help.

What I discovered in the last 2 days is that in C++Builder the column control has a property called font->Charset.

We can set the charset to EASTEUROPE_CHARSET for ISO 8859-2 and TURKISH_CHARSET for ISO 8859-9.

I have tried this and the results are as expected.

We just have to do one operation: remove the encoding prefix from the hex string, then pair up adjacent digits to form a char array. After that, voila, it's done.

Thank you again for all the input.