Just curious: how did you establish that the small Greek delta is written to the file as such?
Internationalization with C++ is currently so complex and unreliable (what little of it exists) that I have never bothered to study it. But the string text you used in the second snippet does not actually employ codepoints that require wide characters: those accented European letters have codes in the range above 7-bit ASCII that still fit in plain 8-bit char-s (for example in the ISO 8859-1/Latin-1 encoding).
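For example, here is a minimal sketch of that point (I am assuming a Latin-1 execution character set on one side and UTF-8 on the other; the byte values come from those two tables):

    #include <iostream>

    int main()
    {
        // In Latin-1 (ISO 8859-1) the letter e with an acute accent is the
        // single byte 0xE9, so it fits in an ordinary 8-bit char.
        // In UTF-8 the same codepoint, U+00E9, takes two bytes: 0xC3 0xA9.
        const char latin1[] = "\xE9";
        const char utf8[]   = "\xC3\xA9";

        std::cout << "Latin-1: " << sizeof latin1 - 1 << " byte(s), "
                  << "UTF-8: "   << sizeof utf8 - 1   << " byte(s)\n";
    }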
Regarding the delta, I wish I knew what you did to make it work, so that I could try to reproduce it with my gcc compiler :)
Regarding the encodings - UTF-8/16/32 - those were not designed for C++, and the C++03 standard does not associate any language features with them (except for the source encoding). UTF-8 and UTF-16 are not even indexable by codepoint, because they employ variable-length sequences; consequently they cannot serve directly as element encodings for C++ strings. UTF-32 can be used for indexable storage in memory, but it is effectively supported only where wchar_t is 32-bit, which is the case with gcc. MS compilers use a 16-bit wchar_t, because the OS itself employs 16-bit character codes, and this corresponds to the UCS-2 encoding. UCS-2 is a poor man's UTF-16: some codepoints are dropped from the encoding scheme in order to keep the code length fixed.
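If you want to check which situation you are in, printing sizeof(wchar_t) is enough. The following is only a small sketch of that check (nothing compiler-specific is assumed beyond the L'\x3B4' escape for the delta):

    #include <iostream>

    int main()
    {
        // gcc on Linux typically has a 4-byte wchar_t (room for full UTF-32),
        // while MS compilers use a 2-byte wchar_t (UCS-2/UTF-16 code units).
        std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << " byte(s)\n";

        // The Greek small delta is U+03B4; it fits in a single wchar_t on
        // either compiler, because it lies in the Basic Multilingual Plane.
        wchar_t delta = L'\x3B4';
        std::cout << "delta codepoint = " << static_cast<unsigned long>(delta) << "\n";
    }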
So:
UTF-8 uses variable-length codes of 1 to 4 bytes and is backward compatible with ASCII.
UTF-16 uses variable-length codes of one or two 16-bit words and is not backward compatible with ASCII.
UTF-32 uses fixed-length 32-bit codes and is not backward compatible with ASCII.
All of those are serialization encodings for files and streams, not necessarily in-memory representations.
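To make the variable-length point concrete, here is a minimal sketch of a UTF-8 encoder (encode_utf8 is my own hypothetical helper, not a standard facility). It shows why a UTF-8 string cannot be indexed by codepoint: even the small delta, U+03B4, already occupies two bytes:

    #include <iostream>
    #include <string>

    // Encode one Unicode codepoint (up to U+10FFFF) as 1 to 4 UTF-8 bytes.
    std::string encode_utf8(unsigned long cp)
    {
        std::string out;
        if (cp < 0x80) {                 // 7-bit ASCII: one byte, unchanged
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {         // two bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {       // three bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                         // four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        return out;
    }

    int main()
    {
        // Greek small delta, U+03B4, encodes to the two bytes 0xCE 0xB4.
        std::string bytes = encode_utf8(0x3B4);
        std::cout << std::hex;
        for (std::string::size_type i = 0; i < bytes.size(); ++i)
            std::cout << "0x" << (static_cast<unsigned>(bytes[i]) & 0xFF) << ' ';
        std::cout << '\n';
    }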
A very good, brief description of the differences between codepoints, characters, and graphemes is given in the following post, which also stresses that supporting the scripts of different languages is a much bigger job than mere encoding/decoding:
http://forum.osdev.org/viewtopic.php?p=17836#p17836
Regards