Working with UTF-8 strings in C/C++

Apr 15, 2010 at 3:24pm
People,
Hopefully there are (reasonably) simple answers: how should UTF-8 strings be handled in C? And in C++? And what about wide characters -- where are they actually used, and when should I bother making a program support them? (Qt uses UTF-16, which might be better for Japanese and similar scripts, but UTF-16 seems more failure-prone for content storage.) Is there any chance that UTF-8 is going to be replaced by UTF-16?

How common is it for projects to define their own string types, and why do they do it?
Apr 15, 2010 at 10:15pm
The way to handle them is about the same for C and C++ -- only the language methodology will differ.

At the moment you can still get away with ASCII-only programs, but going forward there is good reason to make your programs capable of handling multiple languages.

UTF-16 is not more failure-prone for content storage. There are some caveats to its use, but the same is true of UTF-8.

There is no chance that one will supplant the other. Both are simply encodings. Microsoft Windows systems tend towards UTF-16; POSIX systems tend towards UTF-8 (typically with UTF-32, via a 32-bit wchar_t, for internal storage).


Defining the string isn't the problem. Converting between encodings is.
A popular library is iconv, which is very portable and available on many systems.

Another is IBM's ICU -- though this is very bulky and difficult to set up.
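
For example, a bare-bones UTF-8 to UTF-16 conversion through iconv might look roughly like the sketch below. This is only an illustration: the exact iconv() prototype and the accepted encoding names vary a little between platforms, and real code has to handle E2BIG/EILSEQ/EINVAL and partial conversions properly.

#include <iconv.h>     // POSIX iconv
#include <stdexcept>
#include <string>
#include <vector>

// Convert a UTF-8 string to UTF-16LE bytes.  Error handling is
// deliberately abbreviated.
std::vector<char> utf8_to_utf16le(const std::string& utf8)
{
    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    std::vector<char> in(utf8.begin(), utf8.end());
    std::vector<char> out(utf8.size() * 4 + 4);    // generous output buffer

    char*  inptr   = in.empty() ? 0 : &in[0];
    size_t inleft  = in.size();
    char*  outptr  = &out[0];
    size_t outleft = out.size();

    if (iconv(cd, &inptr, &inleft, &outptr, &outleft) == (size_t)-1) {
        iconv_close(cd);
        throw std::runtime_error("iconv conversion failed");
    }

    iconv_close(cd);
    out.resize(out.size() - outleft);    // keep only the bytes actually written
    return out;
}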


Unfortunately with Unicode there is no simple answer -- at least not yet. Making your applications Unicode-aware is itself a very large project. To begin with, choose an internal encoding -- UTF-8, UTF-16, UTF-32, it doesn't really matter which -- and stick with it.
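
If you go with UTF-8 internally, for instance, you can keep your text in ordinary std::string objects; the only real change is remembering that one character may span several bytes. Here is a rough decoding sketch -- it assumes well-formed input and does no validation (overlong sequences, truncation and so on are a real decoder's problem):

#include <string>

// Decode the code point starting at position i of a UTF-8 string,
// advancing i past it.
unsigned long decode_utf8(const std::string& s, std::string::size_type& i)
{
    unsigned char b = static_cast<unsigned char>(s[i++]);
    if (b < 0x80) return b;                              // single ASCII byte
    int extra = (b >= 0xF0) ? 3 : (b >= 0xE0) ? 2 : 1;   // 2-, 3- or 4-byte sequence
    unsigned long cp = b & (0x3F >> extra);              // payload bits of the lead byte
    while (extra-- > 0)                                  // append each continuation byte
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}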

For more information, you can google around "c++ unicode" and the like.

Hope this helps.
Apr 16, 2010 at 10:53am
Thank you Duoas!
But why isn't UTF-16 the more fragile one, then?
I read in [1]: "If the byte stream is subject to corruption then some encodings recover better than others. UTF-8 and UTF-EBCDIC are best in this regard as they can always resynchronise at the start of the next good character, " .. " UTF-16 and UTF-32 will handle corrupt (altered) bytes by resynchronizing on the next good character, but a lost or spurious byte (octet) will garble all following text."

So is losing a byte, or picking up a spurious one (or mistaking one for the other), really that unlikely? (Apparently unlikely enough, luckily.) Meanwhile, UTF-8 will quickly return to correct readability, because a sequence is never longer than 4 bytes (it used to allow up to 6, but RFC 3629 restricts it to 4), so at worst one character is garbled before the stream resynchronises -- a small sketch of what I mean is below the link.

[1] http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings
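
To make sure I understand it, this is roughly how I picture a UTF-8 reader getting back in sync after a bad byte: it just skips continuation bytes, which always look like 10xxxxxx, until the next lead byte, so only that one character is lost. (Just my sketch, not production code.)

#include <string>

// Hypothetical helper: find the next position at which decoding can
// safely resume after an error.
std::string::size_type next_lead_byte(const std::string& utf8,
                                      std::string::size_type pos)
{
    while (pos < utf8.size() &&
           (static_cast<unsigned char>(utf8[pos]) & 0xC0) == 0x80)
        ++pos;              // skip continuation bytes (10xxxxxx)
    return pos;             // index of the next lead/ASCII byte, or end
}

With UTF-16 the bytes of a code unit aren't distinguishable like that, so a single lost byte shifts everything that follows.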
Apr 16, 2010 at 12:13pm
That is true of any multi-byte encoding. It is not an error in the format itself.

Any decent transmission protocol must account for dropped bits and/or bytes, and most provide a method to repair the damage.

Anyone using a proper transmission protocol can therefore expect to send and receive properly synchronized bytes at either end.

Hope this helps.