People,
hopefully there are (enough) simple answers: in what ways should UTF-8 strings be handled in C? And C++? And what about wide chars -- what real use do they see, and when should I bother making a program support them? (Well, Qt uses UTF-16, which might be better for Japanese and similar, but UTF-16 is more failure-prone for content storage.) Is there any chance that UTF-8 is going to be replaced by UTF-16?
How common is it for projects to define their own string types, and why?
The way to handle them is about the same for C and C++ -- only the language idioms will differ.
ATM, you can still get away with ASCII-only programs, but as we move into the future there is good reason to make your programs capable of handling multiple languages.
UTF-16 is not more failure-prone for content storage. There are some caveats to its use, but the same is true of UTF-8.
There is no chance that one will supplant the other. Both are simply encodings. Microsoft Windows systems tend towards UTF-16. POSIX systems tend towards UTF-8 (with UTF-32 internal storage).
Defining the string isn't the problem. Converting between encodings is.
A popular system is iconv, which is very portable and available on most systems.
Another is IBM's ICU -- though this is very bulky and difficult to set up.
Unfortunately with Unicode, there is no simple answer -- at least not yet. Making your applications use them is itself a very large project. To begin with, choose an internal encoding -- UTF-8, UTF-16, UTF-32, it doesn't really matter which -- and stick with it.
For more information, you can google around "c++ unicode" and the like.
Thank you Duoas!
But why isn't UTF-16 more failure-prone?
I read in [1]: "If the byte stream is subject to corruption then some encodings recover better than others. UTF-8 and UTF-EBCDIC are best in this regard as they can always resynchronise at the start of the next good character, " .. " UTF-16 and UTF-32 will handle corrupt (altered) bytes by resynchronizing on the next good character, but a lost or spurious byte (octet) will garble all following text."
So, losing a byte, or erroneously reading one as spurious (or vice versa), really is that unlikely? (Well, apparently unlikely enough, luckily.) Meanwhile, UTF-8 will certainly return to correct readability, because a sequence is never longer than 4 bytes -- originally up to 6, but as per an RFC (3629, I guess) it's limited to 4 -- so only the damaged character is lost...