Specializing char_traits with your own char type (UTF-8 issue clear-up)

Feb 24, 2011 at 4:54pm
I know UTF-8 questions have probably been asked since the very dawn of C++ time, and someone has probably answered this one, but no amount of googling has so far given me a clear answer for the following situation:

Suppose:
- uchar is your own type built to store one UTF-8 character (NOT a typedef for unsigned char!),
- you've done a proper specialization of char_traits for this uchar type.

Can you then instantiate basic_string with your uchar type and have it provide proper UTF-8 functionality as just another STL string class?

If no, why not?
Obvious sub-question: is it possible to properly specialize char_traits for your own variable-length type?

(It's my C++ course task: implementing a UTF-8 string. I'm thinking this would be a neat way of doing it, neater than coding another UTF-8 string class from scratch.)

Thanks!
--Jan
(MFF UK, Prague)
Feb 24, 2011 at 10:17pm
> I know UTF-8 questions have probably been asked since the very dawn of C++ time, and someone has probably answered this one, but no amount of googling has so far given me a clear answer for the following situation:
It has been asked since before C++, and the reason there is no one easy answer is twofold: one, the answer just isn't easy; and two, other methodologies already exist to do it without much grief in C++.

> Suppose:
> ...
> Can you then instantiate basic_string with your uchar type and have it provide proper UTF-8 functionality as just another STL string class?
That is actually a very loaded question. The answer is, "Yes, with caveats."

Putting all the UTF-8 stuff into a string class is a mistake. Yes, I know other people have done it, relatively successfully even (I've even done it myself), but the string class is the wrong place to handle encoding issues. Remember, do one thing at a time. Store a string of characters. Transform a string. Read and write a string from a stream. These are three separate things.

The STL iostreams classes are designed to handle the last two. Unfortunately, they only do the last one very well -- transformation is one of the "black box" issues in the STL, meaning that it is an "implementation defined" nightmare.

> Obvious sub-question: is it possible to properly specialize char_traits for your own variable-length type?
Yes, maybe. I say "yes" because you are supposed to be able to, given the design of C++, but "maybe" because of the previously mentioned black arts in implementation design. Even very nice implementations of the STL are noticeably lacking when it comes to handling specialized traits classes properly -- particularly in the streams interface.
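
To make the caveat concrete, here is a minimal sketch of what such a specialization might look like. Everything in it is an assumption for illustration: the layout of uchar (one whole code point per element), the ustring typedef, and the choice of 0xFFFFFFFF as the eof value. Note what it implies: basic_string stores fixed-size elements and size() counts elements, so a truly variable-length element type does not fit. With this layout the storage is effectively UTF-32, not UTF-8. The sketch also stays away from the streams interface.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>   // std::memmove, std::memcpy
#include <cwchar>    // std::mbstate_t
#include <ios>       // std::streamoff, std::streampos
#include <string>

// Hypothetical character type: one whole code point per element (fixed size).
struct uchar {
    std::uint32_t cp;
};

namespace std {
template <>
struct char_traits<uchar> {
    typedef uchar          char_type;
    typedef std::uint32_t  int_type;
    typedef std::streamoff off_type;
    typedef std::streampos pos_type;
    typedef std::mbstate_t state_type;

    static void assign(char_type& d, const char_type& s) { d = s; }
    static bool eq(const char_type& a, const char_type& b) { return a.cp == b.cp; }
    static bool lt(const char_type& a, const char_type& b) { return a.cp < b.cp; }

    // Lexicographic comparison by code point value.
    static int compare(const char_type* a, const char_type* b, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            if (lt(a[i], b[i])) return -1;
            if (lt(b[i], a[i])) return 1;
        }
        return 0;
    }
    // Length of a sequence terminated by a zero code point.
    static std::size_t length(const char_type* s) {
        std::size_t n = 0;
        while (s[n].cp != 0) ++n;
        return n;
    }
    static const char_type* find(const char_type* s, std::size_t n, const char_type& c) {
        for (std::size_t i = 0; i < n; ++i)
            if (eq(s[i], c)) return s + i;
        return 0;
    }
    // uchar is trivially copyable, so the byte-wise routines are safe here.
    static char_type* move(char_type* d, const char_type* s, std::size_t n) {
        if (n != 0) std::memmove(d, s, n * sizeof(char_type));
        return d;
    }
    static char_type* copy(char_type* d, const char_type* s, std::size_t n) {
        if (n != 0) std::memcpy(d, s, n * sizeof(char_type));
        return d;
    }
    static char_type* assign(char_type* d, std::size_t n, char_type c) {
        for (std::size_t i = 0; i < n; ++i) d[i] = c;
        return d;
    }

    static int_type eof() { return 0xFFFFFFFFu; }  // assumed: not a valid code point
    static int_type not_eof(int_type i) { return i == eof() ? 0 : i; }
    static char_type to_char_type(int_type i) { uchar c = { i }; return c; }
    static int_type to_int_type(char_type c) { return c.cp; }
    static bool eq_int_type(int_type a, int_type b) { return a == b; }
};
}  // namespace std

typedef std::basic_string<uchar> ustring;
```

With this in place, push_back, size(), operator[] and find() on a ustring all work per element, i.e. per code point. Getting UTF-8 bytes in and out of it is still a separate conversion step.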

> I'm thinking this would be a neat way of doing it, neater than coding another UTF-8 string class from scratch.
I agree, but unfortunately the STL is broken when it comes to things like this. The problem isn't just in various implementations (which is usually the showstopper) but in the design pattern behind it all to begin with... (Remember, the STL is still something of an experiment in effective design patterns.)

As for the "from scratch" part, it is actually an easier solution to just call a function to convert the string than it is to do it the templated way... Functional methods tend to spread state around in cases like this, where a procedural method is more straightforward.
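
As a rough illustration of the "just call a function" approach, a decoder from UTF-8 bytes to code points might look like the sketch below. The name utf8_to_utf32 is hypothetical, and the function is deliberately incomplete: it checks lead bytes, truncation and continuation bytes, but it does not reject overlong encodings, surrogates, or values above U+10FFFF.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode a UTF-8 byte string into UTF-32 code points.
// Throws std::invalid_argument on a bad lead byte, a truncated sequence,
// or a bad continuation byte. Overlong forms and surrogates are NOT rejected.
std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that must follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append the 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

All the decoding state lives in local variables of one function, which is the point about the procedural version being more straightforward.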


Fortunately, it is possible to have platform-independent, pure C++ code that implements a codecvt facet for UTF encoding. Alas, how to do this is black magic. I've got one in the works, but I haven't had time to mess with it lately...

Keep in mind that handling Unicode is more complicated than just UTF encodings, and you will usually need an additional library to deal with all that anyway... Usually such libraries can do UTF encoding and decoding for you...

Hope this helps.
Feb 24, 2011 at 10:42pm
It does help, very much so. Thanks!

Mine's not an encoding issue per se: A) it's a school assignment, and B) I'll probably be using the UTF string in a bigger project that handles large amounts of linguistic data stored in UTF, and I want to be able to work on that data without converting, precisely because of the "implementation defined" Things from Beyond.
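
For reference, working on UTF-8 data in place (no conversion, plain std::string storage) can be sketched as below. The helper names utf8_length and utf8_next are hypothetical, and both assume the input is already valid UTF-8; they rely only on the fact that continuation bytes have the bit pattern 10xxxxxx.

```cpp
#include <cstddef>
#include <string>

// True if the byte is a UTF-8 continuation byte (10xxxxxx).
inline bool is_continuation(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Hypothetical helper: number of code points in a valid UTF-8 string.
// Counts every byte that is not a continuation byte.
std::size_t utf8_length(const std::string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
        if (!is_continuation(s[i])) ++n;
    return n;
}

// Hypothetical helper: byte index of the code point following the one
// that starts at byte index i (returns s.size() at the end).
std::size_t utf8_next(const std::string& s, std::size_t i) {
    if (i < s.size()) ++i;
    while (i < s.size() && is_continuation(s[i])) ++i;
    return i;
}
```

Note that size() on the std::string still reports bytes; only these helpers see code points.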

(
> Fortunately, it is possible to have platform-independent, pure C++ code that implements a codecvt facet for UTF encoding. Alas, how to do this is black magic. I've got one in the works, but I haven't had time to mess with it lately...

...sounds a bit like Gandalf... <[:{>
)

Off to code, then. Thanks again!
Feb 25, 2011 at 2:12am
LOL, it isn't that complicated. Most of the grief comes from being forced to use the stock mbstate_t...

If you want, I can give you the basics to get you started...