Filenames and Unicode

Jun 19, 2009 at 1:40am
Most modern applications use a GUI that deals with Unicode strings, represented either as wchar_t *, std::wstring, or some GUI-specific equivalent such as the wxWidgets wxString. Furthermore, XML files are almost always in Unicode, generally represented as UTF-8. The use of Unicode is central to internationalization, since English is almost the only language that can be represented without the use of accented or non-western characters.

However, support for Unicode is almost completely absent from the definitions of the C and C++ standard libraries. The word literally does not occur in the standards documents. For example, there is no equivalent of the fopen function that accepts a Unicode file name, nor is there a form of std::fstream that can be constructed with a Unicode file name.

The justification given for this restriction in the standards is apparently that there are file systems out there, for example FAT and the ext family, that support only 8-bit characters in file names. However, the popular NTFS and HFS+ file systems do support at least a large subset of Unicode file names. So the standard C and C++ libraries cannot be used to access some files stored on those file systems. It seems obvious to me that the fact that a few file systems cannot handle certain file names is no reason for the libraries not to support them.

In searching the web, the only advice I can find for people trying to map between Unicode strings and the char * parameters demanded by the standard libraries is to call wcstombs. The first problem with wcstombs is that the only documentation given for it on any platform I have checked is essentially a quote from the standards. The basic term "multibyte character" is never formally defined. A further inexactitude is that the nature of the translation is subject to an undocumented dependence on locale. Since any specific implementation must be deterministic, it would be nice if someone would take the time to document exactly what their implementation does. I would imagine that most current implementations perform something along the lines of the translation defined from UTF-32 to UTF-8.

However, if that is the case, then there is a problem with the advice to use wcstombs to translate a Unicode string to a char *: most file systems, including the most popular file systems on Unix, permit characters in the high portion of the code page to be used in file names. If those characters are present in a Unicode string, wcstombs will convert them to two bytes each, which will not match what the file system is expecting. wcstombs will only work if all characters in the file name are restricted to the bottom half of the first code page, or if there is a special locale for wcstombs that rejects all characters that are not in code page 0 and represents each of the characters in code page 0 as a single byte. But in order for such a locale to be usable it would have to be defined as part of the C and C++ standards, so that code using wcstombs for this case would be portable.
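To make the locale dependence concrete, here is roughly what happens when a wide string containing characters above 0x7F goes through wcstombs (a minimal sketch; the locale name "en_US.UTF-8" is an assumption, and which locales exist varies by platform):

#include <clocale>
#include <cstdlib>
#include <cstdio>

int main()
{
    // wcstombs output depends entirely on the current LC_CTYPE locale.
    // Assumed locale name; many systems spell it differently or lack it.
    std::setlocale(LC_CTYPE, "en_US.UTF-8");

    const wchar_t* name = L"r\u00e9sum\u00e9.txt";   // "résumé.txt", 10 characters
    char buf[64];
    std::size_t n = std::wcstombs(buf, name, sizeof buf);

    if (n == (std::size_t)-1)
        std::puts("some character is not representable in this locale");
    else
        std::printf("%lu bytes for 10 characters\n", (unsigned long)n);
        // In a UTF-8 locale this prints 12: each 'é' becomes two bytes.
}

In the "C" locale the same call typically fails outright, which is exactly the portability problem described above.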

It is quite frustrating that the organizations that contribute to the C and C++ standards, most of which produce C and C++ compilers, have chosen not to address this portability issue.

Jun 19, 2009 at 2:00am
So the standard C and C++ libraries cannot be used to access some files stored on those file systems.
Correction: they can't be used to access files with names that use characters above the 0xFF code point.

Do you know what "code page" means? Because I think you don't.

The answer is quite short: if you need to open a file with a Unicode name, use a third party library. Boost is not a bad choice.
Last edited on Jun 19, 2009 at 2:01am
Jun 19, 2009 at 3:15am
I share the OP's frustration at the lack of Unicode support pretty much everywhere in the standard libs. Which is why I keep an arm's length away from standard file I/O and std::string (as well as everything that uses them -- since, after all, they're useless). Even boost lacks a good string class (at least the last time I checked it -- perhaps I missed it).

I ended up writing my own classes (actually I'm working on a full lib) which sort of address these issues: utf8, utf16, utf32 string classes, all interchangeable (conversion done behind the scenes), with Unicode-aware length calculations (it differentiates between the "length" and "size" of the string), an iterator to step through and get utf32 codepoints one by one (so you don't have to do any utf encoding/decoding in the program). The strings work with my custom file I/O classes, which add other functionality the standard libs sadly lack (length of file? file resizing/truncation? endian-safe integer writing?), etc, etc.
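For the curious, getting UTF-32 code points out of UTF-8 boils down to something like this (a bare-bones sketch, not my actual classes, and it assumes well-formed input):

#include <string>

// Decode the UTF-8 sequence starting at s[i] into one UTF-32 code point,
// advancing i past it. Assumes well-formed UTF-8; real code would validate.
unsigned long next_codepoint(const std::string& s, std::string::size_type& i)
{
    unsigned char b = static_cast<unsigned char>(s[i++]);
    if (b < 0x80) return b;                            // single-byte sequence
    int extra = (b >= 0xF0) ? 3 : (b >= 0xE0) ? 2 : 1; // number of trailing bytes
    unsigned long cp = b & (0x3F >> extra);            // payload bits of the lead byte
    while (extra-- > 0)
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

// "Length" counts code points; the string's size() counts bytes.
std::string::size_type length(const std::string& utf8)
{
    std::string::size_type n = 0, i = 0;
    while (i < utf8.size()) { next_codepoint(utf8, i); ++n; }
    return n;
}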

I can't recommend this route to everyone. I'm a do-it-yourselfer at heart, so I didn't really mind. Plus I'm a hobbyist, so it's not like I had to burn money spending professional time on it. But yeah, it is rather frustrating how utterly useless a large portion of the standard lib becomes as soon as you want to go beyond the bare minimum.


Also code pages are becoming more and more of an outdated concept. Virtually everything in the modern PC world is Unicode these days. Having to learn about code pages and how to manipulate them properly is arguably much harder than just learning how to get Unicode to work.
Jun 19, 2009 at 3:37am
I have to say, I don't see anything wrong with the standard strings library. The fact that it doesn't handle UTF is just a design choice. Personally, I think it's retarded to use UTF for internal representation, just like I think it's retarded to use UCS-n for storage.
Jun 19, 2009 at 3:44am
POSIX systems typically allow UTF-8 encoded filenames.

Win32 systems (NTFS) permit 16-bit character names. You cannot use the portable constructors, but with Microsoft's library extensions you can actually open a standard stream with a wide-character name.

The first step is to check out this handy article:
CodeProject: A Handy Guide to Handling Handles
http://www.codeproject.com/KB/files/handles.aspx

The obnoxiousness is that the C++ standard does not require the fstream::attach() method or the constructor taking a file descriptor. And, alas, the latest versions of GCC have made them disappear. (All this is rationalized with the idea that file descriptors and the like are OS-specific, and hence somehow 'dirty' for a nice, clean, abstract standard.)

The only recourse would be to write yourself a little filebuf that works with a file descriptor or Win32 file HANDLE or whatever it is you want to access with the C++ fstream.
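A stripped-down version of that idea, read-only and POSIX-flavored (substitute ReadFile() on a HANDLE for ::read() on Win32), might look something like this -- a sketch, not production code:

#include <istream>
#include <streambuf>
#include <unistd.h>   // ::read(); on Win32 you would use ReadFile() on a HANDLE

// Minimal read-only streambuf over a POSIX file descriptor.
class fd_inbuf : public std::streambuf
{
    int  fd_;
    char buf_[4096];
protected:
    int_type underflow()
    {
        if (gptr() < egptr())
            return traits_type::to_int_type(*gptr());
        ssize_t n = ::read(fd_, buf_, sizeof buf_);
        if (n <= 0)
            return traits_type::eof();
        setg(buf_, buf_, buf_ + n);
        return traits_type::to_int_type(*gptr());
    }
public:
    explicit fd_inbuf(int fd) : fd_(fd) { setg(buf_, buf_, buf_); }
};

// Usage: open the file with whatever OS call accepts the real file name,
// then hand the descriptor to the buffer and the buffer to an istream:
//     int fd = ::open(utf8_name, O_RDONLY);   // needs <fcntl.h>
//     fd_inbuf buf(fd);
//     std::istream in(&buf);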


I recommend that you use UTF-8 to handle all file names. On Windows systems you can add in a call to MultiByteToWideChar() using CP_UTF8 as the 'code page' conversion specifier. Then you can open the file using the normal Win32 CreateFile() function, and use the resulting HANDLE to access the file (as indicated above).
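Something along these lines (a sketch that assumes the incoming name really is valid UTF-8; error handling kept to the bare minimum):

#include <windows.h>
#include <string>

// Convert a UTF-8 file name to UTF-16 and open it with the Win32 API.
HANDLE open_utf8_file(const std::string& utf8name)
{
    // First call: ask how many wchar_t units are needed (including the NUL).
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8name.c_str(), -1, NULL, 0);
    if (len == 0)
        return INVALID_HANDLE_VALUE;

    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8name.c_str(), -1, &wide[0], len);

    return CreateFileW(wide.c_str(), GENERIC_READ, FILE_SHARE_READ,
                       NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
}

The resulting HANDLE can then be fed to a custom filebuf like the one sketched above.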

Good luck!
Jun 19, 2009 at 4:16am
Personally, I think it's retarded to use UTF for internal representation, just like I think it's retarded to use UCS-n for storage.


So how else do you represent text internally without limiting yourself to a single code page? I can see wanting to avoid utf-8 due to the possibly frequent occurrence of multi-byte characters, but utf-16 and utf-32 are very practical to use internally.

The alternative is what... "wide character"s? What even is a wide character? Not utf-32 because it's only 16 bits on Windows. But not utf-16 because it's 32 bits on other platforms. And is it multi-byte (or really multi-wchar_t) or not? If you're sticking with just 16 bits, how do you represent codepoints above U+FFFF? Or do you just fail to support them?

If just using char, how do you manage the possibility of requiring multiple code pages? Do you constantly switch between code pages as needed? How do you keep track of which code page is used where? Do you keep a variable alongside each string to indicate which code page it uses? What if the same string requires multiple code pages? Does your program just start acting stupid if the user enters an abnormal string?

wchar_t is ambiguous and inconsistent to the point of being useless. And a simple char just isn't enough if you want halfway decent internationalization.
Last edited on Jun 19, 2009 at 4:30am
Jun 19, 2009 at 5:14am
So how else do you represent text internally without limiting yourself to a single code page?
You convert all input to UCS, each character represented as a single integer, and convert all output as required.

Not utf-32 because it's only 16 bits on Windows. But not utf-16 because it's 32 bits on other platforms.
I don't get this part.

Just use std::wstring. If you need something wider, use std::basic_string<unsigned long>. If you want to have integers of guaranteed size, use #define-enclosed typedefs. And wchar_t is not any less ambiguous than any of the other types. And I've yet to see a sizeof(wchar_t)<2, which already covers most languages in the world, by the way.
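(For the guaranteed-size idea, something like the following works -- the typedef names here are made up, and note that the standard doesn't actually guarantee a char_traits specialization for non-character types, so check your implementation:)

#include <string>

#if defined(_WIN32)
    typedef unsigned long ucs4_char;   // wchar_t is only 16 bits here
#else
    typedef wchar_t       ucs4_char;   // typically 32 bits on Linux and Mac OS X
#endif

// One code point per element, O(1) indexing; the Windows branch relies on the
// implementation providing a usable std::char_traits<ucs4_char>.
typedef std::basic_string<ucs4_char> ucs4_string;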

Internally encoding strings as UTF kills random access. Since my last experience with string iterators, I wouldn't use a UTF string class for any n even moderately large (say, 1M+).
Jun 19, 2009 at 6:30am
Just use std::wstring

But std::wstring doesn't work with most standard lib functions, or with most 3rd party libs. std::basic_string<unsigned long> is even worse. What good is a string class if you can't use it anywhere? You'll have to be constantly converting to and from various formats for I/O. That's fine if I/O is minimal and text processing is heavy -- but I'd wager that in most applications that's usually not the case. Text input and printing are far more common in my experience.

And wchar_t is not any less ambiguous than any of the other types


True.

Re: the rest:

If you want to draw the distinction between supporting any text or just supporting most text, you can weigh the balance as to whether or not you want to go with 16-bit or 32-bit characters internally. Or you can let sizeof(wchar_t) decide your fate (and your memory consumption).

If random access is really a concern, UTF-32 is always an option. Or if you want to forgo codepoints over U+FFFF, you can go with UCS2 to cut down on memory usage -- but in either event, wchar_t is a poor solution. It either isn't large enough for UTF-32, or burns a hole in your RAM for UCS2.

Though, honestly, I don't see where you'd need random access in a string class for most things. About the only thing I can think of would be for formatting numerals for output. Any other text processing I can think of is (or could be) done sequentially most of the time -- so an iterator would work just fine.

I suppose if you're working on a text editor where you want to modify exceedingly large files gracefully, you'd need to develop something custom for that (but you'd need to customize file i/o too -- simply reading the entire file to a string/buffer and modifying it with random access isn't very advisable -- random access is also a poor choice if you're going to allow deleting/inserting characters mid-string).

If you have more practical examples of why random access is such a crucial selling point of a string class, I'm interested in hearing them. From my standpoint it just doesn't seem important at all. Far less important than being able to read/transfer/output any possible form of text.

I think it comes down to "most things" vs. "all things". There won't be a string class that satisfies everyone's needs in every possible situation. But for most things, IMO, you're better off with some kind of UTF string class that doesn't necessarily have random access.

Barring those extraordinary circumstances where you're better off coming up with something yourself -- a UTF string class, to me, seems by far the most beneficial in the most ways most of the time. If random access with limited character support is more important, then use a vector. That's basically all std::string is anyway -- a vector with a few operator overloads.

Given that the standard lib is meant to be generic -- it should apply to most things. std::string does not, and therefore fails hard in my book.
Last edited on Jun 19, 2009 at 6:33am
Jun 19, 2009 at 7:17am
Actually the sequential access part was meant to be a small note, since it's true, most string operations are sequential, but I forgot to add this: the problem is that every access to a character in the string has an overhead, however small, and when n becomes large enough, the sum of all these overheads becomes quite apparent.
The last time I used an iterator to go through a large string (approx. 4 million characters) the time taken went up by an order of magnitude as opposed to incrementing a pointer.

It's also true that it depends on the application. I said that using UTF for internal strings is retarded because I'm used to actually using the strings I load in my programs (lots of parsing, etc.), but I still think std::basic_string<T> is more generic than a UTF string.
Think about it. With just a single template, you have a string that uses characters of any size, so the memory overhead depends on the programmer. Plus, you get O(1) access and zero overhead for reading a character. So you have a string class that can handle the entire UCS, given a large enough integer, at no extra computational cost, and without specialized algorithms. Compare that to a string that has overhead for character access and O(n) indexing, and requires a special algorithm for each of the three versions. Which is useful in more situations?
Jun 19, 2009 at 2:36pm
It depends on how often you're parsing the text, the means by which you're doing it, and where the text is coming from. One of the big problems with using a character of a fixed size is that text files aren't stored that way, and widget libs and other forms of text input do not provide text that way, nor do they accept output that way. So instead of doing the conversions in the text parsing step, you're doing it in the file read (and possibly write) step.

I suppose the same could be said for Unicode. After all not all files are stored in utf-8, so a conversion would be necessary in those situations as well. Perhaps that point is moot.

The last time I used an iterator to go through a large string (approx. 4 million characters) the time taken went up by an order of magnitude as opposed to incrementing a pointer.


I suppose I just can't fathom a situation in which it would be practical to maintain a string object this large. But this is what I'm talking about by extraordinary circumstances. By far, most string operations are nothing like this -- and the overhead involved in using iterators in normal situations is negligible.

I should also note that any remotely intelligent string/iterator design would be able to 'seek' through a simple string (one where size==length) without substantial overhead -- so random access would not be computationally expensive in those cases -- as would be the case with a UTF-32 string, and most UTF-16 strings.


Anyway... apparently we've just had different needs for a string class in the programs we've written. Perhaps instead of bluntly saying std::string is useless, I should've said it doesn't fulfill the needs I typically have for a string class. I guess we'll have to agree to disagree. =)
Jun 19, 2009 at 2:48pm
I guess we'll have to agree to disagree.
I don't agree with that. :-)
Topic archived. No new replies allowed.