Unicode filenames/strings

Forum

Forum
General C++ Programming
Unicode filenames/strings

Unicode filenames/strings

Feb 24, 2009 at 1:41am

I'm looking to be unicode friendly. Really, the only real time I'll need to deal with strings is for file names. On-screen text printing is handled seperately and isn't something I need to concern myself with. It looks to me that fstream (as well as C's fopen) take const char* for the filename instead of a string object -- which is fine. I know to call string::c_str.

Part of my confusion lies in that it seems to be const char* and not const wchar_t*. So this would work with std::string, but not with std::wstring? (is wstring even part of the standard? I didn't see mention of it on this site, but it's on MSDN -- is it MS specific?). If this is the case it makes wstring pretty much useless. I mean what good is a string class if no other area of the library excepts that string's format?

I guess my question is... does C++'s string class (and more importantly it's fstream class) handle unicode filenames gracefully? Does std::string hold utf-8 encoded strings? If that's the case, does string::size return the real character count or the byte count? And is this behavior standardized across all platforms/compilers?

Up until now, I've been working with a seperate library (wxWidgets) which has its own string and file classes. But now the program I'm working on can't have any of that, so I need to stick to standard C++... and documentation on the subject doesn't seem to be as clear as it should be.

Any help appreciated!

Feb 24, 2009 at 3:24am

helios (17607)

I don't think it's possible to open a file in C/++ using wchar_t. Probably because not all file systems support Unicode (the FAT family comes to mind). I think it might be possible to use UTF-8 strings as a file name in some systems, but I really don't know.

std::wstring is very much standard. It uses wchar_t internally, so of course the size of its characters depend on the size of wchar_t (sizeof(char)<=sizeof(wchar_t), by the way).

std::string uses plain chars as its elements, none of that fancy UTF-8.

Bottom line: If you need to open Unicode file names, either use system calls or Boost if you want to stay portable.

Feb 24, 2009 at 4:01am

Disch (13742)

ugh. That's what I was afraid of. I'm reminded of why I avoided using std::string my whole C++ life. Utterly useless.

I want to refrain from using Boost as I've heard some questional stuff about it -- like that it takes advantage of known compiler bugs/flaws to do things which wouldn't normally be possible in standard C++.

Guess I'll have to cruise the net for a better file I/O lib than what C++ has out of the box. I really wish standards were updated to handle localization better. This really isn't something the programmer should have to worry about so much.

Anyway, thanks!

Feb 26, 2009 at 2:03am

Skorj (27)

The answer to your question is somewhat platform specific. In some fielsystems, filenames are defined as strings of char, in others (NTFS) they're strings of wchar_t. In Win32 programming, most of the file API works natively with wide strings, but on most other platforms there's no support for wide char filenames in the system library.

There's no common library string class that is "codepoint oriented" in C++, so UTF-8 wouldn't be handled the way you'd like. If your really want to handle Unicode strings safely, you'd wan't basic_string<unsigned> anyway, as there are Unicode codepoints above 0xFFFF.

Filenames aren't usually such a problem, as the system library will generally accept wide strings if there is a native filesystem that uses wide strings.

Feb 26, 2009 at 10:22am

Disch (13742)

In some fielsystems, filenames are defined as strings of char, in others (NTFS) they're strings of wchar_t.

I realize this, but I stand on my opinion that the C/C++ standard file i/o library should treat the const char* string filenames passed utf-8, and convert to wide character strings where necessary in their implementations. I really can't think of a good reason why they don't do this. This is a huge oversight in the standard library, imo.

you'd wan't basic_string<unsigned> anyway

The thing is such a string would be totally useless because no library functions take it. So sure that would hold a utf-32 string nicely enough, but I wouldn't be able to do anything with it once I have it. (That's beside the point that utf-32 is a huge waste of space)

This is one of the big reasons why I dislike std::string. It's really nothing more than a glorified vector with a few replace functions. It's completely inadequate for any kind of real string work.

Filenames aren't usually such a problem, as the system library

Right. But the system library isn't easily portable. I realize it's possible to make due -- but I was hoping for a standardized way to do this that would be easily portable and wouldn't be a huge pain in the butt. Apparently there isn't one.

I settled on making my own string class which can convert between utf8<->utf16<->utf32 (just spent a few hours on this, finally got it done). And I'll have to make a series of File classes derived from an abstract base class which will use these strings for filenames so I can convert to the appropriate character size for the current platform, and use that platform's native API for file i/o.

A lot of extra work to do handle details that no modern-day programmer should have to waste time thinking/worrying about. It really gets me that C++ is so far behind on this. Instead of just having a simple and functional string API, the programmer has to wrestle with localization and wchar_t ambiguity issues.

But I guess that's why people like Boost.

</complaining>

Anyway thanks for the help, guys. Sorry I kind of sound like a big jerk XD, I'm not normally one, I swear!

Feb 26, 2009 at 7:56pm

Skorj (27)

There is no definition of a "filename" in C++, because that's a property of any given file system, not of the language. Most file systems allow a filename that looks like a UTF-8 representation, but isn't, or a filename that would be an illegal sequence of bytes in a UTF-8 string. Of course, having both options would be nice!

So sure that would hold a utf-32 string nicely enough, but I wouldn't be able to do anything with it once I have it. (That's beside the point that utf-32 is a huge waste of space)

A sub-$1000 laptop these days has 2GB of memory. The in-memory size of a character is really a 20th century worry. Microsoft tried to get ahead of the curve by making all system calls use 16-bit characters, but it seems that wasn't forward-thinking enough. IMO, the right answer is for platform SDKs to move to UTF-32 for everything.

But the system library isn't easily portable

True. C++ is a good language for doing hardware/architecture/system specific work, but is a poor language for portability across things like filesystems. There are plenty of high-level languages that let you (make you) program against an abstract virtualization of a system.

I settled on making my own string class which can convert between utf8<->utf16<->utf32 (just spent a few hours on this, finally got it done).

Yes, I too create some class derived from basic_string<> when I need to solve the same problem. I wouldn't say std::string is useless at all - it's great as a starting point for this.

I share your desire for a "codepoint aware" standard string class, along with standard library function that provide a threading abstraction. Sadly, the C++0x standard is a bit behind such modern concerns. Soon we'll be getting regular expressions, a better macro language, opt-in garbage collection, and other things that were hot 10 years ago. Hey, at least it's progress.

Feb 26, 2009 at 8:28pm

helios (17607)

Microsoft tried to get ahead of the curve by making all system calls use 16-bit characters, but it seems that wasn't forward-thinking enough.

Technically, all NT file system calls use wchar_t, which happens to be of size 2. If it was necessary to use wider characters, resizing wchar_t wouldn't be much of a problem (for new builds, that is. IINM, Vista-64's wchar_t is of size 4). Since NTFS uses UTF-8 for file names, I'm guessing this design was chosen to relieve the programmer from having to do the conversion themselves. Then again, a convenience system call for this sort of thing would have been nice.

True. C++ is a good language for doing hardware/architecture/system specific work, but is a poor language for portability across things like filesystems.

I think you're thinking of Assembly. Both C and C++ are used all the time to write applications that can run on any platform. The fact that the standard library only supports narrow characters for file names is evidence of this. If for any reason your application needs to be able to be able to use wide character strings to open files, there are libraries that can be used for this.

I share your desire for a "codepoint aware" standard string class

You mean one that only works with Unicode? I don't. That would reduce the portability of the code, since not all systems support Unicode. Shift JIS, for instance, is still in use in Japan, even though it's as obsolete as a technology can get. EBCDIC is also still in use, although in a different context. Also, despite everything pointing to UCS being The Last Code Page, it's really impossible to know what will happen in the future.

While I like threading, I have to disagree with regular expressions and garbage collection. In my opinion, they go against the language's philosophy. "You need regex? Get a library! You need garbage collection? Switch languages! This is a language for Men! If you can't keep track of all those pointers, you should go back to Java and code another silly applet."

Feb 26, 2009 at 8:38pm

Disch (13742)

Most file systems allow a filename that looks like a UTF-8 representation, but isn't, or a filename that would be an illegal sequence of bytes in a UTF-8 string. Of course, having both options would be nice!

I'm of the opinion that this should be hidden from the programmer in a standard 'catch-all' library. The bottom line is that whatever C++ does for files will never match every file system exactly -- so the goal of the library should be to make it work with every file system. If this involves converting utf-8 to whatever format the filesystem uses, then that's what it should do.

The programmer shouldn't have to worry about which format the filesystem uses for filenames -- that's micromanagement. If the programmer chooses to want to deal with such details, then they can opt to deal with filesystem I/O calls directly -- this is the beauty of C++. However they should not have to.

Unicode codepoints can represent virtually any string used for a filename.. so why not take advantage of that. Sure it may not work 100% in some obscure filesystem hardly anyone ever uses, but it would be better than hardly working at all (which is what I'd call its current state).

The thing that makes this so tragic is that the interface doesn't even have to change. Just the implimentation does.

A sub-$1000 laptop these days has 2GB of memory. The in-memory size of a character is really a 20th century worry.

I try really hard not to get caught up in this line of thought. To me it makes no sense that programs should be intentially less efficient simply because computers are more powerful. UTF-32 would make the programmer's job slightly easier, sure, but at the cost of heavier RAM usage, bigger executables, more cache misses, slower code, etc, etc.

Though I realize this is a heavy debate and we simply disagree on it -- and it isn't really my primary point anyway, so I don't really want to pursue this further. Plus I'm still on a 20th century computer with only 512MB of RAM, so I'm probably biased.

but is a poor language for portability across things like filesystems.

It's actually great for it provided you have the right libraries. I've been using wxWidgets for a while now and it is fantastic for this kind of thing. I'll take it over .NET or Java any day of the week.

C++ blends the best of high and mid-level languages in that it can be portable, but also can do low level things specific to a particular machine that other, higher level languages can't.

Basically what I'm saying is, it's all about the libraries. The language is just the interface to communicate with those libs. I'm not saying that standard libs should be as extensive as something like wxWidgets... but they should at least be functional for practical modern-day application. And localization should be a concern for any modern-day programmer (or at least to those who intend to release their programs to the public).

C++ strings/files are fine for quick test programs, but once you get past a certain point, you pretty much are forced to use something else, or you're stuck with some really crippling limitations.

Yes, I too create some class derived from basic_string<>

heh, I didn't even use basic_string XD, I made mine from scratch. Wasn't sure if basic_string reference counted, among other things.

------------------------------

Technically, all NT file system calls use wchar_t, which happens to be of size 2
<snip>
Vista-64's wchar_t is of size 4

Great example of the ridiculousness of C++ strings. "wide character" is a pretty meaningless and useless term. Or at least its use varies far too much to be of any practical value.

You mean one that only works with Unicode? I don't.

I meant one that works internally on Unicode. All characters in S-JIS (and virtually every other locale) can be represented in Unicode -- so when the library interacts with the filesystem (or whatever else), it can make the necessary conversion. The programmer need not know anything about it.

Last edited on Feb 26, 2009 at 8:47pm

Feb 26, 2009 at 9:15pm

helios (17607)

Great example of the ridiculousness of C++ strings. "wide character" is a pretty meaningless and useless term. Or at least its use varies far too much to be of any practical value.

Actually, that has less to do with strings, and more to do with the C++ type system. wchar_t is only guaranteed to be at least as big as char. That's it.

And who do you propose do the conversion?
The language? That's beyond its scope. Also, code page conversion involves very big tables, and they must be included in the final binary.
The system? Then the language is no longer completely portable, because it's depending on the system to do those conversions.
The programmer? That's the case now. The programmer decides what kind of data a string holds and converts it as necessary.

Last edited on Feb 26, 2009 at 9:15pm

Feb 26, 2009 at 9:29pm

Skorj (27)

Just a quick note on "characters", "codepoints", and standards committees.

A "codepoint" is a number representing a (perhaps partial) glyph. UTF-x, and UCSx, Latin-1, and 7-bit ASCII all use the same codepoints (or a subset of the same set). All of these approaches use the same number for "a", unlike EBCIDIC with it's unique set of codepoints.

Translating between codepoints is easy in practice. S-JIS isn't hard here.

A "character" (in the sense of a complete glyph, not the sense of the char type) is a very difficult thing to describe. Character boundaries are hard, perhaps impossible to determine from a stream of codepoints. In Romanized alphabets, you can deal with cases like "nonspacing accent character followed by vowel" being the same character as the single codepoint representing "vowel with accent". In other scripts where everything is done though character composition, there's no universal agreement on what a "character" is.

Codepoints are easy, and the "Universal Character Set" that is now maintained by the Unicode standards group is clearly the future of defining codepoints. Characters are hard, and recognizing that two words are "the same" from a human point of view is not a well solved problem yet.

Finally, if the C++ standard library were to require some understanding of Unicode, this would be problematic in practice because these standards are defined in two different standards committees. The Unicode standard (if you include the character database) changes constantly. The W3C tried this with the XML 1.0 spec, and had to make the XML 1.1 spec as a way of giving up on keeping the standards in sync.

A function as simple as "is_letter()" would be obsolete before the standard was approved, and "to_lower()" would just be a nightmare.

Feb 26, 2009 at 9:53pm

Disch (13742)

Valid points all around. Perhaps my dream of a perfect world where text can be a simple subject is just that: a dream.

I had a lot of fun with this discussion though. You guys are a great bunch to talk to about this kind of thing.

Topic archived. No new replies allowed.