Most file systems allow a filename that looks like a UTF-8 representation, but isn't, or a filename that would be an illegal sequence of bytes in a UTF-8 string. Of course, having both options would be nice! |
I'm of the opinion that this should be hidden from the programmer in a standard 'catch-all' library. The bottom line is that whatever C++ does for files will never
match every file system exactly -- so the goal of the library should be to make it
work with every file system. If that involves converting UTF-8 to whatever format the filesystem uses, then that's what it should do.
The programmer shouldn't have to worry about which format the filesystem uses for filenames -- that's micromanagement. If the programmer
wants to deal with such details, they can opt to make filesystem I/O calls directly -- this is the beauty of C++. But they shouldn't
have to.
Unicode codepoints can represent virtually any string used for a filename, so why not take advantage of that? Sure, it may not work 100% on some obscure filesystem hardly anyone ever uses, but that would be better than hardly working at all (which is what I'd call its current state).
The thing that makes this so tragic is that the interface doesn't even have to change. Just the implementation does.
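Something along these lines is what I have in mind -- the function name and the details are made up, purely a sketch of how the conversion could be tucked away behind one call:

#include <cstdio>
#include <cstring>
#include <string>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical wrapper -- not part of any standard library.  It takes a
// UTF-8 filename, converts it to whatever the underlying filesystem
// expects, and opens the file.
std::FILE* open_file_utf8(const std::string& utf8_name, const char* mode)
{
#ifdef _WIN32
    // NTFS stores names as UTF-16, so convert the UTF-8 input first.
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8_name.c_str(), -1, 0, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8_name.c_str(), -1, &wide[0], len);

    std::wstring wmode(mode, mode + std::strlen(mode));  // mode is plain ASCII
    return _wfopen(wide.c_str(), wmode.c_str());
#else
    // Most Unix filesystems treat names as raw bytes, so UTF-8 passes through.
    return std::fopen(utf8_name.c_str(), mode);
#endif
}

The caller just passes UTF-8 and never sees which branch ran -- that's the whole point.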
A sub-$1000 laptop these days has 2GB of memory. The in-memory size of a character is really a 20th century worry. |
I try really hard not to get caught up in this line of thought. To me it makes no sense that programs should be intentionally less efficient simply because computers are more powerful. UTF-32 would make the programmer's job slightly easier, sure, but at the cost of heavier RAM usage, bigger executables, more cache misses, slower code, etc, etc.
Though I realize this is a heavy debate and we simply disagree on it -- and it isn't really my primary point anyway, so I don't really want to pursue this further. Plus I'm still on a 20th century computer with only 512MB of RAM, so I'm probably biased.
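Just to put a number on what I mean, here's a toy comparison (assuming 32-bit code units for UTF-32 -- nothing more to it than that):

#include <iostream>
#include <string>
#include <vector>

int main()
{
    // An ordinary ASCII filename.
    std::string utf8 = "C:/projects/readme.txt";

    // The same text stored as one 32-bit codepoint per character,
    // the way a UTF-32 string class would hold it.
    std::vector<unsigned int> utf32(utf8.begin(), utf8.end());

    std::cout << utf8.size()  * sizeof(char)         << " bytes as UTF-8\n";   // 22
    std::cout << utf32.size() * sizeof(unsigned int) << " bytes as UTF-32\n";  // 88
}

Four times the memory for the same mostly-ASCII text, and that ratio shows up everywhere the string gets copied or cached.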
but is a poor language for portability across things like filesystems. |
It's actually great for it,
provided you have the right libraries. I've been using wxWidgets for a while now, and it is fantastic for this kind of thing. I'll take it over .NET or Java any day of the week.
C++ blends the best of high- and mid-level languages in that it
can be portable, but it can also do low-level things specific to a particular machine that other, higher-level languages can't.
Basically what I'm saying is, it's all about the libraries. The language is just the interface to communicate with those libs. I'm not saying that standard libs should be as extensive as something like wxWidgets... but they should at least be functional for practical modern-day applications. And localization
should be a concern for any modern-day programmer (or at least for those who intend to release their programs to the public).
C++ strings/files are fine for quick test programs, but once you get past a certain point, you're pretty much forced to use something else, or you're stuck with some really crippling limitations.
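For example, here's roughly how it looks with wxWidgets (Unicode build assumed -- I'm not claiming this is the only or the official way, just the flavor of it):

#include <wx/string.h>
#include <wx/ffile.h>

bool can_open(const char* utf8_name)
{
    // UTF-8 bytes in, library-internal wide string out.
    wxString name(utf8_name, wxConvUTF8);

    // wxWidgets hands the name to the OS in whatever form that
    // platform's filesystem wants -- the caller never has to care.
    wxFFile file(name, wxT("rb"));
    return file.IsOpened();
}

Same idea as before: the encoding juggling happens inside the library, not in application code.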
Yes, I too create some class derived from basic_string<> |
heh, I didn't even use basic_string XD, I made mine from scratch. I wasn't sure whether basic_string was reference counted, among other things.
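It wasn't anything fancy. A stripped-down sketch of the basic idea (not the actual class, just to show the flavor) looks something like this: store UTF-8 bytes internally, expose codepoint-level operations.

#include <string>
#include <cstddef>

class utf8string
{
public:
    utf8string(const char* s) : bytes_(s) {}

    // Number of codepoints, not bytes: UTF-8 continuation bytes
    // (10xxxxxx) are skipped when counting.
    std::size_t length() const
    {
        std::size_t count = 0;
        for (std::size_t i = 0; i < bytes_.size(); ++i)
            if ((static_cast<unsigned char>(bytes_[i]) & 0xC0) != 0x80)
                ++count;
        return count;
    }

    const char* c_str() const { return bytes_.c_str(); }

private:
    std::string bytes_;   // raw UTF-8 -- no reference counting to worry about
};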
------------------------------
Technically, all NT file system calls use wchar_t, which happens to be of size 2
<snip>
Vista-64's wchar_t is of size 4 |
Great example of the ridiculousness of C++ strings. "Wide character" is a pretty meaningless and useless term -- or at least its meaning varies far too much for it to be of any practical value.
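You can see it for yourself with a one-liner -- the answer changes depending on which compiler and platform you build with:

#include <iostream>

int main()
{
    // The result depends entirely on the compiler and platform:
    // the standard only requires wchar_t to hold "wide" characters,
    // it says nothing about how many bytes that takes.
    std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << '\n';
}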
You mean one that only works with Unicode? I don't. |
I meant one that works internally on Unicode. All characters in S-JIS (and virtually every other locale) can be represented in Unicode -- so when the library interacts with the filesystem (or whatever else), it can make the necessary conversion. The programmer need not know anything about it.
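As a rough sketch of what I mean (I'm using iconv here purely as an example -- the library could use whatever conversion facility it likes, and the encoding names vary by platform):

#include <iconv.h>
#include <string>

// Convert an S-JIS string to UTF-8 so the library can work in Unicode
// internally.  POSIX-only sketch; error handling kept minimal.
std::string sjis_to_utf8(const std::string& sjis)
{
    iconv_t cd = iconv_open("UTF-8", "SHIFT-JIS");
    if (cd == (iconv_t)-1)
        return std::string();                  // conversion not available

    std::string out(sjis.size() * 4, '\0');    // UTF-8 may need more room
    char*  in_ptr   = const_cast<char*>(sjis.data());
    size_t in_left  = sjis.size();
    char*  out_ptr  = &out[0];
    size_t out_left = out.size();

    iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);

    out.resize(out.size() - out_left);         // trim unused space
    return out;
}

Whether it's S-JIS, Latin-1, or anything else, the program hands over its strings and the library quietly does the conversion at the boundary.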