Poll about Unicode

All over this place and many others, I keep seeing everybody using cout, and string, and all this ANSI stuff.

Myself, as a Windows programmer, I have fully abandoned them and use wcout and wstring 100% of the time. It just makes my life simpler, since I don't care about my stuff not running under Win9x.

Is Unicode fully supported in Linux, Mac, Unix, HP-OS, etc.? Are they the reason all this ANSI stuff is holding on?

So, if you are bored (and therefore reading this. :-P), drop a line with your opinion about this matter.
I practically never do console stuff, so I never use cout.

Also... does wcout even work? I recall a thread in which I was helping a guy output Unicode to the console and wcout was totally useless. I ended up having to use WinAPI calls to get Unicode output working.

I don't ever use std::string except in quick one-off programs. Generally my programs use whatever string class comes with the widget lib I'm using (e.g. wxString) since they tend to be more Unicode aware/friendly.

When not dealing with widgetry libs (like when I'm working on games or the like), I have my own string class I wrote that's part of a larger library I'm working on. It is very Unicode friendly.

So yeah... Unicode ftw.
I have had no issues with wcout when writing console apps and compiling with Visual Studio 2008 Pro. So yes, it works as far as I care.

Thanks for dropping by. :-)
Maybe old versions of VS didn't work... I think I had VS2002 or something. I can't remember.

Oh well. Glad to hear it works now. Although cout should still at least output UTF-8 by default. Unicode should always be the default.
Whatever you are using is non-standard (and amazing that it works so nicely). The C++ standard makes wcout a no-op, meaning that whatever you pass it is narrow()ed to a char and output that way.

POSIX systems have supported UTF-8 for a long time now.
Windows systems tend towards UTF-16.

Neither is more or less correct. The problem we have with Unicode adoption is that it is a bigger can of beans than I think you are aware of, in addition to hysterical raisins.

IMHO, I think taking things as UTF-8 is better for the standard streams than UTF-16 -- it interoperates much better with legacy stuff and avoids messy endianness issues.

However, in my application I prefer UTF-16 (BMP only) or UTF-32. Alas.
Yes, it does work ok (wcout). I was not aware that it was non-standard. Does this mean that console applications are doomed to present ANSI text only? Or am I missing something here? I mean, if wcout is not supposed to work properly, does this mean that no console applications can show, for example, text in Japanese?

You may be right about me not knowing all the problems in adopting Unicode. This is why I started this thread. To understand better.

As for UTF-8, no surprise it works better with legacy stuff. :-) After all, it has been designed to be backwards-compatible with ASCII.

Now that you mention endianness, do you (or anybody else) know which hardware works with big endian nowadays? I know Intel uses little endian, and Motorola uses (or used some time ago) big endian, but now that Mac uses Intel, should we care? BTW, I am under the impression that all Intel machines use little endian. Correct me if I'm wrong.
The problem we have with Unicode adoption is that it is a bigger can of beans than I think you are aware of


I don't see what's so big about it.

As far as I understand it, all modern operating systems use Unicode internally anyway, so the output gets converted to Unicode at some point down the line. It seems like there'd be a bigger fuss involved in converting back and forth between umpteen different code pages rather than just using Unicode all the time.

Besides, other languages (Java, for instance) don't have a problem with it. I don't see why it's such a big deal for C++.

ANSI and code page support is just a legacy thing nowadays. Things would get a whole lot simpler if we could just scrap it all and do everything in Unicode. Of course that's not an option due to the massive number of files/programs/etc that would become broken as a result. That's the curse of backwards compatibility.


But I recall getting into a similar discussion with helios, who brought up some somewhat convincing points, none of which I can recall at this moment. I think the general idea was "simpler is better; if you want to complicate it, you can do it through code". Part of me agrees with that... but eh.

EDIT: correction, it was helios, not Bazzy. My mistake.
Odd. That does sound like something I would say, but I can't recall ever saying anything like that.

Does this mean that console applications are doomed to present ANSI text only? Or am I missing something here? I mean, if wcout is not supposed to work properly, does this mean that no console applications can show, for example, text in Japanese?
It's perfectly possible to output Japanese to a console in Windows. You just have to output Shift JIS and have the user set things up properly. Unfortunately, if I ever catch you doing that, I'll stab you.

It's kind of a conflict. The Windows console is not designed for Unicode, but sending anything that isn't UCS is problematic. IMO, if you need Unicode, don't use the console; that's what GUIs are for. What I've been doing lately is sending, say, error messages to the console in UTF-8, and praying that the console understands it. If it does, great; if it doesn't, at least the user can redirect to a file and read it with a program that does understand UTF-8.

I don't see why it's such a big deal for C++.
Because, technically speaking, Unicode is not portable. The basic character set has to be representable in chars, one character per char. That's why we have wchar_t (which has its own problems as well).

ANSI and code page support is just a legacy thing nowadays.
ANSI isn't legacy! It's the lower 8 bits of UCS! Not only does that cover most of the West's needs (English, Spanish, Portuguese, French, Italian, German, and probably a few others. The first four alone are basically the entire American continent), but it can also easily be extended to UCS-2 or -4 if that's not enough.
I do agree that anything other than UCS is legacy, though.
Ok, so yes, the console is damned in the ANSI world. :P
I am wondering: if it ever happens that they force us to use (for example) the Unicode version of cout (wcout), will they rename wcout to cout to simplify usage and let us press fewer buttons on the keyboard? ^^ I hate this special w in every Unicode name. -.-
Because, technically speaking, Unicode is not portable.


I don't buy this argument. In fact I'd argue the exact opposite.

Unicode is commonplace pretty much everywhere, even on portable devices.

There's always going to be a portability vs. functionality tradeoff here. In order to have language features, you need to make assumptions about the base machine. C++ already makes several assumptions (has a console, has a string-based file system, etc.), all of which are theoretically not portable to some extremely obscure system. The way I see it, Unicode should just be added to the list.

I mean come on... it's 2010. Unicode has been the standard for years. Over a decade now.

Granted, there might be some obscure or specific-purpose machine that doesn't work well with Unicode, but anyone targeting that machine should be aware of it and should adjust their code appropriately. It's like I said before... Unicode should be the default. If you don't want it / can't use it, then you should have to explicitly work around it --- not the other way around.
I actually meant UCS, not Unicode, in my previous post. I often use them interchangeably, which is of course a mistake.

The language definition can't assume any particular code page because that would make the language less portable. Even TC++PL says that it's not safe to assume that alphabetic characters are adjacent. While right now it seems that UCS will never disappear, who knows what might happen tomorrow. If you want to assume UCS in your code, that's fine, but the language definition can't afford it.

Java, on the other hand, can. Because it's running on a controlled platform, it's free to make all the assumptions it wants. As long as the host can run the implementation, there's no problem.

(And C++ is also more than a decade old.)
Unicode is not ubiquitous. The "Unicode Standard" has existed only a very short time in computer years, and it has had some significant changes through that time.

Current versions of Java use a modified UTF-16. Previous versions use "Modified UTF-8" (which really should have been named "Modified CESU-8") for all internal storage. You can get it to give you UTF-16, but that is it. Oh, and it only handles a relatively small subset of code point classes properly. Alas.

And just saying "I've got Unicode" means nothing, because Unicode does not map characters to code points. So when a C++ techie says "wchar_t" can handle Unicode characters, he is wrong. It can handle Unicode code points (...maybe. The size of wchar_t is not prescribed. On Windows 4.x+ it is 16 bits. On Linux, it is usually 32 bits. Elsewhere, it can be anything).

Sure makes handling Unicode a breeze. So you guys must be right.

So prove it to me. Make me a console program that outputs Japanese or Russian to my Windows console using standard C++ and wcout. (Oh, I'd like it to work in Linux too.) The STL locale library is, quite simply, broken. Unless you rely upon an implementation-specific extension, all you can do is output ASCII or UTF-8.


I will agree that it sure would be nice if everyone implemented Unicode simply. The problem is simply that Microsoft et al. have been working on internationalization issues for decades, and the system they use has worked. Now everyone is jumping on different versions of the Unicode bandwagon and claiming all kinds of nonsense and producing all kinds of library incompatibilities which programmers like to war about.

At least if you transform your text into a proper UTF, I can read it. I don't care how you do it; as long as you do, I can read it.

The issues are sufficient that the people experienced enough to be working on the C++0x standard are waiting to learn more before making a hard decision about adding strong internationalization protocols to C++. Until then, we are all lost in various versions of iconv and ICU and wcstombs(), etc. Simple to use, but complex to grok. Oh, I guess that makes it complex to use after all...
I use UTF-8 when I need Unicode, but we mostly stick to the ASCII subset of UTF-8.