UTF-8 in command prompt (console)

Apr 11, 2009 at 1:00pm

Hello.

First off, I'm using Microsoft Visual Studio 2005 on Windows Vista Ultimate x64 SP1.

I want to make a console program that has a UI in my native language (Bulgarian) and alphabet (Cyrillic). That is, it shows prompts in Cyrillic, accepts and processes Cyrillic text.

I am able to do so by using the AnsiToOem() function from <windows.h>. However, as far as I'm aware, this function is dependent on the locale of the OS, and if that locale does not support Cyrillic (as is the case with the default English one), the text will be gibberish.

I've had experience with Unicode and UTF-8 before in other languages, so I know it supports Cyrillic and pretty much every alphabet on the planet. I've read a few times that there is a way to get UTF-8 printed in the console. But how? I've never seen a working example of that. I've tried printing a wchar_t array on wcout, but nothing sent to wcout gets printed that way, and I have no idea why. Printing the wchar_t array on cout results in a number (I bet it's the memory address of the array, or, if you dereference it with a "*", the character code).

Example code (not displaying anything but the pause message):

#include <iostream>
#include <locale>
#include <windows.h>

using namespace std;

int main() {
	wchar_t example[] = L"Текст на кирилица";
	wcout << example << endl;
	system("pause");
	return 0;
}

Another example (displaying a 1058, which I assume is the code for "Т"):

#include <iostream>
#include <locale>
#include <windows.h>

using namespace std;

int main() {
	wchar_t example[] = L"Текст на кирилица";
	cout << *example << endl;
	system("pause");
	return 0;
}

And a working, but dependent on locale example:

#include <iostream>
#include <locale>
#include <windows.h>

using namespace std;

int main() {
	char example[] = "Текст на кирилица";
	AnsiToOem(example, example);
	cout << example << endl;
	system("pause");
	return 0;
}

This is just for displaying text... I think I can take user input through the same steps as for output if I have to (as I do with AnsiToOem()), but displaying text is surely the first step.

Note: please forgive the system("pause") calls... but this is just testing anyway. I wouldn't use that in production programs.

Apr 11, 2009 at 3:34pm

Disch (13742)

I don't really know the answer to this problem... but here's my idea:

wchar_t example[] = L"Текст на кирилица" <--- this might not be Unicode.. depending on how the compiler compiles the string (or how your IDE stores the string to disk). IIRC VS does not save UTF-8 files, so your Cyrillic is likely stored according to a locale-based setting on the file, and not really as Unicode. The 'L' prefix might just be widening the characters to wchar_t without converting them to Unicode (so all the codepoints in the string would be < U+100)

try this to test:

  wchar_t example[] = L"\x043B";
  wcout << example << endl;

That should output the л character (U+043B).

If that works the solution here is to either explicitly input UTF-16 encoded numerical values with escape characters (as in the above example) or convert the strings to wchar_t at runtime, rather than simply casting them. I believe that can be done with the "mbstowcs" function in <cstdlib>

EDIT - but even with mbstowcs, the string would have to be UTF-8 encoded -- which again probably isn't the case here. And that's assuming that's how those functions work. Honestly "multibyte string to wide character string" are awfully generic terms so who knows if they're standardized to convert between Unicode encodings or not (I always thought the standard libs fell short in this department)
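
To make the "UTF-8 string to wide characters" step concrete, here is a minimal sketch of what such a conversion has to do, independent of mbstowcs() and of any locale. decode_utf8 is a made-up helper name, not a standard function; it assumes a C++11 compiler (for char32_t and nullptr) and does only basic checking, replacing malformed sequences with U+FFFD rather than rejecting overlong forms or surrogates.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: decode a null-terminated UTF-8 string into Unicode
// code points. Malformed sequences become U+FFFD (the replacement character).
std::vector<char32_t> decode_utf8(const char* s)
{
    assert(s != nullptr);
    std::vector<char32_t> out;
    while (*s != '\0') {
        const unsigned char lead = static_cast<unsigned char>(*s++);
        char32_t cp;
        int extra;  // number of continuation bytes the lead byte announces
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); continue; }  // stray continuation or invalid lead

        bool ok = true;
        for (int k = 0; k < extra; ++k) {
            const unsigned char next = static_cast<unsigned char>(*s);
            // Continuation bytes look like 10xxxxxx; the terminator fails this test too.
            if ((next & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (next & 0x3F);
            ++s;
        }
        out.push_back(ok ? cp : char32_t(0xFFFD));
    }
    return out;
}
```

Given the two bytes D0 BB (the UTF-8 form of л), this yields the single code point U+043B, which is the value the wchar_t string would need to hold.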

Last edited on Apr 11, 2009 at 3:37pm

Apr 11, 2009 at 4:54pm

boenrobot (33)

Thanks for the reply.

Unfortunately, encoding the letter doesn't seem to do the trick.

#include <iostream>
#include <locale>
#include <windows.h>
#include <cstdlib>

using namespace std;

int main() {
	wchar_t example[] = L"\x043B";
	wcout << example << endl;
	system("pause");
	return 0;
}

Doesn't output anything.

If I use this escape form (even without an L in front of the quotes) in a "char" variable, I get a "Too large for a character" error at compile time. If I write "л" directly in a char array, and then use mbstowcs() to place it in a wchar_t array, like so:

#include <iostream>
#include <locale>
#include <windows.h>
#include <cstdlib>

using namespace std;

int main() {
	char exampleC[] = "л";
	wchar_t example[3];
	mbstowcs(example, exampleC, 1);
	wcout << example << endl;
	system("pause");
	return 0;
}

I get "ы" as output, which isn't right (obviously). Depending on the size of "example", I get different characters after that. Increasing the third parameter doesn't make much difference either, other than eliminating the extra character (i.e. I get just the wrong "ы").

I tried the same stuff when the file is saved as UTF-8 before compilation, and the results are the same. Like I said, I use Visual Studio 2005... and actually, I just got myself 2008, just to see if it makes a difference, and it doesn't. Not with its default settings at least. Any idea as to what setting(s?) I should tweak to make it happen?

I once saw this WideCharToMultiByte() function (http://msdn.microsoft.com/en-us/library/aa450989.aspx) mentioned as related to this issue, but how on earth is THAT one used? And is it the "silver bullet", so to speak?

[rant]I see no examples on the MSDN page, and every other example I've been able to find on the web seems to be just about calculating the difference in bytes between a Unicode string and the normal string. I mean, OK, I get that Unicode strings are bigger, but I still want to output them somehow. OK, it's subject to data corruption ("buffer overrun") if used unwisely, so show an example of wise use.

As a side note: why does every internationalization example (if available at all) still use Latin letters? There are never problems with those, so how is a user supposed to notice whether the thing is working or not?[/rant]
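
On the WideCharToMultiByte() question: the usual pattern, as far as I know, is to call it twice, first with a null output buffer to ask how many bytes are needed, then again with a buffer of exactly that size, which is what avoids the overrun. A minimal sketch under those assumptions follows. to_utf8 is a hypothetical helper name; the non-Windows branch is a hand-written fallback (assuming wchar_t holds whole code points there) so the snippet compiles elsewhere, and neither branch validates its input.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: convert a wide string to a UTF-8 encoded std::string.
std::string to_utf8(const std::wstring& wide)
{
    if (wide.empty())
        return std::string();
#ifdef _WIN32
    assert(wide.size() <= static_cast<std::size_t>(INT_MAX));
    const int wideLen = static_cast<int>(wide.size());

    // Call 1: no output buffer, size 0. The return value is the byte count needed.
    const int needed = WideCharToMultiByte(CP_UTF8, 0, wide.data(), wideLen,
                                           NULL, 0, NULL, NULL);
    if (needed <= 0)
        return std::string();

    // Call 2: the buffer is exactly as large as call 1 reported.
    std::string out(static_cast<std::size_t>(needed), '\0');
    const int written = WideCharToMultiByte(CP_UTF8, 0, wide.data(), wideLen,
                                            &out[0], needed, NULL, NULL);
    out.resize(written > 0 ? static_cast<std::size_t>(written) : 0);
    return out;
#else
    // Fallback sketch: encode each code point by hand.
    std::string out;
    for (std::size_t i = 0; i < wide.size(); ++i) {
        const unsigned long cp = static_cast<unsigned long>(wide[i]);
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
#endif
}
```

The result is a plain byte string, so it only displays correctly if whatever receives it (console, file viewer) is actually interpreting bytes as UTF-8.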

Last edited on Apr 11, 2009 at 5:01pm

Apr 11, 2009 at 5:25pm

Disch (13742)

I'm actually very surprised that the straight encoding didn't work... I figured if anything would work that would.

I'd test these ideas myself if I had a Windows machine handy... but since I don't, all I can really do is continue to throw out ideas that may or may not help =(

I get "ы" as output

This surprised me a bit. How about this:

char exampleC[] = "л";
wchar_t example[3];
mbstowcs(example, exampleC, 1);  // if you were to "wcout << example" now you'd get ы
// so instead...
cout << (unsigned)example[0];

what number does that print?

Unless there's some really, really funky locale crap going on, it should be 1099 (0x44B ... ы = U+044B). If it's < 256 then the problem might be that you need to mess with some locale settings to switch it over to Unicode or something. I'd have to do a bit more research as to how to actually do that though =x

I wish I could test this myself to help you out more. Sorry about that. Hopefully someone else might have an answer here, or maybe we can figure this out in the meantime.

Apr 11, 2009 at 5:45pm

boenrobot (33)

Another surprise... it's neither... it's 235.

I know where to mess with the locale (from Windows that is... but not from the program itself), but as far as the Windows setting says, the locale's codepage is applied for programs that do not support Unicode. So the question now becomes "How to make a console program support Unicode, so that it doesn't use that setting?" or at least "How do I manually switch to another locale and/or codepage within the program, and switch it only for that program?".

In Linux (or is it Mac?), do you adjust the compiler or the source for that? Or haven't you played with this (or Unicode in general) on your OS before? If it's the compiler, what kind of settings do you look for? Even without exact names, I'd at least have an idea what to look for and try.

Last edited on Apr 11, 2009 at 6:04pm

Apr 11, 2009 at 6:05pm

Disch (13742)

235 is indeed < 256.

This tells me that nothing Unicode related is going on. mbstowcs is instead [poorly] converting the string and/or wcout/cout is outputting text as per your system's locale setting. Since you probably have your system set to use Cyrillic, you're getting Cyrillic characters for values over 0x7F (but someone on a system with a different locale setting running the exact same .exe might get Japanese characters or what have you). But again, if this really were the case, I would imagine that you should just be able to cout a Cyrillic string and have it display properly. Gah!
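
One reading that fits the numbers so far (an assumption, not something verified on that machine): the compiled string holds the Windows-1251 byte for "л", which is 0xEB = 235; mbstowcs() in the default "C" locale just widens that byte; and the console, using code page 866, draws byte 0xEB as "ы". A small sketch of the two mappings, with hypothetical function names and covering only ASCII, А..я and Ё/ё:

```cpp
#include <cassert>

// Hypothetical helper: map one Windows-1251 byte to a Unicode code point.
// Bytes outside ASCII, А..я and Ё/ё are reported as U+FFFD in this sketch.
unsigned cp1251_to_codepoint(unsigned char b)
{
    if (b < 0x80)  return b;                    // ASCII is shared
    if (b >= 0xC0) return 0x0410 + (b - 0xC0);  // А..я is one contiguous block
    if (b == 0xA8) return 0x0401;               // Ё
    if (b == 0xB8) return 0x0451;               // ё
    return 0xFFFD;
}

// Hypothetical helper: the same for code page 866, where the Cyrillic
// alphabet is split into two blocks with box-drawing characters in between.
unsigned cp866_to_codepoint(unsigned char b)
{
    if (b < 0x80)               return b;
    if (b <= 0xAF)              return 0x0410 + (b - 0x80);  // А..п
    if (b >= 0xE0 && b <= 0xEF) return 0x0440 + (b - 0xE0);  // р..я
    if (b == 0xF0)              return 0x0401;               // Ё
    if (b == 0xF1)              return 0x0451;               // ё
    return 0xFFFD;                                           // box drawing etc., not covered
}

// The values seen in this thread: the same byte means two different letters.
void check_thread_values()
{
    assert(cp1251_to_codepoint(235) == 0x043B);  // л
    assert(cp866_to_codepoint(235) == 0x044B);   // ы
}
```

So a value of 235 does not have to mean the conversion scrambled anything; it may simply be the original single byte passed through untouched and then drawn under a different code page.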

I'm currently on Linux, and so far all my experience with the console has indicated that it accepts UTF-8 strings with no conversion necessary.

What gets me is that I know for a fact that Windows uses Unicode internally (I've made several Unicode friendly Windows programs in the past -- just never anything that outputs to the console) -- so I'm dumbfounded as to why the console doesn't want to accept it. Only other thing I can think of would be to #define UNICODE before all of your includes and then try all these tests again to see if you get different results.

As for explicitly changing the locale setting for your program -- I haven't a clue as to what you'd need to call for that (if that's even an option). I don't think it'd be anything related to any standard libs, though -- it'd be some function in WinAPI somewhere.

This is actually one of my big frustrations with C++. Standard libs are just totally inadequate for anything other than basic Latin character output.

Apr 11, 2009 at 6:40pm

boenrobot (33)

Defining UNICODE before all includes doesn't make any difference, unfortunately.

Like you, I'm most stunned that

#define UNICODE

#include <iostream>
#include <locale>
#include <windows.h>
#include <cstdlib>

using namespace std;

int main() {
	wchar_t example[] = L"\x043B";
	wcout << example << endl;
	system("pause");
	return 0;
}

Doesn't output anything.

For all the other examples, you might think that the character isn't really treated as a "> 256" character, which could be a reason for wcout not to output anything (or to output it wrong), but here it's explicitly encoded as such a character, using only "< 256" characters in the source. If wcout is at all supposed to output wide characters (as its name suggests), it should be able to output something with that.

I realize a person with other locale settings would see other (still faulty in this case) characters. Being locale independent is the reason I started this topic to begin with.

Perhaps the problem is wcout itself... in what cases would it decline to output a character sequence? Outputting a wchar_t sequence?!? Doesn't sound right, and besides, it does so with the last example (though it prints the wrong character, because apparently, mbstowcs() doesn't really convert it properly). If I try to print a char[] sequence with "л" in it, the results in both cout and wcout (without mbstowcs()) are "ы".

Another oddity. If I have only Latin characters in the wchar_t array, I get output on wcout. If I have mixed content, output stops at the first non-Latin character (and, I guess, at any non-ASCII character) in the sequence, so for example:

#define UNICODE

#include <iostream>
#include <locale>
#include <windows.h>
#include <cstdlib>

using namespace std;

int main() {
	wchar_t example[] = L"Latin text на кирилица and more latin";
	wcout << example << endl;
	system("pause");
	return 0;
}

outputs "Latin text ". Notice there's not even a new line as per "endl"! It seems it gets corrupted or something at that point, so it stops doing anything from that point on. Replacing "на кирилица" with "\x043B" gives the same results. As if (gasp) it doesn't support wide characters.

Last edited on Apr 11, 2009 at 7:08pm

Apr 11, 2009 at 7:19pm

Disch (13742)

After a bit of googling I came across the following:

http://msdn.microsoft.com/en-us/library/ms686036(VS.85).aspx -- SetConsoleOutputCP
http://msdn.microsoft.com/en-us/library/dd317756(VS.85).aspx -- code page ids
http://msdn.microsoft.com/en-us/library/ms683169(VS.85).aspx -- GetConsoleOutputCP

I suppose the way to do this would be:

int main()
{
  UINT oldcodepage = GetConsoleOutputCP();
  SetConsoleOutputCP(65001);  // for UTF-8 ... or use ?1251? for cyrillic
    // -- there appear to
    //  be many different cyrillic options -- not sure which one is desired.
    // UTF-8, at least, is always consistent

  // ... output crap here

  SetConsoleOutputCP(oldcodepage);
  return 0;
}

UTF-8 is probably more easily portable, but then you have to make sure your text is UTF-8 encoded (and since cyrillic characters are multi-byte in UTF-8... things like std::string::length() and strlen() will return values higher than the actual number of characters).

Apparently you might also need to worry about SetConsoleCP, too for the input -- so you might have to look that up as well.

blech. Hope that works for you.

Last edited on Apr 11, 2009 at 7:21pm

Apr 11, 2009 at 8:43pm

boenrobot (33)

Interesting... the results are unexpected, yet closer to something that actually works.

Here's the code I tested:

#include <iostream>
#include <locale>
#include <windows.h>
#include <cstdlib>

using namespace std;

int main() {
	UINT oldcodepage = GetConsoleOutputCP();

	cout << oldcodepage << endl;
	cout << "Текст на кирилица" << endl;

	SetConsoleOutputCP(CP_UTF8);
	cout << "Текст на кирилица" << endl;
	SetConsoleOutputCP(1251);
	cout << "Текст на кирилица" << endl;

	SetConsoleOutputCP(oldcodepage);
	system("pause");
	return 0;
}

(I double checked that CP_UTF8 is 65001)

The output of this is

866
╥хъёЄ эр ъшЁшышЎр
Oaeno ia ee?eeeoa
╥хъёЄ эр ъшЁшышЎр

and wcout, regardless of where it is put, breaks in the same way.

I'm kinda surprised by my codepage, since I'm Bulgarian with a Bulgarian locale, not a Russian one. Windows-1251 is a more popular choice for Bulgarians, especially for web devs.

I recalled reading something about the "Lucida console" font (of the command prompt) being the only UTF-8 aware font, so I gave that a try. Surprisingly, with it, UTF-8 did not work, but Windows-1251 did. That is, the output was:

866
╥хъёЄ эр ъшЁшышЎр
��
Текст на кирилица

This doesn't really make Windows-1251 a better choice though. With it, the program's output is different depending on the command prompt font. With cp866, at the very least, you get the same crappy output in both command prompt fonts....

Hmmm.... well in that case, I suppose I could manually set the codepage to cp866, and then use AnsiToOem() on that. That would work, and make it locale independent, since the codepage would be explicitly set. Now the only question in this case becomes whether there's a way to make such a program portable, i.e. compilable for Linux (and still using the same texts). I guess Linux is a sacrifice I can make, but it would be nice if it's included as well.
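
For the portability question, the Windows-1251 to cp866 step that AnsiToOem() appears to perform on this system can be written by hand, which removes the dependency on both <windows.h> and the system locale. A minimal sketch, assuming the input really is Windows-1251 and only needs ASCII, А..я and Ё/ё; cp1251_to_cp866 is a made-up name, and every other high byte is replaced with '?':

```cpp
#include <cassert>

// Hypothetical stand-in for AnsiToOem(), hard-wired to Windows-1251 -> cp866.
// Converts the null-terminated string in place (both encodings are one byte
// per character, so the length never changes).
void cp1251_to_cp866(char* s)
{
    assert(s != nullptr);
    for (; *s != '\0'; ++s) {
        unsigned char b = static_cast<unsigned char>(*s);
        if (b >= 0xC0 && b <= 0xEF)
            b = static_cast<unsigned char>(b - 0x40);  // А..п -> 0x80..0xAF
        else if (b >= 0xF0)
            b = static_cast<unsigned char>(b - 0x10);  // р..я -> 0xE0..0xEF
        else if (b == 0xA8)
            b = 0xF0;                                  // Ё
        else if (b == 0xB8)
            b = 0xF1;                                  // ё
        else if (b >= 0x80)
            b = '?';                                   // no mapping in this sketch
        *s = static_cast<char>(b);
    }
}
```

On Linux the terminal usually expects UTF-8 rather than cp866, so there the same table idea would have to target UTF-8 instead.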

Last edited on Apr 11, 2009 at 8:54pm

Apr 12, 2009 at 1:37am

Disch (13742)

Well the important thing is to match the encoding of the compiler. I'm sure UTF-8 would work with that font, as long as the text you're giving it is UTF-8 encoded. From the looks of it, it doesn't appear to be.

A quick way to test could be:

const char testa[] = "я";
const char testb[] = "\321\217";

// should output "я" only if the way the source file is encoded matches the 
//   console output encoding
cout << testa;

// should output "я" if UTF-8 output:
cout << testb;

You could also open the source file up in a hex editor (VS comes with one built in -- just open the file with the "binary editor") to see how the text is encoded. The я character in the source should appear as "D1 8F" in the hex editor if the file really is UTF-8 encoded.
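
If a hex editor isn't handy, the bytes the compiler actually stored for a literal can also be dumped at run time. A minimal sketch; hex_bytes is a hypothetical helper, not part of any library:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: render the bytes of a null-terminated string as
// uppercase hex pairs separated by single spaces, e.g. "D1 8F".
std::string hex_bytes(const char* s)
{
    assert(s != nullptr);
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (; *s != '\0'; ++s) {
        const unsigned char b = static_cast<unsigned char>(*s);
        if (!out.empty())
            out += ' ';
        out += digits[b >> 4];    // high nibble
        out += digits[b & 0x0F];  // low nibble
    }
    return out;
}
```

Printing hex_bytes("я") should give D1 8F if the literal ended up UTF-8 encoded in the executable, FF if it ended up as Windows-1251, and EF if it ended up as cp866.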

*edit*
Another way to test I just tried out:


    if(!std::strcmp("я","\321\217"))   // #include <cstring>
        std::cout << "Strings match";  // if this doesn't print -- source file is not UTF-8 encoded

*/edit*

The thing is.. the C/C++ languages don't tend to be very UTF-8 friendly (not unless you use an external lib for strings -- or write your own) because each codepoint is a variable number of "char"s (anywhere from 1-4... Cyrillic codepoints seem to be all 2 bytes wide). Note that while 'testb' in the above is really only a single character in UTF-8 encoding, it's actually two "char"s. This introduces a few other problems related to string length calculations ( strlen(testb) would return 2, even though the real length is 1 ).

If you do decide to go with UTF-8 (now or in the future), here's a very simple length function:

int utf8strlen(const char* p)
{
  int l;
  for(l = 0; *p != 0; ++p)
  {
    if( (*p & 0xC0) != 0x80 )
      ++l;
  }
  return l;
}

But go with whatever works. Right now it's looking like Windows-1251 is the way to go. I'm glad to see a solution!

Last edited on Apr 12, 2009 at 1:49am

Apr 12, 2009 at 3:59am

Duthomhas (13276)

Sorry I missed this topic, but you seem to have covered it all. MS "code pages" are a poor answer to international stuff, but we've got to live with it...

Yes, the Lucida Console font is the only MS fixed-width Unicode typeface to use. Unfortunately, programming the console in anything but English is still problematic.

Wikipedia has a good read on dealing with "code pages":
http://en.wikipedia.org/wiki/Code_page

Remember to set your input CP as well as your output CP, and to restore them before your application terminates.
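
The set-and-restore advice can be wrapped in a small RAII class so the restore happens even on an early return. A minimal sketch, assuming C++11: ConsoleCodePageGuard is a hypothetical name, the Win32 calls used are GetConsoleCP, GetConsoleOutputCP, SetConsoleCP and SetConsoleOutputCP, and outside Windows the class deliberately does nothing and reports success.

```cpp
#include <cassert>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical RAII helper: switch the console's input and output code pages
// for the lifetime of the object and restore the previous ones afterwards.
class ConsoleCodePageGuard
{
public:
    explicit ConsoleCodePageGuard(unsigned codePage)
        : ok_(true)
    {
#ifdef _WIN32
        oldIn_ = GetConsoleCP();         // 0 means the query failed
        oldOut_ = GetConsoleOutputCP();
        ok_ = SetConsoleCP(codePage) != 0 && SetConsoleOutputCP(codePage) != 0;
#else
        (void)codePage;  // nothing to switch outside Windows in this sketch
#endif
    }

    ~ConsoleCodePageGuard()
    {
#ifdef _WIN32
        if (oldIn_ != 0)  SetConsoleCP(oldIn_);
        if (oldOut_ != 0) SetConsoleOutputCP(oldOut_);
#endif
    }

    // Copying would restore the code pages twice, so it is disabled.
    ConsoleCodePageGuard(const ConsoleCodePageGuard&) = delete;
    ConsoleCodePageGuard& operator=(const ConsoleCodePageGuard&) = delete;

    // True if both code pages were switched (always true outside Windows).
    bool ok() const { return ok_; }

private:
    bool ok_;
#ifdef _WIN32
    UINT oldIn_;
    UINT oldOut_;
#endif
};
```

Typical use would be a `ConsoleCodePageGuard guard(65001);` at the top of main(), followed by a check of guard.ok(); the text written afterwards still has to be UTF-8 encoded for the output to come out right.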

A very good UTF-8 library is ICU http://site.icu-project.org/

There are also some nice little UTF-8 handling libraries that various people do:
http://utfcpp.sourceforge.net/
http://www.codeproject.com/KB/string/utf8cpp.aspx
http://www.gnu.org/software/libidn/

Hope this helps.

Last edited on Apr 12, 2009 at 4:00am

Apr 12, 2009 at 6:36am

writetonsharma (1461)

I haven't read the whole thread..
but I want to add a couple of things:

1. if you only want to work on Windows, use TCHAR. It's defined something like this:
#ifdef UNICODE
typedef wchar_t TCHAR;
#else
typedef char TCHAR;
#endif
Use only the functions associated with TCHAR, as they are defined the same way, like _tcscpy etc.

2. to use Unicode in your application you have to build your application as Unicode:
remove mbcs from the settings and instead add UNICODE and _UNICODE. Add both.

3. after adding this your application entry point will change; I don't remember exactly, but I think it's _tmain.
Then you can start writing in your local language. I don't know about console applications, but a Win32 application doesn't work in Unicode until you build it as Unicode. If you don't, then no matter how hard you try, you can't write in local languages.

Apr 12, 2009 at 11:45am

Duthomhas (13276)

1. #include <tchar.h>

;-)

Apr 12, 2009 at 12:55pm

writetonsharma (1461)

oh yes... i missed one. :)

Apr 12, 2009 at 1:11pm

boenrobot (33)

Wow. Thanks for the new replies.

@writetonsharma
In my last post, I end up with a solution that works only for Windows - explicitly set the code page to cp866 (for both input and output, obviously) and use AnsiToOem() on all character sequences. I don't mind that as a solution, but it would be better if this could also work for Linux.

I'm not sure what you mean by "remove mbcs from the settings"... the compiler settings? My current command line, as shown in Visual Studio's project properties, is:

/Od /D "WIN32" /D "_DEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /FD /EHsc /MDd /Fo"Debug\\" /Fd"Debug\vc90.pdb" /W3 /nologo /c /Zi /TP /errorReport:prompt

There is a setting for "Character set" which is set to "Use Unicode Character Set".

I tried changing to _tmain(), but the results are the same. Using TCHAR (after including <tchar.h>, as Duoas suggested) instead of wchar_t doesn't make any difference. wcout still breaks at wide characters, and cout still prints only the memory address.

Here's the code I tried:

#include <tchar.h>
#include <cstring>
#include <iostream>
#include <locale>
#include <windows.h>
#include <cstdlib>

using namespace std;

int _tmain() {
	UINT oldcodepage = GetConsoleOutputCP();
	char s1[] = "я";
	char s2[] = "\321\217";
	if(!strcmp(s1, s2)) {
		cout << "Strings match. File compiled as UTF-8.";
	}else {
		cout << "Strings \"" << s1 << "\" and \"" << s2 << "\" DO NOT match. File compiled as ANSI.";
	}

	cout << endl << oldcodepage << endl;
	cout << "Текст на кирилица" << endl;

	SetConsoleOutputCP(866);
	cout << "Текст на кирилица" << endl;

	char example866[] = "Текст на кирилица";
	AnsiToOem(example866, example866);
	cout << example866 << endl;

	SetConsoleOutputCP(65001);
	cout << "Текст на кирилица" << endl;

	char exampleUTF8[] = "Текст на кирилица";
	AnsiToOem(exampleUTF8, exampleUTF8);
	cout << exampleUTF8 << endl;

	SetConsoleOutputCP(1251);
	cout << "Текст на кирилица" << endl;

	char example1251[] = "Текст на кирилица";
	AnsiToOem(example1251, example1251);
	cout << example1251 << endl;

	SetConsoleOutputCP(oldcodepage);
	system("pause");
	return 0;
}

The output of this with the default command prompt font is:

Strings " " and "╤П" DO NOT match. File compiled as ANSI.
866
╥хъёЄ эр ъшЁшышЎр
╥хъёЄ эр ъшЁшышЎр
Текст на кирилица
Oaeno ia ee?eeeoa
Т??aa на ??a?л??а
╥хъёЄ эр ъшЁшышЎр
Текст на кирилица

and output with Lucida console is:

Strings " " and "╤П" DO NOT match. File compiled as ANSI.
866
╥хъёЄ эр ъшЁшышЎр
╥хъёЄ эр ъшЁшышЎр
Текст на кирилица
��
��
Текст на кирилица
’ҐЄбв ЄЁаЁ«Ёж

@Disch
When I open up the source file in binary view, "я" is indeed "D1 8F". The file is indeed UTF-8. It's just that the compiler doesn't compile it as UTF-8, probably because I use "char" and not wchar_t. But like we saw earlier, using wchar_t doesn't really help, since there's no working (standard) way to really output a wchar_t sequence.

@Duoas
I hate to sound like an idiot, but could you provide a sample code with any of those libraries? Nothing special, just a simple example like those above where you just take an ANSI string with a Cyrillic character in it, then convert it and output it.

BTW, as far using libraries goes, I've been using iconv (http://www.gnu.org/software/libiconv/) from within PHP, but when I just tried that now in C++... I'm not sure exactly what to include. In order to include <iconv.h>, I'd first have to add such a file in my compiler's library (right?), and I see no file named like that in the library, so there's nothing to include. I'm not holding out for iconv though, so if ICU, utf8cpp or any other library does the trick, so be it.

I prefer to avoid using libraries, but when there's no standard way of doing something (even if you'd think it should be in the standard, as in this case), using libraries is of course acceptable.

Last edited on Apr 12, 2009 at 1:26pm

Apr 12, 2009 at 1:35pm

writetonsharma (1461)

/Od /D "WIN32" /D "_DEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /FD /EHsc /MDd /Fo"Debug\\" /Fd"Debug\vc90.pdb" /W3 /nologo /c /Zi /TP /errorReport:prompt

That's perfect.. actually VS 2005 and above have this setting by default. VS 6.0 uses mbcs (multi-byte character set) as default.

Secondly, why are you using char? Use TCHAR.

If it's still giving you problems, upload your basic source, including the .vcproj and .sln files and the input file, somewhere and I will download it and check it for you.

Apr 12, 2009 at 1:43pm

writetonsharma (1461)

And one more thing: just give _tprintf a try.

Apr 12, 2009 at 2:32pm

boenrobot (33)

@writetonsharma
I've sent the whole project folder (without the binary) to your email. It also contains the (failed) attempt with _tprintf().

Thanks.

Last edited on Apr 12, 2009 at 2:44pm

Apr 12, 2009 at 3:18pm

Disch (13742)

So the file is UTF-8... it's just the compiler isn't interpreting it as UTF-8.
*facepalm*

EDIT - blah -- I should read your posts before replying XD. Just realized I told you to do something you already did. Sorry.

Tinkering with wchar_t or tchars won't do a thing for you if you don't solve the underlying encoding problem. All those do is change the size of the character type... you'll still have the same problem when trying to output text. Although using wide chars might be easier in the long run since all your characters will likely have the same byte-length.

Anyway I dug up my Windows machine and have been playing around... I'm getting the same "console output dies as soon as I output Unicode" problem you described earlier. Aye-ya. Will play around a bit more and report back.

Last edited on Apr 12, 2009 at 3:39pm

Apr 12, 2009 at 5:24pm

writetonsharma (1461)

boenrobot:

check your mail, I have sent the reply.

#include <tchar.h>
#include <iostream>
#include <fstream>
#include <windows.h>

using namespace std;

int wmain() 
{

	TCHAR *Buff;

	SetConsoleOutputCP(CP_UTF8);

	wifstream input("desiredOutput.txt",ios::binary);
	if(input.fail())
		return 0;

	input.seekg(0, ios::end);
	long size = input.tellg();
	input.seekg(0, ios::beg);

	Buff = new TCHAR[size + 1];

	input.read(Buff, size);
	Buff[input.gcount()] = 0;	// null-terminate: %s reads up to the terminator
	
	_tprintf(_T("%s\n"),Buff);
	
	delete [] Buff;
	input.close();


	return 0;
}

the trick is to use SetConsoleOutputCP :P
