Wow. Thanks for the new replies.
@writetonsharma
In my last post, I ended up with a solution that works only on Windows: explicitly set the code page to cp866 (for both input and output, obviously) and use AnsiToOem() on all character sequences. I don't mind that as a solution, but it would be better if this could also work on Linux.
I'm not sure what you mean by "remove mbcs from the settings"... the compiler settings? My current command line, as shown in Visual Studio's project properties, is:
/Od /D "WIN32" /D "_DEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /FD /EHsc /MDd /Fo"Debug\\" /Fd"Debug\vc90.pdb" /W3 /nologo /c /Zi /TP /errorReport:prompt
There is a setting for "Character set" which is set to "Use Unicode Character Set".
I tried changing to _tmain(), but the results are the same. Using TCHAR (after including <tchar.h>, as Duoas suggested) instead of wchar_t doesn't make any difference. wcout still breaks at wide characters, and cout still prints only the memory address.
Here's the code I tried:
#include <tchar.h>
#include <cstring>
#include <iostream>
#include <locale>
#include <windows.h>
#include <cstdlib>
using namespace std;

int _tmain() {
    UINT oldcodepage = GetConsoleOutputCP();
    char s1[] = "я";
    char s2[] = "\321\217";
    if (!strcmp(s1, s2)) {
        cout << "Strings match. File compiled as UTF-8.";
    } else {
        cout << "Strings \"" << s1 << "\" and \"" << s2 << "\" DO NOT match. File compiled as ANSI.";
    }
    cout << endl << oldcodepage << endl;
    cout << "Текст на кирилица" << endl;

    SetConsoleOutputCP(866);
    cout << "Текст на кирилица" << endl;
    char example866[] = "Текст на кирилица";
    AnsiToOem(example866, example866);
    cout << example866 << endl;

    SetConsoleOutputCP(65001);
    cout << "Текст на кирилица" << endl;
    char exampleUTF8[] = "Текст на кирилица";
    AnsiToOem(exampleUTF8, exampleUTF8);
    cout << exampleUTF8 << endl;

    SetConsoleOutputCP(1251);
    cout << "Текст на кирилица" << endl;
    char example1251[] = "Текст на кирилица";
    AnsiToOem(example1251, example1251);
    cout << example1251 << endl;

    SetConsoleOutputCP(oldcodepage);
    system("pause");
    return 0;
}
The output of this with the default command prompt font is:
Strings " " and "╤П" DO NOT match. File compiled as ANSI.
866
╥хъёЄ эр ъшЁшышЎр
╥хъёЄ эр ъшЁшышЎр
Текст на кирилица
Oaeno ia ee?eeeoa
Т??aa на ??a?л??а
╥хъёЄ эр ъшЁшышЎр
Текст на кирилица
and the output with Lucida Console is:
Strings " " and "╤П" DO NOT match. File compiled as ANSI.
866
╥хъёЄ эр ъшЁшышЎр
╥хъёЄ эр ъшЁшышЎр
Текст на кирилица
����� �� ��������
����� �� ��������
Текст на кирилица
’ҐЄбв ЄЁаЁ«Ёж
@Disch
When I open up the source file in binary view, "я" is indeed "D1 8F". The file is indeed UTF-8. It's just that the compiler doesn't compile it as UTF-8, probably because I use "char" and not wchar_t. But like we saw earlier, using wchar_t doesn't really help, since there's no working (standard) way to really output a wchar_t sequence.
@Duoas
I hate to sound like an idiot, but could you provide some sample code with any of those libraries? Nothing special, just a simple example like those above where you take an ANSI string with a Cyrillic character in it, then convert it and output it.
BTW, as far as using libraries goes, I've been using iconv (http://www.gnu.org/software/libiconv/) from within PHP, but when I just tried that now in C++... I'm not sure exactly what to include. In order to include <iconv.h>, I'd first have to add such a file to my compiler's include directory (right?), and I see no file with that name there, so there's nothing to include. I'm not holding out for iconv though, so if ICU, utf8cpp or any other library does the trick, so be it.
I prefer to avoid using libraries, but when there's no standard way of doing something (even if you'd think it should be in the standard, as in this case), using libraries is of course acceptable.