UTF-8 in command prompt (console)

Apr 15, 2009 at 3:51pm
wait........
It's 32 for the char type; it might be 16 for a wchar_t/TCHAR string. So put \0 at position 32.

*(Buff + (size/2)) = L'\0'; //try this also...
Apr 15, 2009 at 3:57pm
@Disch
1) Same deal. Both have the same problems really. In an advanced program, I'd use it, but for the purposes of getting output into the console, which one we use is irrelevant.

2) This is the only way UTF-8 can work on the console... or at least the only one which has been found so far. If there is a way from within the source, I'd prefer it, but if there isn't, storing in an external file is acceptable.

3) I believe they are UTF-8 encoded, but I'm not sure. If you're right though, this may not be such a minor thing.

4) Yes, we tried that. But we weren't using a file and _tprintf(). Those are what make the difference.

5) One notable difference I saw is that _tprintf() doesn't break down, whereas wcout needs to manually be recovered after being fed a non ASCII character. Additionally, _tprintf() allows us to specify a format for the output as a first argument, which in the above examples is specified as _T("%s\n").
Apr 15, 2009 at 4:13pm
@writetonsharma
YES!!!

Manually putting '\0' at position 32 did the trick. With no BOM in place, the text file is now visible perfectly.

Now the only question is how to set the console font of Windows XP to a unicode aware font. If that works, this would indeed be the best possible solution.

Here's the final working code (when the font is set to Lucida Console, that is):
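#include <cstdio>    // wprintf
#include <cstdlib>   // system
#include <fstream>
#include <iostream>
#include <windows.h>

using namespace std;

int main() 
{
	// Tell the console to interpret output bytes as UTF-8
	SetConsoleOutputCP(CP_UTF8);

	wifstream input("desiredOutput.txt", ios::binary);
	if(input.fail()) 
		return 0;

	// Find the file size
	input.seekg(0, ios::end);
	long size = static_cast<long>(input.tellg());
	input.seekg(0, ios::beg);

	// One extra element for the terminating null character
	wchar_t *Buff = new wchar_t[size + 1];

	input.read(Buff, size);
	Buff[input.gcount()] = L'\0';   // terminate right after what was actually read
	wprintf(L"%ls\n", Buff);        // %ls is the standard specifier for a wide string

	delete [] Buff;
	input.close();

	system("pause");
	return 0;
}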

I have to ask (though I know this isn't what we're shooting for)... does using wprintf and wchar_t make this program nearly portable? "Nearly" because of the SetConsoleOutputCP() call, and whatever other call there may be for switching the font.
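
On the portability question: one way to sidestep wchar_t for the file-reading step (a sketch under my own assumptions, not something tested in this thread) is to read the file as plain bytes with a narrow ifstream, so the UTF-8 sequences are never widened or narrowed. The function name read_file_bytes is hypothetical.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Reads the whole file as raw bytes. No wide-character conversion takes
// place, so UTF-8 sequences arrive exactly as stored on disk.
// Returns an empty string if the file cannot be opened.
std::string read_file_bytes(const std::string& path)
{
    std::ifstream input(path.c_str(), std::ios::binary);
    if (!input)
        return std::string();
    return std::string(std::istreambuf_iterator<char>(input),
                       std::istreambuf_iterator<char>());
}
```

The returned string can then be handed to whichever output call turns out to work on the target console. No manual null terminator is needed, because std::string tracks its own length.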
Last edited on Apr 15, 2009 at 5:57pm
Apr 16, 2009 at 9:01am
oh.. great..

yes, correct.. using wide string functions we can make the program portable.. but I don't know about other complexities.. the same program compiles perfectly on RHEL 5.0 but doesn't display anything.. I tried all the combinations but it is not reading anything from the file.. I think I need to do some extra things... don't know..!!

maybe Disch knows something regarding this.. what do you say, Disch???


did you try setting the font of the console using that function?? or do you want me to try..!!!
Apr 16, 2009 at 9:12am
I would, but even if I succeed, you should still try it on XP (as I can't). If it doesn't work there, there needs to be an alternative we haven't found yet.

Besides... how do I use it? How to get the necessary handles and font descriptions? There is not even a single example on the documentation page.
Apr 16, 2009 at 9:49am
The code is not difficult to use.. I think you haven't used the Win32 APIs, that's why you are saying this.
I will try and post the code..
Apr 17, 2009 at 1:55am
Isn't there a popular library for console stuff? "ncurses" or something? Surely that would be able to support Unicode, I'd hope.

As for why it isn't working @ writetonsharma... I have no idea. In fact when I tried compiling and running some of the samples boenrobot gave on my windows machine, I got different output than he did, and couldn't get proper output in any configuration I tried.

Makes me wonder why MS even bothered adding that UTF-8 option to SetConsoleOutputCP when it clearly doesn't work. Maybe it's just to tease us, or make it look like they support something they actually don't. =(
Last edited on Apr 17, 2009 at 1:56am
Apr 17, 2009 at 4:19am
hmmm...ok..

Actually unicode is not for console..
unicode is mostly used in GUI applications..

like multi-lingual office suites, web browsers, etc etc.. why one want to use unicode on console is a little strange.. and thats why i dont think there will be any popular libraries for console..
though all the GUI libraries support unicode, VC++, QT etc etc..

Apr 17, 2009 at 11:12am
Well, the libraries that Duoas gave at the first page seem promising... the only way to cross platformly support UTF-8 in a console would be to use a library that translates a UTF-8 string to a corresponding ANSI and outputs it... isn't that what those libraries do? The thing is... how to use them in Visual Studio? They seem to be "optimized" for Linux (gcc?), with those "make ..." and "configure ..." stuff that is required to install them.

And BTW, yes, you're right. I haven't used the win32 API. My knowledge of C++ is (nearly?) non existent when it comes to anything that is not in the "std" namespace (or in other words - anything that is not in this site's reference).
Apr 17, 2009 at 6:48pm
Actually unicode is not for console..
unicode is mostly used in GUI applications..
[snip]
why one want to use unicode on console is a little strange


Unicode is just a standardized way to represent text consisting of virtually any character, with each character having a unique identifier. Like an "all in one" character set. The alternative to this is to yutz around with locale settings in order to get anything beyond basic Latin characters and symbols. This may not seem like a big deal if you're an English speaker because ASCII has the entire English alphabet, but it's really a big mess.

The reason to use Unicode in a console program is no different than the reason to use Unicode in any other kind of program.

Say you make a simple console program to print a file, that works via commandline:

printfile <filename>

Do you really want 'filename' to just be ASCII? That will make your program unusable (or at least more difficult to use) for printing files that might have foreign characters in the name.

Really, there's little reason not to use Unicode all the time (other than its nonexistent standard lib support).

the only way to cross platformly support UTF-8 in a console would be to use a library that translates a UTF-8 string to a corresponding ANSI and outputs it... isn't that what those libraries do?


I don't know for sure. I'm positive that Windows (and Linux, Mac, and any other OS worth using) use Unicode internally, though. So I don't really think any conversion is necessary because the OS will ultimately have to convert it back to Unicode. This is why I'm so dumbfounded that it's so hard to get Unicode to output to the Win console -- you'd think it'd be easy!

Whether or not conversion is done, though, is another matter. Duoas said that wcout 'narrows' the string you give it before outputting it (*facepalm* then wtf is the point?), so standard libs might be doing some conversion stuff before they hand the data off to the actual OS, which might convert it to something else. I'm starting to think that maybe the way to go is to bypass standard libs completely and stick with OS system calls (but hide them behind an abstract interface so you can port to other platforms by simply writing a new version of that interface). Maybe if WinAPI has SetConsoleOutputCP, there's functions to output to the console that don't use the standard libs (like ConsoleOut()) or something. I'll have to look into that.

</rambling>
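
The "abstract interface" idea above might look roughly like the following. This is only a sketch: TextOutput and StdoutOutput are made-up names, and the stdout backend shown simply passes the bytes through, which is an assumption that suits terminals already expecting UTF-8. A Windows backend would implement the same write() with console API calls instead.

```cpp
#include <cstdio>
#include <string>

// Hypothetical interface: the rest of the program only ever sees this.
class TextOutput
{
public:
    virtual ~TextOutput() {}
    // Writes UTF-8 encoded text; returns true if every byte was accepted.
    virtual bool write(const std::string& utf8) = 0;
};

// One possible backend: hand the bytes to stdout untouched.
class StdoutOutput : public TextOutput
{
public:
    virtual bool write(const std::string& utf8)
    {
        const std::size_t n = std::fwrite(utf8.data(), 1, utf8.size(), stdout);
        std::fflush(stdout);
        return n == utf8.size();
    }
};
```

Porting then means writing one more class derived from TextOutput, while calling code keeps using a TextOutput reference.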


EDIT
-------------------------------------

There are, in fact, WriteConsole and ReadConsole WinAPI functions. I'm willing to bet that SetConsoleOutputCP will actually work with these functions... so UTF-8 is likely possible.

I say use UTF-8 text, and wrap output in a container class. Portability + internationalization + consistent output = win.

Here's a simple idea of what the container class for Windows might look like (but I didn't try to compile this, as I'm not on Windows, so this might not work at all)

class Console
{
public:
  Console()
  {
    hInHandle = GetStdHandle(STD_INPUT_HANDLE);
    hOutHandle = GetStdHandle(STD_OUTPUT_HANDLE);
    uInCP = GetConsoleCP();
    uOutCP = GetConsoleOutputCP();
    SetConsoleCP(65001);
    SetConsoleOutputCP(65001);
  }
  ~Console()
  {
    SetConsoleCP(uInCP);
    SetConsoleOutputCP(uOutCP);
  }

  void Out(const char* text)
  {
    DWORD t;
    WriteConsole( hOutHandle, text, std::strlen(text), &t, NULL );
  }

  // for input do something similar with hInHandle -- too lazy to write that routine

private:
  // to disallow copying
  Console(const Console&);
  Console& operator = (const Console&);
};

//-----------------------------------------------------
//  to use

int main()
{
  Console c;

  char someunitext[] = "Текст на кирилица"  // must be UTF-8 encoded
          //  I hope your compiler doesn't bork this

  c.Out(someunitext);

  return 0;
}


Hopefully that'll work. Try it and see. I'll keep my fingers crossed.

Basically we're having to rewrite cout to be less stupid about Unicode.

*shakes fist at the C++ standard libs*

ANOTHER EDIT:

There are "Unicode" and "ANSI" versions of WriteConsole and ReadConsole (WinAPI does this with lots of its functions). Basically the real functions are WriteConsoleW or WriteConsoleA, and 'WriteConsole' just gets #defined as one of them depending on whether or not UNICODE was defined.

Since we're using UTF-8 and char* here, we actually might want the non-Unicode version. So instead of WriteConsole, you might want to use WriteConsoleA. Try all 3 and see which work and which don't.

Or, you could use the wchar_t version, but I'd avoid that because the width of wchar_t varies between platforms (is it UTF-16? UTF-32? no way to know -- but we can assume to treat char* as always UTF-8)
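
To illustrate the point about widths, here is a small sketch (the function names are hypothetical). The first function reports how many bits wchar_t has on the compiler at hand. The second counts code points in a UTF-8 char string by skipping continuation bytes (those of the form 10xxxxxx), which behaves the same on every platform. It assumes the input is valid UTF-8.

```cpp
#include <climits>
#include <cstddef>

// Width of wchar_t in bits on this compiler (typically 16 on Windows
// and 32 on Linux, but the standard does not fix it).
int wchar_bits()
{
    return static_cast<int>(sizeof(wchar_t) * CHAR_BIT);
}

// Counts code points in a null-terminated UTF-8 string.
std::size_t utf8_code_points(const char* s)
{
    std::size_t count = 0;
    for (; *s != '\0'; ++s) {
        // Continuation bytes look like 10xxxxxx; every other byte
        // starts a new code point.
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```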
Last edited on Apr 17, 2009 at 7:27pm
Apr 17, 2009 at 11:11pm
With some small corrections, this compiled but didn't work: the characters came out garbled, as usual in such cases.

For the sake of completeness, here's the code that compiled:
#include <cstring>
#include <iostream>
#include <locale>
#include <windows.h>
#include <cstdlib>

using namespace std;

class Console
{
public:
	Console()
	{
		this->hInHandle = GetStdHandle(STD_INPUT_HANDLE);
		this->hOutHandle = GetStdHandle(STD_OUTPUT_HANDLE);
		this->uInCP = GetConsoleCP();
		this->uOutCP = GetConsoleOutputCP();
		SetConsoleCP(65001);
		SetConsoleOutputCP(65001);
	}
	~Console()
	{
		SetConsoleCP(uInCP);
		SetConsoleOutputCP(uOutCP);
	}

	void Out(const char* text)
	{
		DWORD t;
		WriteConsole( hOutHandle, text, std::strlen(text), &t, NULL );
	}

  // for input do something similar with hInHandle -- too lazy to write that routine

private:
	UINT uInCP, uOutCP;
	HANDLE hInHandle, hOutHandle;
	// to disallow copying
	Console(const Console&);
	Console& operator = (const Console&);
};

//-----------------------------------------------------
//  to use

int main()
{
	Console c;

	char someunitext[] = "Текст на кирилица";  // must be UTF-8 encoded
          //  I hope your compiler doesn't bork this

	c.Out(someunitext);
	system("pause");
	return 0;
}
Apr 18, 2009 at 12:37am
Doh! forgot to declare the member vars. Knew I was missing something.

Anyway I decided to hook up my Windows machine again to test (I really need another monitor/keyboard/mouse so it isn't such a hassle). After switching to WriteConsoleA instead of WriteConsole it worked just fine for me on Windows 2000, VS.NET 2002.
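
For reference, the change described here amounts to something like the sketch below. It is not the exact code from this thread: out_utf8 is a hypothetical free function, it assumes SetConsoleOutputCP(65001) has already been called as in the Console constructor, and the non-Windows branch exists only so the snippet compiles elsewhere.

```cpp
#include <cstring>
#ifdef _WIN32
#include <windows.h>
#else
#include <cstdio>
#endif

// Writes UTF-8 bytes to the console. Returns true on success.
bool out_utf8(const char* text)
{
#ifdef _WIN32
    HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD written = 0;
    // The "A" variant always takes char data, whether or not UNICODE
    // is defined, so the UTF-8 bytes are passed through as they are.
    return WriteConsoleA(h, text, static_cast<DWORD>(std::strlen(text)),
                         &written, NULL) != 0;
#else
    // Fallback for other platforms (an assumption, not from the thread).
    return std::fputs(text, stdout) >= 0;
#endif
}
```

Note that the WriteConsole family only works on a real console handle, so the Windows branch is expected to fail when output is redirected to a file or pipe.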

But I recall you had a UTF-8 problem:

std::strcmp("я","\321\217")

That was not coming back a match when it should be. If you're still getting that, it has to be a compiler option somewhere. There's no reason it shouldn't be a match if the file is saved as UTF-8 (which I recall you verified).
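
As a sanity check on those byte values, here is a minimal UTF-8 encoder sketch (encode_utf8 is a hypothetical helper, and it does not reject surrogates or out-of-range values). For я, which is U+044F, it yields the two bytes 0xD1 0x8F, which are \321 \217 in octal, so the two strings should indeed compare equal when the compiler reads the source file as UTF-8.

```cpp
#include <string>

// Encodes one Unicode code point (up to U+10FFFF) as UTF-8 bytes.
// Surrogates and out-of-range values are not rejected in this sketch.
std::string encode_utf8(unsigned long cp)
{
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```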

I don't know if this will help, but here's my compiler options as displayed in VS's project settings:


/Od /D "WIN32" /D "_DEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Gm /EHsc /RTC1 /MLd /Fo"Debug/" /Fd"Debug/vc70.pdb" /W3 /nologo /c /Wp64 /ZI /TP


But keep in mind I'm on an older version of VS, so the meaning of my options might not match yours.

--------------------------

While you're looking at that I'll try and make this more cout-like (with << operator and whatnot)


edit:

blech -- having too hard a time trying to derive from ostream. overloading << for const char* and string is simple enough, but that won't cut it for number formatting and stuff. Maybe I'll work more on this tomorrow.
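
One commonly used alternative to deriving from ostream directly is to derive from std::streambuf and hand that to a plain std::ostream, which then takes care of number formatting itself. The sketch below uses a hypothetical CaptureBuf that only collects bytes in a string. In a console wrapper the append calls would be swapped for the console write call (that part is an assumption and untested here).

```cpp
#include <ostream>
#include <streambuf>
#include <string>

// Collects everything written to it in a std::string. A console wrapper
// would forward the bytes to the console instead of appending them.
class CaptureBuf : public std::streambuf
{
public:
    const std::string& captured() const { return data_; }

protected:
    // Called for single characters, since no put area is set up.
    virtual int_type overflow(int_type ch)
    {
        if (!traits_type::eq_int_type(ch, traits_type::eof()))
            data_ += traits_type::to_char_type(ch);
        return traits_type::not_eof(ch);
    }

    // Called for blocks of characters.
    virtual std::streamsize xsputn(const char* s, std::streamsize n)
    {
        data_.append(s, static_cast<std::string::size_type>(n));
        return n;
    }

private:
    std::string data_;
};
```

To use it, construct a CaptureBuf buf, then std::ostream os(&buf), and write with os << as with cout. Numbers are formatted by the ostream before the characters reach the buffer.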
Last edited on Apr 18, 2009 at 1:37am
Apr 19, 2009 at 9:56am
my god, so many updates...
my cousin's wedding is coming up.. that's why I haven't been keeping up to date with this post..

will see in a couple of days whats going on.. and where are we heading.. :)