Printing multi-byte characters

Dec 15, 2012 at 7:01pm

The UTF-8 code units of the character 我 are: E6 88 91.

Suppose that I have these hexadecimal values stored in an array:

unsigned char bytesArray[3];
bytesArray[0] = 0xE6;  // plain hex literals; '0xE6' in quotes would be a multi-character literal
bytesArray[1] = 0x88;
bytesArray[2] = 0x91;

Given this representation of a multi-byte character, I would like to print the corresponding glyph (我) on the console.
How would you suggest I do that?

Thank you for helping.
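
For reference, the bytes E6 88 91 follow from the standard UTF-8 bit layout applied to the code point U+6211 (我). The sketch below shows that arithmetic; `encode_utf8` is a hypothetical helper name, not something from this thread, and it assumes a valid code point without checking it.

```cpp
#include <string>

// Encode one Unicode code point as UTF-8.
// Assumption: cp is a valid scalar value (<= 0x10FFFF, not a surrogate);
// this sketch performs no validation.
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {                 // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling `encode_utf8(0x6211)` takes the three-byte branch: the top four bits 0110 give E6, the middle six bits 001000 give 88, and the low six bits 010001 give 91.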

Dec 15, 2012 at 7:13pm

Disch (13742)

Sadly this is OS dependent. On *nix terminals I believe it outputs UTF-8 by default. But on Windows I believe you have to change a setting. I forget exactly how to do it though.
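
The post does not say which Windows setting is meant. One candidate is the console output code page, which the Win32 call `SetConsoleOutputCP` can switch to UTF-8 (the same effect as running `chcp 65001`). The sketch below assumes that is the setting in question; `enable_utf8_console` is a hypothetical helper name, and whether the glyph actually shows up still depends on the console font.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Ask the Windows console to interpret the program's output as UTF-8.
// Returns true on success. On non-Windows systems there is nothing to
// change, so the function simply reports success.
bool enable_utf8_console()
{
#ifdef _WIN32
    return SetConsoleOutputCP(CP_UTF8) != 0;
#else
    return true;
#endif
}
```

It would be called once at the start of `main`, before any output is written.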

Dec 15, 2012 at 7:23pm

Cubbi (4774)

This is becoming a frequently asked question recently.

On Linux, and other systems that support UTF-8 at the console driver level, just print it:

#include <iostream>

int main()
{
    char bytesArray[3] = {'\xE6', '\x88', '\x91'};
    std::cout.write(bytesArray, sizeof bytesArray);
}

demo: http://ideone.com/BdRpfZ

On stricter systems, you have to set a locale to select the output encoding (after all, why default to UTF-8? It could just as well have been GB18030). Locale names are OS-dependent. I'm using the POSIX locale for US English below, but any UTF-8 locale would work the same way.

#include <iostream>
#include <locale>

int main()
{
    char bytesArray[3] = {'\xE6', '\x88', '\x91'};
    std::locale::global(std::locale("en_US.utf8"));
    std::cout.imbue(std::locale());
    std::cout.write(bytesArray, sizeof bytesArray);
}
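
Since locale names are OS-dependent, hard-coding "en_US.utf8" can throw `std::runtime_error` on a system that spells the name differently or does not have that locale installed. A possible way around this, sketched below, is to try several candidate names in turn. `first_available_locale` is a hypothetical helper, and the candidate names in the usage note are assumptions about common spellings, not a complete list.

```cpp
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Return the first locale from `names` that the system accepts,
// or the classic "C" locale if none of them is available.
std::locale first_available_locale(const std::vector<std::string>& names)
{
    for (const std::string& name : names) {
        try {
            return std::locale(name);
        } catch (const std::runtime_error&) {
            // Not available under this name; try the next spelling.
        }
    }
    return std::locale::classic();
}
```

For example, `std::locale::global(first_available_locale({"en_US.utf8", "en_US.UTF-8", "C.UTF-8"}));` would replace the hard-coded call above. Note that if the fallback "C" locale is returned, the output is no better off than without a locale.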

That is how C++ is supposed to work with Unicode. (The same goes for C, for that matter: printf() and scanf() deal in multibyte sequences.)

Now, on systems that did not bother implementing Unicode for their console output, you have to convert from UTF-8 to a wide string, and then output it using the standard wide character functionality (which also requires a locale to be set).

You can do it the C++11 way:

#include <iostream>
#include <locale>
#include <codecvt>
#include <string>

int main()
{
    char bytesArray[] = {'\xE6', '\x88', '\x91'};
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> conv;
    std::wstring wide = conv.from_bytes(bytesArray,
                                        bytesArray + sizeof bytesArray);

    std::locale::global(std::locale("en_US.utf8"));
    std::wcout.imbue(std::locale());
    std::wcout << wide << '\n';
}

(tested with clang++ on Linux)

Or the C way:

#include <iostream>
#include <locale>
#include <cwchar>

int main()
{
    std::locale::global(std::locale("en_US.utf8"));
    std::wcout.imbue(std::locale());

    char bytesArray[] = {'\xE6', '\x88', '\x91'};

    std::mbstate_t state = std::mbstate_t();
    const char* end = bytesArray + sizeof bytesArray;
    const char* ptr = bytesArray;
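    std::size_t len;
    wchar_t wc;
    // mbrtowc returns the number of bytes consumed, 0 for a null character,
    // or (size_t)-1 / (size_t)-2 for an invalid / incomplete sequence
    while( (len = std::mbrtowc(&wc, ptr, end-ptr, &state)) != 0
           && len != static_cast<std::size_t>(-1)
           && len != static_cast<std::size_t>(-2))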
    {
        std::wcout << wc;
        ptr += len;
    }
}

(tested with gcc on Linux)

On Windows, you can use either the C++11 way or the C way, but you also have to enable wide character output on the console using its special non-portable method:

#include <iostream>
#include <codecvt>
#include <string>
#include <fcntl.h>
#include <io.h>

int main()
{
    char bytesArray[] = {'\xE6', '\x88', '\x91'};
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> conv;
    std::wstring wide = conv.from_bytes(bytesArray,
                                        bytesArray + sizeof bytesArray);

    // Microsoft-specific: switch stdout to UTF-16 (wide) text mode
    _setmode(_fileno(stdout), _O_WTEXT);
    std::wcout << wide << '\n';
}

Tested with Visual Studio 2012, but I remember this working with 2010 as well. Note that the default console fonts on most installations of Windows do not include those characters. Either install a font that does, or just print your output to a file, which you can then open with Notepad (but in that case you'll need more than just one Chinese character for Notepad's autodetection to realize it's dealing with Unicode).
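
For the print-to-a-file route, one way to avoid relying on Notepad's autodetection is to start the file with the UTF-8 byte order mark (EF BB BF), which Notepad recognizes. This is a sketch of that alternative, not the method tested above; `write_utf8_with_bom` and the file name in the usage note are hypothetical.

```cpp
#include <fstream>
#include <string>

// Write UTF-8 text to a file, preceded by the UTF-8 byte order mark
// (EF BB BF). Returns false if the file could not be opened or written.
bool write_utf8_with_bom(const std::string& path, const std::string& utf8)
{
    // Binary mode keeps the bytes exactly as given (no newline translation).
    std::ofstream out(path, std::ios::binary);
    if (!out) return false;
    out.write("\xEF\xBB\xBF", 3);
    out.write(utf8.data(), static_cast<std::streamsize>(utf8.size()));
    return static_cast<bool>(out.flush());
}
```

With this, `write_utf8_with_bom("out.txt", "\xE6\x88\x91")` produces a six-byte file holding the mark followed by the single character, with no wide-character conversion involved.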

Last edited on Dec 15, 2012 at 8:08pm

Dec 16, 2012 at 11:56am

Lea Massiot (8)

Thank you very much for this answer.
