What is the encoding of CString?

This code produces strange output:


		CStringW input(L"Muñoz!");
		std::string res = convert(input, CP_UTF8);  
		CStringW output = convert(res, CP_UTF8);

		std::cout << '"' << res << "\" " << std::endl; 
		std::wcout << output.GetBuffer() << std::endl;



This is the output:


"Mu├▒oz!"
Mu±oz!




Why is this happening? What am I doing wrong?


Is this function correct?

I think there is no reason to create a temporary std::wstring object. Try like this directly:
result = CStringW(buffer.data(), ret);

Also, parameter "targetEncoding" should probably be called "sourceEncoding" here, because that's what it is. MultiByteToWideChar() always converts to UTF-16; it is the input encoding (i.e. the encoding of the given multi-byte string) that you can control.

Is a null byte being added to the end?
MultiByteToWideChar() does not append a terminating NULL character to the output if the input length is given explicitly and that given length doesn't include a terminating NULL character. Indeed, std::string::length() does not count the terminating NULL character!

But there is no problem, as long as we use the CStringW constructor that explicitly takes the string length as argument 😏
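
For reference, the convert() helper itself is not shown in this thread. A minimal sketch of what the multi-byte → wide overload might look like, following the advice above (the exact name, parameters and the std::vector buffer are assumptions, not the original code):

#include <windows.h>
#include <atlstr.h>
#include <string>
#include <vector>

// Sketch: multi-byte input (in the given source encoding) -> UTF-16 CStringW
CStringW convert(const std::string &input, UINT sourceEncoding)
{
    if (input.empty())
        return CStringW();

    // First call: determine the required buffer size, in wchar_t units
    const int ret = MultiByteToWideChar(sourceEncoding, 0,
        input.data(), static_cast<int>(input.length()), NULL, 0);

    // Second call: perform the actual conversion (no NUL is appended,
    // because the input length was given explicitly)
    std::vector<wchar_t> buffer(ret);
    MultiByteToWideChar(sourceEncoding, 0,
        input.data(), static_cast<int>(input.length()),
        buffer.data(), ret);

    // Use the CStringW constructor that takes an explicit length
    return CStringW(buffer.data(), ret);
}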
"Mu├▒oz!"
Why is this happening? What am I doing wrong?

Avoid having non-ASCII characters in your source code. The text encoding of the source code file itself may fool you!

Try like this:
const wchar_t *const my_string = L"Mu\x00F1oz!";
MessageBoxW(NULL, my_string, L"Test", MB_OK);

Note that 0x00F1 is the UTF-16 encoding of "Latin Small Letter N with Tilde" (ñ). See here for details:
https://www.compart.com/en/unicode/U+00F1
Getting the terminal (console) to correctly show "foreign" characters is a whole different story 😠

Try something like this:
CStringW input(L"Mu\x00F1oz!");
std::string str_utf8 = convert(input, CP_UTF8);
    
DWORD written;
SetConsoleOutputCP(CP_UTF8);
WriteConsoleA(GetStdHandle(STD_OUTPUT_HANDLE), str_utf8.data(), static_cast<DWORD>(str_utf8.length()), &written, NULL);

Note: We use SetConsoleOutputCP() to tell the console which character encoding it should use to interpret the bytes that we are going to print. In this case, we need to set it to UTF-8. Also note that we print directly with the WriteConsoleA() function!

(Printing via standard library facilities like std::cout or printf() adds yet another layer where your string may get messed up.)
How can we get a C string from a CStringT<wchar_t>??? As far as I know, it's not possible!!

It is VERY possible. You declare your C string to be either a char array OR a wchar_t array.

Which character type is used for your C string depends on what your CString type is. The two have to match.
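
A minimal sketch to illustrate, assuming the ATL/MFC CString classes (both variants provide GetString()):

#include <atlstr.h>
#include <stdio.h>

int main()
{
    CStringW wide(L"Mu\x00F1oz!");
    const wchar_t *wideC = wide.GetString();   // wide C string (UTF-16)

    CStringA narrow("Munoz!");
    const char *narrowC = narrow.GetString();  // narrow C string (ANSI/MBCS)

    wprintf(L"%s\n", wideC);  // the character type must match the CString variant
    printf("%s\n", narrowC);
}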
@kigar64551 OK, great for using Windows functions... However, I am trying to stick to standard C++20 and would like to create a locale for the CP_UTF8 encoding, say locUTF8, and then do a
 
std::cout.imbue(locUTF8) 


so that

 
std::cout << str_utf8 ;



works as well as your purely Windows solution above. How can we create a locale for a specific code page?


Also, what if I want to call MessageBoxW() with the UTF-16 encoded string? I tried calling MessageBoxW() with CStringW rr = L"Mu\x00F1oz" and it works, but if I encode this as UTF-16 it displays garbage!!






MessageBoxW() is the "wide string" version of MessageBox(), i.e. it requires a wchar_t* string and it assumes the UTF-16 encoding.

And that is exactly the reason why it does work with your CStringW rr = L"Mu\x00F1oz" – which is encoded in UTF-16.

Meanwhile, MessageBoxA() is the "multi-byte" (ANSI) version of MessageBox(), i.e. it requires a char* string and it assumes that the given string is encoded in whatever "ANSI" codepage happens to be configured on the local system...
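
To illustrate the point, here is a sketch (not the poster's code): if the text currently lives in a multi-byte encoding such as UTF-8, convert it back to UTF-16 first and call the "W" function, instead of feeding the raw bytes to MessageBoxA():

#include <windows.h>
#include <string>

int main()
{
    // UTF-8 bytes: U+00F1 (n with tilde) is encoded as 0xC3 0xB1
    const std::string utf8 = "Mu\xC3\xB1oz!";

    // Convert UTF-8 -> UTF-16 (first call computes the required size)
    const int len = MultiByteToWideChar(CP_UTF8, 0,
        utf8.data(), static_cast<int>(utf8.length()), NULL, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0,
        utf8.data(), static_cast<int>(utf8.length()), &wide[0], len);

    MessageBoxW(NULL, wide.c_str(), L"Test", MB_OK);
}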
how can I find out which ANSI codepage is configured on the local system?
Can we change programmatically to use a different codepage?

how can I find out which ANSI codepage is configured on the local system?

As I said before, GetACP() provides that info.
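
For example (GetACP() is declared in <windows.h>):

#include <windows.h>
#include <iostream>

int main()
{
    // Typically prints 1252 on US / Western European systems
    std::cout << "ANSI codepage: " << GetACP() << std::endl;
}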

Can we change programmatically to use a different codepage?

It is configured in the Control Panel of Windows. Maybe you could somehow change it "programmatically" via some registry entry, but a reboot would probably be required to make the change take effect. Also, it would affect all applications on the system!

https://i.imgur.com/YwUYDTw.png

Note: By "non-Unicode programs", Windows means anything that uses multi-byte (ANSI) strings, as opposed to UTF-16 (Unicode).
Thank you. Last question: can we change the console output to CP_UTF8 using only standard C++ instead of Windows functions? How do we build the locale to specify UTF-8 and then imbue it into std::cout?

I am sorry to bother you so much! I really appreciate all your help!!



I think this is very specific to Windows and probably can't be done with "generic" C++ functions alone.

What you actually can do:
#include <atlstr.h>  // for CStringW (ATL/MFC)
#include <fcntl.h>
#include <io.h>
#include <iostream>
#include <stdio.h>   // for _fileno()

int main()
{
    _setmode(_fileno(stdout), _O_U8TEXT);
    CStringW input(L"Mu\x00F1oz!");
    std::wcout << input.GetString() << std::endl;
}


Note: We can send a wide string (UTF-16) to std::wcout, but – by default – the C runtime is still going to translate the given string to the local ANSI codepage before it is sent to stdout (the terminal), thus messing up any Unicode characters that can't be represented in the local ANSI codepage! We change this behavior with the _setmode() function and thus force stdout to output UTF-8 to the terminal 😏

The required SetConsoleOutputCP(CP_UTF8) appears to be implicit with this solution.

BUT: Be aware that, once _setmode() has been used to set stdout to UTF-8 mode, only writing to std::wcout works, whereas any attempt to write to std::cout will cause an abnormal program termination. I don't think Microsoft has ever fixed this... 🙄

See also:
https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/setmode?view=msvc-170#remarks
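
A minimal sketch of that pitfall, per the remarks linked above (byte-oriented output is not allowed on a stream in Unicode mode):

#include <fcntl.h>
#include <io.h>
#include <iostream>
#include <stdio.h>

int main()
{
    _setmode(_fileno(stdout), _O_U8TEXT);
    std::wcout << L"fine" << std::endl;  // OK: wide output
    std::cout << "boom" << std::endl;    // CRT assertion / abnormal termination!
}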

________

Also note that certain "foreign" characters still won't appear correctly in the terminal, even if all of the above is done correctly, because the "monospace" (typewriter) font used by the terminal may not support these characters 🙄🙄🙄