What is the encoding of CString?

CString is a wchar_t-based string type. What encoding does it use, and how can one convert a CString to std::string?


Thanks!
Juan
CString simply is an alias for CStringT<TCHAR, ...>.

TCHAR is defined as either wchar_t or char, depending on whether your project is configured with the Unicode or Multi-Byte character set.

There also are CStringA and CStringW that are defined as CStringT<char, ...> and CStringT<wchar_t, ...>, respectively.
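
The way one name resolves to two different character types can be mimicked in a few lines. This is only a simplified sketch: my_tchar, my_string_t, my_string_a, my_string_w and my_string are hypothetical stand-ins for TCHAR, CStringT, CStringA, CStringW and CString, and the real definitions in the Windows SDK and ATL/MFC headers take more template parameters.

```cpp
#include <type_traits>

// Hypothetical stand-in for TCHAR: the UNICODE macro picks the character type.
#ifdef UNICODE
using my_tchar = wchar_t;   // project uses the "Unicode" character set
#else
using my_tchar = char;      // project uses the "Multi-Byte" character set
#endif

// Hypothetical stand-in for CStringT<CharT, ...>.
template <typename CharT>
struct my_string_t {
    using char_type = CharT;
};

using my_string_a = my_string_t<char>;      // plays the role of CStringA
using my_string_w = my_string_t<wchar_t>;   // plays the role of CStringW
using my_string   = my_string_t<my_tchar>;  // plays the role of CString

// Exactly one of the two holds, depending on whether UNICODE is defined.
static_assert(std::is_same<my_string, my_string_a>::value ||
              std::is_same<my_string, my_string_w>::value,
              "my_string must be one of the two explicit variants");
```

Using the explicit A or W variant removes the dependency on the project setting, which is why the later posts ask which of the two is actually in use.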

Typically, char-based strings use whatever character encoding is configured as the "ANSI" Codepage on the system where the program runs.

And wchar_t-based strings typically use the Unicode character set with UTF-16 encoding, at least on Windows.
OK, but if TCHAR is chosen as wchar_t, what is the CString encoding: UTF-16 or Unicode? And how can we translate from this wide-character string to std::string?


Unicode is a character set that assigns a unique number ("code-point") to each character.

How those Unicode characters (code-points) are actually stored or transferred is defined by the specific encoding!

UTF-16 is one such Unicode encoding. UTF-8 is another popular Unicode encoding.

As said before, on Windows, where wchar_t is 16-Bit in size (per character), UTF-16 is typically used for "Unicode" strings.
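
To make the difference between the character set and its encodings concrete, here is a minimal sketch in portable C++ that encodes a single code point both ways. The function names encode_utf8 and encode_utf16 are made up for this illustration, and the code assumes it is given a valid code point (at most 0x10FFFF and not a surrogate).

```cpp
#include <string>

// Encode one Unicode code point as UTF-8 bytes (1 to 4 bytes).
// Assumes cp is a valid code point; no validation is done in this sketch.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encode the same code point as UTF-16 code units (1 or 2 units).
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);
    } else {
        // Code points above U+FFFF are split into a surrogate pair.
        const char32_t v = cp - 0x10000;
        out += static_cast<char16_t>(0xD800 | (v >> 10));
        out += static_cast<char16_t>(0xDC00 | (v & 0x3FF));
    }
    return out;
}
```

For example, the code point U+20AC (the euro sign) becomes three bytes in UTF-8 but a single 16-bit unit in UTF-16, while U+1F604 becomes four bytes in UTF-8 and two units (a surrogate pair) in UTF-16. The code point is the same in all cases; only the stored representation differs.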

______

std::string simply is a wrapper for a sequence of char's. It can store whatever "multi-byte" character encoding that you like 😄

Possibilities include UTF-8 (Unicode) or Latin-1 (ISO 8859-1).

As far as the Win32 API is concerned, functions dealing with char-strings assume the "ANSI" Codepage configured on the local system.

You can use GetACP() to detect the "ANSI" Codepage that is configured on the current system...

________

To convert between char-based and wchar_t-based strings, see here:

https://docs.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-multibytetowidechar
https://docs.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-widechartomultibyte
Ok to all that info... but still unanswered is "how can we translate from this wide character CString to std::string?"

How about this code:


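// Requires <locale>, <string> and <vector>.
// Narrows each wchar_t separately using the locale's ctype facet; characters
// without a single-char equivalent in that locale become '?'.
inline std::string to_string(const std::wstring& str, const std::locale& loc = std::locale{})
{
	std::vector<char> buf(str.size());
	std::use_facet<std::ctype<wchar_t>>(loc).narrow(str.data(), str.data() + str.size(), '?', buf.data());

	return std::string(buf.data(), buf.size());
}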


??
This totally depends on two things:

1. Is your CString actually CStringT<char> or CStringT<wchar_t>?

2. What character encoding do you want your std::string to be encoded in? Latin-1? UTF-8? User's local "ANSI" codepage?
I get this deprecated in C++ 17!!
Ok:

CString is actually CStringT<wchar_t>.

What would be the solution if the desired character encoding was:

1- Latin-1 ???
2- UTF-8 ???
3- user's local ANSI codepage ???

How would we program these three ways of converting a CString?


What is the encoding of CString?

It depends on what character type was used to create the CString.

CString does not have to use wchar_t as the type; it can use char. Or TCHAR.

Choose one or the other and explicitly declare your CString type to be either char or wchar_t.

https://docs.microsoft.com/en-us/cpp/atl-mfc-shared/reference/cstringt-class?view=msvc-170

You can then get a C string from that CString, which can then be plugged into creating a C++ std::string.

https://docs.microsoft.com/en-us/cpp/atl-mfc-shared/cstring-operations-relating-to-c-style-strings?view=msvc-170#_core_using_cstring_as_a_c.2d.style_null.2d.terminated_string
Try something like this:

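#include <atlstr.h>   // CStringW
#include <Windows.h>  // WideCharToMultiByte, CP_UTF8
#include <iostream>
#include <string>
#include <vector>

// Converts a UTF-16 CStringW to a char-based string in the given code page.
// Returns an empty string if the input is empty or the conversion fails.
static std::string convert(const CStringW& str, const UINT targetEncoding)
{
    std::string result;
    // First call: ask how many bytes the converted string needs.
    const int size = WideCharToMultiByte(targetEncoding, 0U, str.GetString(), str.GetLength(), NULL, 0, NULL, NULL);
    if (size > 0)
    {
        std::vector<char> buffer(size);
        // Second call: perform the actual conversion into the buffer.
        const int ret = WideCharToMultiByte(targetEncoding, 0U, str.GetString(), str.GetLength(), buffer.data(), (int)buffer.size(), NULL, NULL);
        if ((ret > 0) && (ret <= size))
        {
            // Copy exactly the number of bytes that were written.
            result.assign(buffer.data(), static_cast<std::size_t>(ret));
        }
    }
    return result;
}

int main()
{
    CStringW input(L"Hello W0rld!");
    std::cout << '"' << convert(input, CP_UTF8) << '"' << std::endl;
}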


The above function takes a CStringW as input, assuming that it contains a Unicode string in UTF-16 encoding.

The desired output encoding can be selected by the parameter. I chose CP_UTF8 (UTF-8) in the example.
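
For two of the three targets asked about earlier (Latin-1 and UTF-8), the conversion itself is simple enough to sketch in portable C++, which may help to see what the Win32 function has to do internally. This is not what WideCharToMultiByte() actually runs; the names next_code_point, utf16_to_latin1 and utf16_to_utf8 are made up here, std::u16string stands in for the UTF-16 content of a CStringW, and replacing unpaired surrogates with U+FFFD is an assumption of this sketch. With the Win32 route, Latin-1 should correspond to code page 28591 as far as I know, and the third target (the local "ANSI" Codepage, CP_ACP) depends on the system, so it is not covered here.

```cpp
#include <cstddef>
#include <string>

// Read one code point from UTF-16 input starting at index i and advance i.
// Unpaired surrogates are replaced by U+FFFD (a policy chosen for this sketch).
char32_t next_code_point(const std::u16string& s, std::size_t& i) {
    const char16_t u = s[i++];
    if (u >= 0xD800 && u <= 0xDBFF && i < s.size() &&
        s[i] >= 0xDC00 && s[i] <= 0xDFFF) {
        const char16_t lo = s[i++];
        return 0x10000 + ((static_cast<char32_t>(u) - 0xD800) << 10)
                       + (static_cast<char32_t>(lo) - 0xDC00);
    }
    if (u >= 0xD800 && u <= 0xDFFF) return 0xFFFD;  // lone surrogate
    return u;
}

// Latin-1 holds exactly the code points U+0000..U+00FF, one byte each.
// Anything else cannot be represented and becomes the fallback character.
std::string utf16_to_latin1(const std::u16string& s, char fallback = '?') {
    std::string out;
    for (std::size_t i = 0; i < s.size();) {
        const char32_t cp = next_code_point(s, i);
        out += (cp <= 0xFF) ? static_cast<char>(cp) : fallback;
    }
    return out;
}

// UTF-8 can represent every code point, using 1 to 4 bytes.
std::string utf16_to_utf8(const std::u16string& s) {
    std::string out;
    for (std::size_t i = 0; i < s.size();) {
        const char32_t cp = next_code_point(s, i);
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Note the practical difference: the Latin-1 conversion is lossy (the euro sign, for instance, turns into the fallback character), whereas the UTF-8 conversion loses nothing for well-formed input.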

You can then get a C string from that CString, which can then be plugged into creating a C++ std::string.


How can we get a C string from a CStringT<wchar_t>? As far as I know, it's not possible!

How can we get a C string from a CStringT<wchar_t>? As far as I know, it's not possible!

See example I posted above 😏
ok!!

Question: where did you get the value for UTF-8 (CP_UTF8)? From which header?
It's defined by <Windows.h>, or by something that implicitly gets included when <Windows.h> is included.

You need to include <Windows.h> anyway, for WideCharToMultiByte() function.

But, as said before, if you want the default "ANSI" Codepage of the local system, you can simply use GetACP() function.
In WinNls.h I found this:



//  Code Page Default Values.
//  Please Use Unicode, either UTF-16 (as in WCHAR) or UTF-8 (code page CP_ACP)
//
#define CP_ACP                    0           // default to ANSI code page
#define CP_OEMCP                  1           // default to OEM  code page
#define CP_MACCP                  2           // default to MAC  code page
#define CP_THREAD_ACP             3           // current thread's ANSI code page
#define CP_SYMBOL                 42          // SYMBOL translations

#define CP_UTF7                   65000       // UTF-7 translation
#define CP_UTF8                   65001       // UTF-8 translation 


So for UTF-8, which one should be used: CP_ACP or CP_UTF8?


CP_UTF8 means UTF-8.

CP_ACP means "whatever happens to be configured as the 'ANSI' Codepage on the local machine"


Note: On an English system, CP_ACP probably is Windows-1252, but it can be changed in the Windows control panel to something else.

Don't make any assumptions about what the local "ANSI" Codepage might be. It can be different on each computer!
@kigar64551 great answer! But I have another related question: what would be the code to convert std::string to CStringW? And is there no way to do both of these conversions using only standard C++20?

what would be the code to convert std::string to CStringW?

Use MultiByteToWideChar() function:
https://docs.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-multibytetowidechar

Given my example on how to use WideCharToMultiByte(), you will probably figure it out...

(it helps to read the Microsoft documentation on these functions!)
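
As for doing it with standard C++ only: std::wstring_convert and the std::codecvt_utf8_utf16 facet were deprecated in C++17, and to my knowledge C++20 offers no replacement in the standard library, so a portable version has to decode by hand. Below is a minimal sketch for the std::string (UTF-8) to UTF-16 direction. The name utf8_to_utf16 is made up, std::u16string stands in for a wide string, and the handling of bad input is an assumption of this sketch: malformed sequences become U+FFFD, while overlong forms and encoded surrogates are not rejected.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 bytes into UTF-16 code units.
// Malformed input becomes U+FFFD; this sketch does not reject overlong
// encodings or surrogates that were encoded as UTF-8.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b = static_cast<unsigned char>(in[i++]);
        char32_t cp;
        int extra;  // number of continuation bytes expected after the lead byte
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else                         { cp = 0xFFFD;   extra = 0; }  // stray byte

        for (; extra > 0; --extra) {
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
                cp = 0xFFFD;  // truncated or broken sequence
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i++]) & 0x3F);
        }
        if (cp > 0x10FFFF) cp = 0xFFFD;  // outside the Unicode range

        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            // Split into a surrogate pair.
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 | (cp >> 10));
            out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));
        }
    }
    return out;
}
```

On Windows, where wchar_t is 16 bits wide, the resulting code units could then be copied into a wchar_t buffer. This only covers UTF-8 input, though; a std::string in the local "ANSI" Codepage still needs the operating system's tables, which is what MultiByteToWideChar() provides.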
Don't forget that std::string has a wide counterpart, std::wstring, which you could use without any conversion. Maybe that works for whatever you are doing?
Is this function correct? Is a null byte being added to the end? Do I have dangling references by initializing result with c_str()?



	static CStringW convert(const std::string str, const int targetEncoding)
	{
		CStringW result;
		const int size = MultiByteToWideChar(targetEncoding, 0U, str.c_str(), str.length(), NULL, 0);
		if (size > 0)
		{
			std::vector<wchar_t> buffer(size);
			const int ret = MultiByteToWideChar(targetEncoding, 0U, str.c_str(), str.length(), buffer.data(), (int)buffer.size());
			if ((ret > 0) && (ret <= size))
			{
				result = std::wstring(buffer.cbegin(), buffer.cend()).c_str();
			}
		}
		return result;
	}


??