which is the best way to convert a wstri

http://coliru.stacked-crooked.com/a/9a0db54094421729

I have the following, which works but seems quite ugly and overly complex to me.

auto& f = std::use_facet<std::codecvt<wchar_t, char, std::mbstate_t>>(std::locale());

wstring internal{ L"Hello ¢" };

std::mbstate_t mb{}; // initial shift state
std::string external(internal.size() * f.max_length(), '\0');
const wchar_t* from_next;
char* to_next;
f.out(mb, &internal[0], &internal[internal.size()], from_next,
	&external[0], &external[external.size()], to_next);
// error checking skipped for brevity
external.resize(to_next - &external[0]);

Sep 30, 2018 at 5:10am

JLBorges (13770)

Though deprecated, <codecvt> and std::wstring_convert will be around for a while.

#include <locale>
#include <codecvt>

std::string to_string( const std::wstring& wstr )
{
    static std::wstring_convert< std::codecvt_utf8<wchar_t>, wchar_t > converter ;
    
    return converter.to_bytes(wstr) ;
}
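
If relying on the deprecated facility is a concern, the conversion can also be written by hand. This is only a minimal sketch of a different technique (a hand-rolled UTF-8 encoder), not what wstring_convert does internally. The name wide_to_utf8 is made up for illustration, and the code assumes wchar_t holds UTF-16 code units when it is 16 bits wide (as on Windows) and UTF-32 code points otherwise.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of any library): encode a wide string as UTF-8.
// Assumption: 16-bit wchar_t carries UTF-16, wider wchar_t carries UTF-32.
std::string wide_to_utf8(const std::wstring& ws)
{
    std::string out;
    for (std::size_t i = 0; i < ws.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(ws[i]);

        // With 16-bit wchar_t, combine a surrogate pair into one code point.
        if (sizeof(wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF && i + 1 < ws.size()) {
            const std::uint32_t lo = static_cast<std::uint32_t>(ws[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        // Unpaired surrogates and values past U+10FFFF cannot be encoded.
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            throw std::range_error("invalid code point");

        if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes: 11110xxx followed by three 10xxxxxx
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Calling wide_to_utf8(L"Hello \xA2") should give the 8 bytes 0x48 0x65 0x6c 0x6c 0x6f 0x20 0xc2 0xa2.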

Oct 1, 2018 at 6:27pm

I tried this:

auto res = to_string(L"Hello ¢");

and the result was:


Hello Â¢

Where did that strange version of A come from? Is this correct?

Also I tried to use wstring_convert to convert from string to wstring and got an exception. The exact code is:

static std::wstring_convert< std::codecvt_utf8<wchar_t>> converter;
return converter.from_bytes(str);

What is going on?

Regards,
Juan

Oct 1, 2018 at 6:47pm

Originally I had these functions for converting between string and wstring:

// convert string to wstring
inline std::wstring to_wstring(const std::string& str, const std::locale& loc = std::locale{})
{
	std::vector<wchar_t> buf(str.size());
	std::use_facet<std::ctype<wchar_t>>(loc).widen(str.data(), str.data() + str.size(), buf.data());

	return std::wstring(buf.data(), buf.size());
}

// convert wstring to string
inline std::string to_string(const std::wstring& str, const std::locale& loc = std::locale{})
{
	std::vector<char> buf(str.size());
	std::use_facet<std::ctype<wchar_t>>(loc).narrow(str.data(), str.data() + str.size(), '?', buf.data());

	return std::string(buf.data(), buf.size());
}

but Cuddi thought they were not the best way to go about it. Instead he said to use codecvt_utf8...

Oct 1, 2018 at 7:05pm

the result was:
Hello Â¢

How do you observe that result? I am going to guess you're using some Windows compiler and attempting to print a Unicode string on the console, which, on Windows, requires non-trivial setup and is a whole different topic.

I tried to use wstring_convert to convert from string to wstring and got an exception

What exception exactly? What was the input string? If the exception was std::range_error, it means your input was invalid, which may happen when using string literals on e.g. misconfigured Visual Studio, and you can verify that by examining the input string in the debugger.

Last edited on Oct 1, 2018 at 7:14pm

Oct 1, 2018 at 8:56pm

The result is observed in the Visual Studio debugger.
Here is the call with the input causing an exception:

auto res2 = to_wstring("Hello ¢");

The exception is "bad conversion" thrown by the last line of this code:

std::wstring_convert< std::codecvt_utf8<wchar_t>> converter;
return converter.from_bytes(str);

Regards,
Juan

Oct 1, 2018 at 8:59pm

Also, when converting from string to wstring isn't the effect equivalent to inserting 1 byte with value 0 before each original character in the string? --- in other words nothing is lost and there should not be a conversion problem....

Oct 2, 2018 at 2:49am

poteto (525)

codecvt_utf8_utf16

Huh, I guess this is codecvt's overly complex way of handling conversions on linux to use UCS2 bits instead of UCS4 while having the same result on windows with codecvt_utf8... But it doesn't matter since linux uses utf-8 in a way you never need to convert...

https://en.cppreference.com/w/cpp/locale/wstring_convert/from_bytes

Overall on Windows it is usually easier to make all your strings wide for internationalization, but when you think about it the actual "internationalization" part is much more complicated than using UTF-8, and you need to put all your translations into an XML/JSON file or use a library like ICU or gettext...

So at the end of the day, codecvt is probably not what you are looking for. Perhaps what you are looking for is the proprietary Windows conversion, WideCharToMultiByte with CP_ACP to refer to your current code page, and just using the limited narrow encoding on Windows. If you choose to use the CMD window on Windows 7, code page 65001 (a.k.a. the UTF-8 code page) is a lie and doesn't actually work; I'm not sure about the support on Windows 8 or 10. So if you want almost-Unicode text in the prompt, you need to use the narrow text, unless you want to convert UTF-8 to a code-page encoding. Note also that when you put a wide string into wcout, it internally just gets converted to the narrow encoding before printing.

Last edited on Oct 2, 2018 at 5:47pm

Oct 2, 2018 at 1:43pm

JUAN DENT wrote:
observed in the Visual Studio

Finally, the most crucial piece of information.

JUAN DENT wrote:
when converting from string to wstring isn't the effect equivalent to inserting 1 byte with value 0 before each original character in the string?

No, not at all.

For the strings in your example,
* "Hello ¢" has size 8 and consists of 0x48 0x65 0x6c 0x6c 0x6f 0x20 0xc2 0xa2 (this is what I meant when I said "you can verify that by examining the input string". If it isn't those 8 bytes, your Visual Studio is misconfigured. Possibly missing /utf-8 in Project->Properties->C/C++->Command Line)
* L"Hello ¢" has size 7 and consists of 0x0048 0x0065 0x006c 0x006c 0x006f 0x0020 0x00a2 (because Microsoft still thinks it's 1995 and wchar_t is 16-bit.. but it's fine for your examples)
* Conversion between the two works with codecvt_utf8 or codecvt_utf8_utf16 - same thing

Here is what I ran on Visual Studio 2015:

#include <cassert>
#include <locale>
#include <codecvt>
int main()
{
	std::string s = "Hello ¢";
	std::wstring ws = L"Hello ¢";
	assert(s.size() == 8);
	assert(ws.size() == 7);
	std::wstring_convert< std::codecvt_utf8<wchar_t>, wchar_t > cvt;
	assert(s == cvt.to_bytes(ws));
	assert(ws == cvt.from_bytes(s));
}

and it compiled and ran with no errors.

Last edited on Oct 2, 2018 at 1:44pm

Oct 2, 2018 at 6:46pm

I ran your code and the first assert fails! I also added /utf-8 in Project->Properties->C/C++->Command Line and it changes nothing.
Examining the string s, it contains only 7 characters consisting exactly of:

0x48 0x65 0x6c 0x6c 0x6f 0x20 0xa2

So, where do I configure Visual Studio 2017?

Thanks for your time...

Oct 2, 2018 at 7:05pm

JUAN DENT wrote:
Examining the string s, it contains only 7 characters

Is your source file saved as utf-8? File -> Advanced Save Options -> Encoding -> Unicode (UTF-8 with signature)

Oct 2, 2018 at 7:27pm

I saved it with Unicode UTF8 encoding to no avail. Problem persists. Input string is still 7 chars long.

Oct 2, 2018 at 7:48pm

You really shouldn't be using non-ASCII characters in sources, for the reasons you can see in this thread. If your code assumes specific binary values, just write those values directly!

std::string s = "Hello \xC2\xA2";
std::wstring ws = L"Hello \xA2";

Time spent dealing with character encoding of source code is time wasted; this is a solved problem.
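
To make that concrete, here is a small sketch with every non-ASCII byte spelled out, so the result doesn't depend on how the source file is saved or on compiler switches. The three function names are made up for illustration. Note that the 7-byte string observed earlier in the thread matches the second one, not the first.

```cpp
#include <string>

// Hypothetical helpers: the same text in three representations,
// written with escapes only so the source file stays pure ASCII.
inline std::string  utf8_hello()   { return "Hello \xC2\xA2"; }  // UTF-8: cent sign is 0xC2 0xA2
inline std::string  latin1_hello() { return "Hello \xA2"; }      // ISO/IEC 8859-1: cent sign is 0xA2
inline std::wstring wide_hello()   { return L"Hello \xA2"; }     // one wide character, U+00A2
```

utf8_hello().size() is 8, while latin1_hello().size() and wide_hello().size() are both 7.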

Oct 2, 2018 at 8:12pm

But it is not source code that we are encoding ... I am just looking for how to transform a string into a wstring and a wstring into a string.
I am also expecting that the strings to be converted from string to wstring will contain the cent symbol (\xA2), but when I try to convert it, that character throws a "bad conversion" range_error exception!

Oct 2, 2018 at 8:46pm

I am expecting that the strings that are going to be translated from string to wstring are going to contain the cent symbol (\xA2)

You're mixing different things and treating them as equivalent, that's why this isn't making sense to you.

Consider these two byte sequences (I'm purposely not using the word "strings", here):

// '\xA2' rather than 0xA2: brace-initializing a signed char from 0xA2 is a narrowing error.
char seq1[] = { 'H', 'e', 'l', 'l', 'o', ' ', '\xA2' };
char seq2[] = { 'H', 'e', 'l', 'l', 'o', ' ', '\xC2', '\xA2' };

1. They're both different encodings of the string "Hello ¢".
2. seq1 is a valid encoding of that string in ISO/IEC 8859-1 (a.k.a. Latin1).
3. seq2 is a valid encoding of that string in UTF-8.
4. seq1 is an invalid UTF-8 sequence. ~~No UTF-8 sequence ends in a byte whose most significant bit is set.~~ See below.
5. seq2 is also a valid encoding in ISO/IEC 8859-1, but it encodes the string "Hello Â¢" instead.

Your program is going to accept byte sequences (not strings) and it's going to decode them into wide strings. What encoding are you going to assume those byte sequences are in? Latin1, UTF-8, or something else?
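
If the answer is Latin1, the decoding really is the "put a zero byte in front of each character" idea from earlier in the thread. A minimal sketch, assuming the input is ISO/IEC 8859-1 (the function name latin1_to_wide is made up for illustration):

```cpp
#include <string>

// Hypothetical helper: decode bytes as ISO/IEC 8859-1 (Latin1).
// Every byte maps to the code point with the same value, so this never fails.
std::wstring latin1_to_wide(const std::string& bytes)
{
    std::wstring out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes)        // unsigned char avoids sign extension for bytes >= 0x80
        out += static_cast<wchar_t>(b);
    return out;
}
```

Fed seq1, this yields the 7 wide characters of L"Hello \xA2". Fed seq2, it yields 8 wide characters ending in 0x00C2 0x00A2, which is the "Hello Â¢" seen earlier. Which result is "right" depends entirely on the encoding you assumed.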

Last edited on Oct 2, 2018 at 9:22pm

Oct 2, 2018 at 9:17pm

doug4 (1538)

I've been following along right up to this point:

char seq1[] = { 'H', 'e', 'l', 'l', 'o', ' ', '\xA2' };
char seq2[] = { 'H', 'e', 'l', 'l', 'o', ' ', '\xC2', '\xA2' };

3. seq2 is a valid encoding of that string in UTF-8.
4. seq1 is an invalid UTF-8 sequence. No UTF-8 sequence ends in a byte whose most significant bit is set.

I am missing something because both seq1 and seq2 end in the same byte. Why is seq2 valid but not seq1?

Oct 2, 2018 at 9:22pm

Sorry, I messed up the condition.

UTF-8 decoding works as a state machine.

while (!input.empty()){
    byte b = input.pop();
    if (b < 0x80){
        output.push(b); // b is just an ASCII character.
        continue;       // Nothing more to read for this character.
    }
    if (input.empty())
        throw InvalidInput(); // This is the check that seq1 fails.
    byte second_byte_in_multibyte_sequence = input.pop();
    // (Further decoding logic omitted.)
}

A valid UTF-8 sequence is composed of subsequences strung together, each encoding one code point. Decoding happens on a subsequence-by-subsequence basis, such that you can always successfully decode a UTF-8 sequence from the middle, as long as you begin at the first byte of a subsequence. A subsequence can start with an ASCII byte, in which case its length is 1. If the first byte is non-ASCII, the length is strictly greater than 1, and it's encoded in the number of leading most significant bits that are turned on in that first byte (0xC2 is 11000010 in binary, so two bytes). Every following byte of the subsequence has the form 10xxxxxx, which is why 0xA2 (10100010) can continue a subsequence but can never start one.
So {0xC2, 0xA2} and {0xC2, 0xA2, 0xC2, 0xA2} are valid, but {0xC2}, {0xA2}, {0xA2, 0xC2, 0xA2}, {0xC2, 0xA2, 0xA2} are not.
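
Those rules can be sketched as a structural check. This is only an illustration with a made-up function name, and it deliberately looks at nothing but lead-byte length and continuation bytes. A complete decoder must also reject overlong encodings and surrogate code points, which this does not.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: structural UTF-8 check only (lead byte announces the
// length, every following byte must be a continuation byte).
bool is_well_formed_utf8(const std::string& bytes)
{
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len;
        if (lead < 0x80)                len = 1;  // 0xxxxxxx: ASCII
        else if ((lead & 0xE0) == 0xC0) len = 2;  // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) len = 3;  // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) len = 4;  // 11110xxx
        else return false;                        // stray continuation byte or invalid lead

        if (i + len > bytes.size())
            return false;                         // subsequence is cut off
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                return false;                     // continuation bytes must be 10xxxxxx
        }
        i += len;
    }
    return true;
}
```

With this, "\xC2\xA2" and "\xC2\xA2\xC2\xA2" pass, while "\xC2", "\xA2", "\xA2\xC2\xA2" and "\xC2\xA2\xA2" are rejected, matching the list above.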

Last edited on Oct 2, 2018 at 9:31pm

Oct 2, 2018 at 10:38pm

Time spent dealing with character encoding of source code is time wasted; this is a solved problem.

...and the solution is not using Windows. But they'll get it one day, the baby steps they took in the last couple years give hope.

Oct 2, 2018 at 10:54pm