std::string character encoding problem

Hey all!

std::string arrWords[10];
std::vector<std::string> hElemanlar;


......

this->hElemanlar.push_back(std::string(1, this->arrWords[sayKelime][j]));

......

What I am doing is this: every element of arrWords is a std::string. I take
the n-th element of arrWords and push its characters, one at a time, into hElemanlar.

Assuming arrWords[0] is "test", then:

this->hElemanlar.push_back("t");
this->hElemanlar.push_back("e");
this->hElemanlar.push_back("s");
this->hElemanlar.push_back("t");


My problem is that although I have no encoding problems with arrWords, some UTF-8 characters are not printed or handled correctly once they are in hElemanlar.
How can I fix it?
The problem is that UTF-8 uses more than one byte for many characters.

The character ö is stored as two bytes (0xC3, 0xB6). Using your method you will split them so that they are stored as two elements in hElemanlar but that will not display correctly because it's no longer valid UTF-8. You need to keep them together as one element in hElemanlar.

https://en.wikipedia.org/wiki/UTF-8#Description
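
To make the splitting concrete, here is a minimal sketch that reproduces the byte-by-byte loop from the question. The function name split_bytes is made up for illustration; it is not part of the original code.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: same logic as the push_back loop in the question.
// std::string stores bytes, so s[j] is one byte, not one character.
std::vector<std::string> split_bytes(const std::string& s)
{
    std::vector<std::string> out;
    for (std::size_t j = 0; j < s.size(); ++j)
        out.push_back(std::string(1, s[j]));   // one byte per element
    return out;
}
```

Calling split_bytes("test") returns four one-byte strings, which is fine because ASCII characters are single bytes. Calling split_bytes("\xC3\xB6") (the two bytes of ö) returns two elements, "\xC3" and "\xB6", and neither of them is valid UTF-8 on its own.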
Thank you for your response. Can you show me how to split a string so that such characters stay whole?
Here are a couple of ways:

#include <string>
#include <iostream>
#include <vector>
#include <locale>
#include <codecvt>
#include <clocale>
#include <cwchar>   // std::mbrlen, std::mbstate_t

int main()
{
    std::string arrWords[10] = {"ひらがな", "カタカナ"}; 

    // portable, using the C++11 library
    // (note: std::wstring_convert and <codecvt> were deprecated in C++17)
    {
        std::vector<std::string> hElemanlar;
        
        std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> cvt;
        for(char32_t c: cvt.from_bytes(arrWords[0]))
            hElemanlar.push_back(cvt.to_bytes(c));
            
        std::cout << "Printing arrWords[0] from hElemanlar...\n";
        for(std::string& c: hElemanlar)
            std::cout << c << '\n';
    }


    // portable except to Windows, using the C library
    {
        std::vector<std::string> hElemanlar;
        
        std::setlocale(LC_ALL, "en_US.utf8"); // any utf-8 locale works
        std::mbstate_t mb{};
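        std::size_t len;   // std::mbrlen returns std::size_t, not int
        for(const char* p = &arrWords[1][0], *end = p + arrWords[1].size(); p < end; p += len ) {
            len = std::mbrlen(p, end - p, &mb);
            // (size_t)-1: invalid sequence, (size_t)-2: incomplete sequence, 0: null character
            if(len == static_cast<std::size_t>(-1) || len == static_cast<std::size_t>(-2) || len == 0) break;
            hElemanlar.emplace_back(p, p + len);
        }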

        std::cout << "Printing arrWords[1] from hElemanlar...\n";
        for(std::string& c: hElemanlar)
            std::cout << c << '\n';
    }
}


This gives:

Printing arrWords[0] from hElemanlar...
ひ
ら
が
な
Printing arrWords[1] from hElemanlar...
カ
タ
カ
ナ


demo: http://coliru.stacked-crooked.com/a/b7c6f11e42a43b62
Seems like codecvt isn't supported by GCC. Thank you very much anyways.
It is supported by GCC as of version 5.0 (the demo link above uses GCC). Visual Studio and clang's libc++ have had it since about 2010.

If you have neither the C++11 library nor a non-Windows C library available, there are quite a few third-party libraries for Unicode.
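
If pulling in a library is not an option either, splitting by hand is also possible, because the first byte of every UTF-8 sequence encodes its length. The following is only a sketch: the names utf8_length and split_utf8 are made up, and it assumes the input is already valid UTF-8 (continuation bytes are not checked).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: number of bytes in the UTF-8 sequence that starts
// with `lead`. Returns 1 for unexpected bytes so the caller always advances.
std::size_t utf8_length(unsigned char lead)
{
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;                             // stray continuation or invalid byte
}

// Hypothetical helper: one vector element per encoded character.
// Assumes valid UTF-8; continuation bytes are not validated.
std::vector<std::string> split_utf8(const std::string& s)
{
    std::vector<std::string> out;
    for (std::size_t i = 0; i < s.size(); ) {
        std::size_t len = utf8_length(static_cast<unsigned char>(s[i]));
        if (len > s.size() - i) len = s.size() - i;  // truncated sequence at the end
        out.push_back(s.substr(i, len));
        i += len;
    }
    return out;
}
```

With this, hElemanlar = split_utf8(arrWords[0]); would keep ö together as one two-byte element. It needs no locale and no <codecvt>, so it does not depend on what the platform's standard library supports. Note that it splits by code point, so a base letter followed by a combining accent still ends up as two elements.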
Thank you very much. I got it working.
Very strange... I write C++ with cocos2d-x. When I test the code above, it works on a Samsung Galaxy S4 and a Sony Xperia M4 Aqua, but so far it fails on a Samsung Galaxy S3 and a Samsung Tablet SM-T113...