std::string character encoding problem

Dec 23, 2015 at 4:07pm
Hey all!

std::string arrWords[10];
std::vector<std::string> hElemanlar;


......

this->hElemanlar.push_back(std::string(1, this->arrWords[sayKelime][j]));

......

What I am doing is this: every element of arrWords is a std::string. I take
the n-th element of arrWords and push its characters into hElemanlar one at a time.

Assuming arrWords[0] is "test", then:

this->hElemanlar.push_back("t");
this->hElemanlar.push_back("e");
this->hElemanlar.push_back("s");
this->hElemanlar.push_back("t");


My problem is that although I have no encoding problems with arrWords, some UTF-8 characters are not printed or handled correctly in hElemanlar.
How can I fix it?
Dec 23, 2015 at 5:57pm
The problem is that UTF-8 uses more than one byte for many characters.

The character ö is stored as two bytes (0xC3, 0xB6). Using your method you will split them so that they are stored as two elements in hElemanlar but that will not display correctly because it's no longer valid UTF-8. You need to keep them together as one element in hElemanlar.

https://en.wikipedia.org/wiki/UTF-8#Description
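
To make the split visible, here is a minimal sketch of what the byte-by-byte loop does to such a character. The names splitBytes and isContinuationByte are made up for illustration; they are not from the original code.

```cpp
#include <string>
#include <vector>

// Splits a string one byte at a time, like the push_back loop in the question.
std::vector<std::string> splitBytes(const std::string& s)
{
    std::vector<std::string> out;
    for (std::string::size_type j = 0; j < s.size(); ++j)
        out.push_back(std::string(1, s[j]));
    return out;
}

// True if the byte is a UTF-8 continuation byte (bit pattern 10xxxxxx).
// Such a byte is never valid on its own.
bool isContinuationByte(char c)
{
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}
```

Calling splitBytes("\xC3\xB6") (the two bytes of ö) returns two one-byte strings. The second one is a lone continuation byte, which is why it cannot be displayed.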
Last edited on Dec 23, 2015 at 5:59pm
Dec 23, 2015 at 6:05pm
Thank you for your response. Can you guide me on how to split such characters?
Last edited on Dec 23, 2015 at 6:06pm
Dec 23, 2015 at 6:36pm
Here's a couple ways:

#include <string>
#include <iostream>
#include <vector>
#include <locale>
#include <codecvt>
#include <clocale>
#include <cwchar>   // std::mbrlen, std::mbstate_t

int main()
{
    std::string arrWords[10] = {"ひらがな", "カタカナ"}; 

    // fully portable, using C++11 library
    // (note: std::wstring_convert and <codecvt> were deprecated in C++17)
    {
        std::vector<std::string> hElemanlar;
        
        std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> cvt;
        for(char32_t c: cvt.from_bytes(arrWords[0]))
            hElemanlar.push_back(cvt.to_bytes(c));
            
        std::cout << "Printing arrWords[0] from hElemanlar...\n";
        for(std::string& c: hElemanlar)
            std::cout << c << '\n';
    }


    // portable except to Windows, using the C library
    {
        std::vector<std::string> hElemanlar;
        
        std::setlocale(LC_ALL, "en_US.utf8"); // any utf-8 locale works
        std::mbstate_t mb{};
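        std::size_t len; // mbrlen returns std::size_t, not int
        for(const char* p = &arrWords[1][0], *end = p + arrWords[1].size(); p < end; p += len ) {
            len = std::mbrlen(p, end - p, &mb);
            // (size_t)-1: invalid sequence, (size_t)-2: incomplete sequence, 0: null character
            if(len == static_cast<std::size_t>(-1) || len == static_cast<std::size_t>(-2) || len == 0) break;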
            hElemanlar.emplace_back(p, p + len);
        }

        std::cout << "Printing arrWords[1] from hElemanlar...\n";
        for(std::string& c: hElemanlar)
            std::cout << c << '\n';
    }
}


this gives

Printing arrWords[0] from hElemanlar...
ひ
ら
が
な
Printing arrWords[1] from hElemanlar...
カ
タ
カ
ナ


demo: http://coliru.stacked-crooked.com/a/b7c6f11e42a43b62
Last edited on Dec 23, 2015 at 6:40pm
Dec 23, 2015 at 8:27pm
Seems like codecvt isn't supported by GCC. Thank you very much anyways.
Dec 23, 2015 at 8:41pm
It is supported by gcc as of version 5.0 (the demo link above uses gcc). Visual Studio and clang's libc++ have had it since about 2010.

If you have neither the C++11 library support nor a non-Windows C library, there are quite a few Unicode libraries to choose from.
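
If pulling in a library is too much, the split can also be done by hand from the lead byte of each sequence. This is only a rough sketch: utf8SequenceLength and splitUtf8 are names I made up, it does not check the continuation bytes or reject overlong encodings, and it splits per code point, so a combining accent ends up in its own element.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Length in bytes of the UTF-8 sequence starting with lead byte b,
// or 0 if b cannot start a sequence.
std::size_t utf8SequenceLength(unsigned char b)
{
    if (b < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                          // continuation byte or invalid lead byte
}

// Hypothetical helper: one std::string per code point.
// Stops at the first invalid lead byte or truncated sequence.
std::vector<std::string> splitUtf8(const std::string& s)
{
    std::vector<std::string> out;
    std::size_t i = 0;
    while (i < s.size()) {
        std::size_t len = utf8SequenceLength(static_cast<unsigned char>(s[i]));
        if (len == 0 || i + len > s.size()) break;
        out.push_back(s.substr(i, len));
        i += len;
    }
    return out;
}
```

It needs no locale and no codecvt, so it should behave the same regardless of the platform's library support, as long as the input really is UTF-8.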
Last edited on Dec 23, 2015 at 8:45pm
Dec 23, 2015 at 9:17pm
Thank you very much. I got it working.
Dec 26, 2015 at 11:38am
Very strange... I write C++ in cocos2d-x. When I test the code mentioned above, it works on the Samsung Galaxy S4 and Sony Xperia M4 Aqua, but so far it fails on the Samsung Galaxy S3 and Samsung Tablet SM-T113...