Problem with wstring output

Hi,
So, I have a text file in UTF-8 containing Cyrillic text. I want to copy it to another file and print it to the console. Copying is OK, but the output on the screen contains all wrong symbols. Here's the code:

#include <fstream>
#include <iostream>
#include <string>
using namespace std;

int main() {
    wifstream infile("test.txt");
    wofstream outfile("out.txt");
    wstring in_string;
    while (infile >> in_string) {   // extract until the read fails (EOF)
        outfile << in_string << " ";
        wcout << in_string << endl;
    }
    return 0;
}



Help!
C++ has no interfaces for reading UTF-8.
Well, by default the console supports ASCII only. There is something called a code page, though. The only way I know to change it is Windows-only (http://msdn.microsoft.com/en-us/library/ms686013(VS.85).aspx).
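For example (a minimal sketch, assuming Windows; the console font also has to contain the Cyrillic glyphs):

#include <windows.h>
#include <cstdio>

int main() {
    SetConsoleOutputCP(CP_UTF8);   // switch console output to code page 65001 (UTF-8)
    // raw UTF-8 bytes can then be printed as-is:
    std::printf("\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82\n"); // "Привет"
    return 0;
}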
So, first I have to convert the UTF-8 to Unicode, right?
Take a look at libiconv, which can convert between many encodings and code pages.
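A minimal sketch of the calls involved (assuming GNU iconv, where "WCHAR_T" is an accepted encoding name; on some platforms iconv takes const char** for the input):

#include <iconv.h>
#include <cstring>

int main() {
    iconv_t cd = iconv_open("WCHAR_T", "UTF-8");   // arguments are (to, from)
    if (cd == (iconv_t)-1) return 1;

    char utf8[] = "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82"; // "Привет"
    wchar_t wide[32];
    char *in = utf8, *out = (char*)wide;
    size_t inleft = std::strlen(utf8), outleft = sizeof wide;
    iconv(cd, &in, &inleft, &out, &outleft);       // advances the pointers as it converts
    iconv_close(cd);
    return 0;
}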
Thanks!
Yes, convert it to Unicode. Each Unicode character corresponds to each wchar_t in the wstring.
Each Unicode character corresponds to each wchar_t in the wstring.


Almost. wchar_t isn't wide enough to hold codepoints above U+FFFF, so in that case you'd need two wchar_ts. Though granted, those codepoints are very rarely used.
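For example, assuming a 16-bit wchar_t, the split into a surrogate pair is just a bit of arithmetic (a sketch; U+10400 is an arbitrary example codepoint):

unsigned long cp = 0x10400;                    // some codepoint above U+FFFF
cp -= 0x10000;                                 // reduce to a 20-bit value
wchar_t hi = wchar_t((cp >> 10)   | 0xD800);   // high surrogate -> 0xD801
wchar_t lo = wchar_t((cp & 0x3FF) | 0xDC00);   // low surrogate  -> 0xDC00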

From what I've seen, WinAPI treats wchar_t as UTF-16. The C++ standard library, however, is ambivalent / clueless as to the existence of Unicode. I remember trying to help someone else output Unicode to the console, and it was a huge pain. Ultimately, the easiest way to do it was with WinAPI calls (i.e., not using cout/wcout at all).
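For reference, the WinAPI route looks roughly like this (a sketch, assuming Windows, where wchar_t strings are UTF-16):

#include <windows.h>
#include <string>

int main() {
    std::wstring text = L"\x041F\x0440\x0438\x0432\x0435\x0442\n"; // "Привет"
    DWORD written;
    // WriteConsoleW takes UTF-16 directly, bypassing the locale machinery
    WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE),
                  text.c_str(), (DWORD)text.size(), &written, NULL);
    return 0;
}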


EDIT:

Also -- reading UTF-8 isn't difficult. It's not something you need a whole library for.

I can throw together a function for you when I get home from work, but my break time is almost up so I don't have time now.
Almost. wchar_t isn't wide enough to hold codepoints above U+FFFF
Sometimes it is.

From what I've seen... WinAPI treats wchar_t as UTF-16.
UCS-2, to be more accurate.
Sometimes it is.


True. I was assuming Windows.

UCS-2, to be more accurate.


No, actually, UTF-16. Surrogate pairs and everything show up properly in WinAPI calls (SetWindowText, OPENFILENAME, etc.).
Is that so...? Huh. I had no idea.
Here. Note the following:

- it doesn't look for a null terminator; it just reads until EOF or some other file error
- it has minimal error checking
- I didn't actually test it
- it assumes wchar_t is 16 bits wide
- it decodes to UTF-16 surrogate pairs for codepoints above U+FFFF (4-byte codes)

Here's the code:

wstring ReadUTF8(istream& in)
{
    static const wchar_t badchar = 0xFFFD;  // U+FFFD or '?' is typical
                                            // I prefer U+FFFD because it's the
                                            // designated replacement character

    wstring ret;
    unsigned char c;
    wchar_t w;
    unsigned long ww;

    while(true)
    {
        c = in.get();
        if(c < 0x80)        // 1 byte
            w = c;
        else if(c < 0xC0)   // continuation -- invalid as a first byte
            w = badchar;
        else if(c < 0xE0)   // 2 bytes
        {
            w  = (       c & 0x1F) << 6;
            w |= (in.get() & 0x3F);
        }
        else if(c < 0xF0)   // 3 bytes
        {
            w  = (       c & 0x0F) << 12;
            w |= (in.get() & 0x3F) <<  6;
            w |= (in.get() & 0x3F);
        }
        else if(c < 0xF8)   // 4 bytes
        {
            ww = (       c & 0x07) << 18;
            ww|= (in.get() & 0x3F) << 12;
            ww|= (in.get() & 0x3F) <<  6;
            ww|= (in.get() & 0x3F);

            ww -= 0x10000;
            ww &= 0xFFFFF;
            ret.push_back( (ww >> 10) | 0xD800 );       // not ideal to push here...but meh
            w = (ww & 0x03FF) | 0xDC00;
        }
        else                // invalid
            w = badchar;

        // EOF?  Other kind of error?
        if(!in.good())
            break;

        ret.push_back(w);
    }

    return ret;
}
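Hypothetical usage (binary mode keeps the stream from translating the raw UTF-8 bytes):

#include <fstream>

int main() {
    std::ifstream file("test.txt", std::ios::binary);
    std::wstring text = ReadUTF8(file);   // the function above
    return 0;
}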
I can do better.
#include <cwchar>
#include <string>

#define BOM8A ((uchar)0xEF)
#define BOM8B ((uchar)0xBB)
#define BOM8C ((uchar)0xBF)

typedef unsigned char uchar;
typedef unsigned long ulong;

void UTF8_WC(wchar_t *dst,const uchar *src,ulong srcl){
    for (ulong a=0;a<srcl;a++){
        uchar byte=*src++;
        wchar_t c=0;
        if (!(byte&0x80))            // 1 byte (ASCII)
            c=byte;
        else if ((byte&0xC0)==0x80)  // continuation byte: its lead byte already
            continue;                // consumed it below, so skip it here
        else if ((byte&0xE0)==0xC0){ // 2 bytes
            c=byte&0x1F;
            c<<=6;
            c|=*src&0x3F;
        }else if ((byte&0xF0)==0xE0){ // 3 bytes
            c=byte&0x0F;
            c<<=6;
            c|=*src&0x3F;
            c<<=6;
            c|=src[1]&0x3F;
        }else if ((byte&0xF8)==0xF0){ // 4 bytes
#if WCHAR_MAX==0xFFFF    //<-- I don't have much trust in this directive. Neither should you.
            c='?';                   // won't fit in a 16-bit wchar_t
#else
            c=byte&0x07;
            c<<=6;
            c|=*src&0x3F;
            c<<=6;
            c|=src[1]&0x3F;
            c<<=6;
            c|=src[2]&0x3F;
#endif
        }
        *dst++=c;
    }
}

std::wstring UniFromUTF8(const std::string &str){
    ulong start=0;
    if (str.size()>=3 && (uchar)str[0]==BOM8A && (uchar)str[1]==BOM8B && (uchar)str[2]==BOM8C)
        start+=3;
    const uchar *str2=(const uchar *)&str[0]+start;
    ulong size=0;
    for (ulong a=start,end=str.size();a<end;a++,str2++)
        if (*str2<0x80 || (*str2&0xC0)==0xC0)
            size++;
    std::wstring res;
    res.resize(size);
    str2=(const uchar *)&str[0]+start;
    UTF8_WC(&res[0],str2,str.size()-start);
    return res;
}
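Hypothetical usage of the pair above (the std::string just has to hold the file's raw bytes):

#include <fstream>
#include <iterator>

int main() {
    std::ifstream file("test.txt", std::ios::binary);
    std::string raw((std::istreambuf_iterator<char>(file)),
                     std::istreambuf_iterator<char>());
    std::wstring text = UniFromUTF8(raw);   // the function above
    return 0;
}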
yeah....

well....


I posted mine first, so I win.

(_|_)
No, I posted mine first, so I win. ;-]

http://www.cplusplus.com/forum/beginner/7233/#msg33495

C++ makes clean code easy. :-)


I've previously posted elsewhere, including the link I just gave, that wchar_t tends to be 32 bits on modern compilers. I was mistaken.

It tends to be 32 bits on POSIX systems.
It tends to be 16 bits on Windows systems (which uses UTF-16 internally).
It can be as small as 8 bits.

Moral: make sure your compiler uses the proper size for wchar_t.
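A quick sanity check (trivial, but it settles the question for your compiler):

#include <climits>
#include <iostream>

int main() {
    std::cout << "wchar_t is " << sizeof(wchar_t) * CHAR_BIT << " bits\n";
    return 0;
}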

Hope this helps.
So you want to play that game, eh?
http://www.cplusplus.com/forum/general/7142/#msg33079
LOL. helios wins! :-)

(I like my design better, though.)
Nice, helios! I'll bookmark that thread in case I need it later. Thanks!
I like mine better, but I will steal your loops.
Contrary to what I was expecting, this version and the last one are equally fast (tested with 50 million characters).
void UTF8_WC(wchar_t *dst,const uchar *src,ulong srcl){
    for (ulong a=0;a<srcl;a++){
        uchar byte=*src++;
        wchar_t c=0;
        if (!(byte&0x80))       // 1 byte (ASCII)
            c=byte;
        else{
            // count the leading 1 bits to get the sequence length
            ulong size=0,
                mask=0x80;
            c=byte;
            for (;c&mask;mask>>=1)
                size++;
            size--;             // continuation bytes left to read
            c&=mask-1;          // keep the lead byte's payload bits
#if WCHAR_MAX==0xFFFF    //<-- I don't have much trust in this directive. Neither should you.
            if (size>2){        // a 4-byte sequence won't fit in 16 bits
                c='?';
                for (;size;size--,a++)
                    src++;      // still step over the continuation bytes
            }
#endif
            for (;size;size--,a++){ // fold in the continuation bytes
                c<<=6;
                c|=*src++&0x3F;
            }
        }
        *dst++=c;
    }
}

I don't know which one looks nicer. What do you think?
A n00b like me can't say..