reading Japanese text from UTF-8 text file

Nov 9, 2010 at 6:01am

Hi, I've been struggling with this problem.
I searched Google for a long time but still can't figure this out.
I have a UTF-8 text file with a Japanese phrase in it.
I tried to use wifstream to read the file into a wstring, but the string holds garbage instead of Japanese. Does anyone know how to do this?

I am using Win32 API.

"test.txt" contains: <Japanese>これは日本語の文です</Japanese>

wstring wreadinput(wifstream &file) // read all data from file (wchar_t)
{
	wstring str;
	wstring strtemp;
	wchar_t bos[3]; // byte order mark

	file.read(bos, 3); // take bos out
	while(!file.eof())
	{
		getline(file, strtemp);
		str.append(strtemp);
	}
	return str;
}


 ...(in some other function)...
   // read the text file
   wifstream file;
   file.open("test.txt"); 
   if(!file)
   {
      return FALSE;
   }
   wstring str_test;
   str_test = wreadinput(file);

str_test read was: <Japanese>ããã¯æ¥æ¬èªã®æã§ã</Japanese>

Is there something to do with locale?

Last edited on Nov 9, 2010 at 6:32am

Nov 9, 2010 at 6:23am

Disch (13742)

unfortunately, the standard libs are ignorant of Unicode, so this isn't as easy as it should be.

You'll have to manually decode the UTF-8 (or find a lib that does it). Details of UTF-8 are here: http://en.wikipedia.org/wiki/UTF-8#Description

Note that UTF-8 has 8-bit entries, so they're not wide. So reading a wstring isn't going to work (but it might work if the text file is UTF-16). Also, if you're reading a text file, you probably shouldn't be opening it as binary.

lastly -- even if you get it working, printing the string to the user isn't easy on some platforms (Windows console). wcout is effectively totally useless -- you can't just feed it Unicode strings like you might expect.

I don't know if you're outputting to the Windows console, but in the event you are, you might want to read this thread: http://www.cplusplus.com/forum/windows/9797/page3.html#msg46844
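To make the point about 8-bit units concrete, here is a minimal sketch of what likely happened in the original post. It assumes the wide stream, in its default locale on Windows, simply widened each byte to its own wchar_t without decoding anything. The function name WidenBytewise is made up for this illustration.

```cpp
#include <string>

// Widen every byte of a UTF-8 string to its own wchar_t.
// No UTF-8 decoding takes place, so multi-byte characters fall apart.
std::wstring WidenBytewise(const std::string& utf8)
{
    std::wstring out;
    for (unsigned char byte : utf8)
        out += static_cast<wchar_t>(byte);
    return out;
}
```

The first character of the test phrase, こ (U+3053), is stored in UTF-8 as the three bytes E3 81 93. Passing those bytes to this function gives three wide characters, U+00E3 (ã), U+0081 and U+0093, which matches the runs of "ã" in the garbage output above.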

Nov 9, 2010 at 6:31am

bluewind (8)

Thanks for the reply.
Yes, I shouldn't read it as binary; I forgot to take that out.
Anyway, I did notice that output to the console is difficult.
That's why I'm actually using the Win32 API.
I shall update my first post.

I read the UTF-8 description but I'm still not sure how to do the decoding in C++.
Is there an example?

Nov 9, 2010 at 7:15am

Disch (13742)

I was bored. Here you go.

Note it's a little long/complicated, but it also validates the string to make sure it's valid UTF-8, and it accounts for all edge cases I could think of. Nothing should trip it up -- should be very sturdy.

#include <limits>   // std::numeric_limits
#include <string>   // std::wstring

std::wstring FromUTF8(const char* str)
{
    const unsigned char* s = reinterpret_cast<const unsigned char*>(str);

    static const wchar_t badchar = '?';

    std::wstring ret;

    unsigned i = 0;
    while(s[i])
    {
        try
        {
            if(s[i] < 0x80)         // 00-7F: 1 byte codepoint
            {
                ret += s[i];
                ++i;
            }
            else if(s[i] < 0xC0)    // 80-BF: invalid for midstream
                throw 0;
            else if(s[i] < 0xE0)    // C0-DF: 2 byte codepoint
            {
                if((s[i+1] & 0xC0) != 0x80)		throw 1;

                ret +=  ((s[i  ] & 0x1F) << 6) |
                        ((s[i+1] & 0x3F));
                i += 2;
            }
            else if(s[i] < 0xF0)    // E0-EF: 3 byte codepoint
            {
                if((s[i+1] & 0xC0) != 0x80)		throw 1;
                if((s[i+2] & 0xC0) != 0x80)		throw 2;

                wchar_t ch = 
                        ((s[i  ] & 0x0F) << 12) |
                        ((s[i+1] & 0x3F) <<  6) |
                        ((s[i+2] & 0x3F));
                i += 3;

                // make sure it isn't a surrogate code point (D800-DFFF)
                if((ch & 0xF800) == 0xD800)
                    ch = badchar;

                ret += ch;
            }
            else if(s[i] < 0xF8)    // F0-F7: 4 byte codepoint
            {
                if((s[i+1] & 0xC0) != 0x80)		throw 1;
                if((s[i+2] & 0xC0) != 0x80)		throw 2;
                if((s[i+3] & 0xC0) != 0x80)		throw 3;

                unsigned long ch = 
                        ((s[i  ] & 0x07) << 18) |
                        ((s[i+1] & 0x3F) << 12) |
                        ((s[i+2] & 0x3F) <<  6) |
                        ((s[i+3] & 0x3F));
                i += 4;

                // make sure it isn't a surrogate code point (D800-DFFF)
                if((ch & 0xFFF800) == 0xD800)
                    ch = badchar;

                if(ch < 0x10000)	// overlong encoding -- but technically possible
                    ret += static_cast<wchar_t>(ch);
                else if(std::numeric_limits<wchar_t>::max() < 0x110000)
                {
                    // wchar_t is too small for 4 byte code point
                    //  encode as UTF-16 surrogate pair

                    ch -= 0x10000;
                    ret += static_cast<wchar_t>( (ch >> 10   ) | 0xD800 );
                    ret += static_cast<wchar_t>( (ch & 0x03FF) | 0xDC00 );
                }
                else
                    ret += static_cast<wchar_t>(ch);
            }
            else                    // F8-FF: invalid
                throw 0;
        }
        catch(int skip)
        {
            if(!skip)
            {
                do
                {
                    ++i;
                }while((s[i] & 0xC0) == 0x80);
            }
            else
                i += skip;
        }
    }

    return ret;
}
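For readers who find the bit masks hard to follow, here is the three-byte branch on its own as a small sketch. The helper name DecodeThreeBytes is made up for this illustration, and it assumes the caller has already validated the bytes the way FromUTF8 does.

```cpp
// Decode one three-byte UTF-8 sequence into its code point.
// Assumes b0 is in the range E0-EF and b1, b2 are continuation
// bytes of the form 10xxxxxx (no validation is done here).
unsigned long DecodeThreeBytes(unsigned char b0,
                               unsigned char b1,
                               unsigned char b2)
{
    return ((b0 & 0x0Ful) << 12) |   // 1110xxxx: top 4 bits
           ((b1 & 0x3Ful) <<  6) |   // 10xxxxxx: middle 6 bits
            (b2 & 0x3Ful);           // 10xxxxxx: low 6 bits
}
```

Worked by hand for こ, stored as E3 81 93: the first byte contributes 0x3 << 12 = 0x3000, the second 0x01 << 6 = 0x0040, and the third 0x13, which add up to U+3053.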

Usage:

string utf8;  // note it's a string, not a wstring

utf8 = ReadUTF8FromFile( yourfile );

wstring unicodestring = FromUTF8( utf8.c_str() );

// give unicodestring to WinAPI
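ReadUTF8FromFile in the usage snippet is only a placeholder and is not defined anywhere in this thread. One possible sketch is shown here, under the assumption that the whole file fits in memory and that a leading UTF-8 byte order mark (EF BB BF) should be dropped. The signature is a guess for illustration, not something from the post above.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: read a whole file into a std::string as plain
// chars, dropping a leading UTF-8 byte order mark if one is present.
// Returns an empty string if the file cannot be opened.
std::string ReadUTF8FromFile(const char* path)
{
    std::ifstream file(path);   // narrow stream: bytes are not widened
    if (!file)
        return std::string();

    std::string bytes((std::istreambuf_iterator<char>(file)),
                       std::istreambuf_iterator<char>());

    // The UTF-8 BOM is the three bytes EF BB BF.
    if (bytes.compare(0, 3, "\xEF\xBB\xBF") == 0)
        bytes.erase(0, 3);

    return bytes;
}
```

The important design choice is the narrow ifstream and std::string: the file contents stay as raw 8-bit units until FromUTF8 decodes them.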

EDIT: removed tabs. correcting casting error

EDIT 2: forgot about values F8 and up

Last edited on Nov 9, 2010 at 7:22am

Nov 9, 2010 at 7:48am

bluewind (8)

Oh my god! It works!
I don't understand half of that code, but the conversion was perfect.
It even works with Chinese.
Thank you Disch, you saved this guy in distress.
I had been searching the Internet for a whole day and wasn't able to find a solution as awesome as this.

Once again thanks.

Nov 9, 2010 at 12:12pm

coder777 (8448)

hello bluewind,

This library, http://utfcpp.sourceforge.net/, looks like a good fit for your purpose.

Nov 9, 2010 at 12:49pm

closed account (EzwRko23)

Oh, the almighty Boost doesn't have UTF-8 support? What a pity.

Nov 9, 2010 at 3:11pm

Disch (13742)

Oh, the almighty Boost doesn't have UTF-8 support? What a pity.

Maybe it does. I honestly didn't check.

Nov 9, 2010 at 3:31pm

moorecm (1932)

There might be a Boost.Unicode coming in the near future; I think it was being considered earlier this year.

Nov 9, 2010 at 5:18pm

bluewind (8)

@coder777
Yeah, I did check out that library, but I wasn't sure how to use it.
It only has UTF-8 to UTF-16 and UTF-32 conversion.

Nov 9, 2010 at 6:45pm

Disch (13742)

It only has UTF-8 to UTF-16 and UTF-32 conversion.

FWIW, all my function does is convert UTF-8 to UTF-16 (if wchar_t is 16-bits) or UTF-32 (if wchar_t is larger)
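As a rough illustration of the UTF-16 case, this sketch encodes a single code point the same way the surrogate-pair branch of FromUTF8 does. It uses the C++11 types char16_t and char32_t instead of wchar_t so the result does not depend on the platform's wchar_t size. The name EncodeUTF16 is made up here.

```cpp
#include <string>

// Encode one Unicode code point as UTF-16.
// Assumes cp is a valid scalar value: at most 0x10FFFF and not
// itself in the surrogate range D800-DFFF.
std::u16string EncodeUTF16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000)
    {
        // Fits in a single 16-bit unit.
        out += static_cast<char16_t>(cp);
    }
    else
    {
        // Split the remaining 20 bits across two surrogates.
        cp -= 0x10000;
        out += static_cast<char16_t>(0xD800 | (cp >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));  // low surrogate
    }
    return out;
}
```

Japanese text such as こ (U+3053) takes the first branch and stays a single unit. A code point above U+FFFF, for example U+1F600, becomes the pair D83D DE00.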

Nov 10, 2010 at 5:27am

bluewind (8)

Ha ha, I see.
I don't have experience with doing Unicode conversion.
Still, your code works fine, so I would like to stick with it.
