Reading unicode characters

Jan 7, 2012 at 2:42am

Hi, I'm trying to read a Unicode (UTF-8) file one character at a time, but I don't know how to read just one character.
Can you tell me the easiest way to read a single character?

EDIT: When I said character, I meant a letter.

Last edited on Jan 7, 2012 at 11:51am

Jan 7, 2012 at 4:57am

LB (13399)

Read a wchar_t.

You can also use std::wstring for a unicode version of std::string.
(Proof: http://www.cplusplus.com/reference/string/)

Also, for wide string literals, do this:
std::wstring MyUnicodeString = L"This is in unicode!"; <-- Note the L prefix before the opening quote

Last edited on Jan 7, 2012 at 5:43pm

Jan 7, 2012 at 7:40am

kbw (9488)

LB, that won't read a UTF-8 character, as UTF-8 is a multibyte encoding: a single character can occupy anywhere from one to four bytes.
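To illustrate what "multibyte" means here, this is a minimal sketch of how the first byte of a UTF-8 sequence tells you how many bytes the character occupies. The function name utf8_sequence_length is made up for this example and is not part of any library. It only looks at the bit pattern of the lead byte, so it does not reject lead bytes that can never occur in valid UTF-8 (0xC0, 0xC1, 0xF5 and above).

```cpp
#include <cstddef>

// Hypothetical helper (not a standard function): returns the number of bytes
// in the UTF-8 sequence that starts with `lead`, or 0 if `lead` cannot start
// a sequence (for example, because it is a continuation byte).
std::size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80)           return 1; // 0xxxxxxx: plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;                            // 10xxxxxx or otherwise invalid
}
```

Reading a single char or a single wchar_t from a byte stream therefore only works for the one-byte (ASCII) case; anything else needs a conversion step.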

Jan 7, 2012 at 3:06pm

Cubbi (4774)

If your *platform* has UTF-8 support, there is nothing special to do; just open your UTF-8 file as a wide character stream:

#include <fstream>
#include <iostream>
#include <locale>
int main()
{
    std::locale::global(std::locale("")); // activate user-preferred locale, in my case en_US.utf8
    std::wifstream wf("test.txt"); // test.txt contains utf-8 text
    for(wchar_t c; wf.get(c); )
        std::wcout << "Processed character " << c << '\n';
}


(tested with GCC 4.6.2 on Linux)

If your compiler has sufficient C++11 support, you can use the locale-independent Unicode facilities instead (not yet supported by GCC, but supported by Clang and Visual Studio):

#include <fstream>
#include <codecvt>
#include <iostream>
#include <locale>   // std::wbuffer_convert
#include <clocale>
#ifdef _WIN32
#include <fcntl.h>
#include <io.h>
#endif
int main()
{
    std::ifstream f("test.txt");
    std::wbuffer_convert<std::codecvt_utf8<wchar_t>> conv(f.rdbuf());
    std::wistream wf(&conv);

#ifdef _WIN32
    _setmode(_fileno(stdout), _O_WTEXT);
#else
    std::setlocale(LC_ALL, "");
#endif 
    for(wchar_t c; wf.get(c); )
        std::wcout << "Processed character " << c << '\n';
}


(tested with clang++ 3.0 on Linux and Visual Studio 2010 SP1 on Windows 7)

Both examples were tested on a file that contains the bytes 7a c3 9f e6 b0 b4 f0 9d 84 8b, which are the UTF-8 encoding of the four characters zß水𝄋. (Windows's version of codecvt_utf8<wchar_t>, as usual, fails miserably at the 𝄋, U+1D10B: wchar_t is 16 bits wide there, and that character is not representable in UCS-2.)
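If neither the locale approach nor the codecvt facets are available, the same ten test bytes can be decoded by hand. The following is only a sketch under stated assumptions: decode_utf8 is a hypothetical helper, not a standard function; it needs C++11 for char32_t; and it throws on truncated or malformed sequences but does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: decode a UTF-8 byte string into Unicode code points.
std::vector<char32_t> decode_utf8(const std::string& bytes)
{
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < bytes.size(); ) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len;   // total bytes in this sequence
        char32_t cp;       // payload bits taken from the lead byte
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + len > bytes.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte (10xxxxxx) contributes six more bits.
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char cont = static_cast<unsigned char>(bytes[i + k]);
            if ((cont & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (cont & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Fed the test bytes as "\x7a\xc3\x9f\xe6\xb0\xb4\xf0\x9d\x84\x8b", this should yield the four code points U+007A, U+00DF, U+6C34 and U+1D10B. Since char32_t holds any code point, the last one does not run into the 16-bit wchar_t limit.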

Last edited on Jan 7, 2012 at 3:10pm
