Reading unicode characters

Hi, I'm trying to read a Unicode (UTF-8) file one character at a time, but I don't know how to read just one character.
So can you tell me the easiest way to read a single character?

EDIT: When I said character, I meant a letter.
Read a wchar_t.

You can also use std::wstring for a unicode version of std::string.
(Proof: http://www.cplusplus.com/reference/string/)

Also, for wide character string literals, do this:
std::wstring MyUnicodeString = L"This is in unicode!"; <-- Note the L prefix before the opening quote
LB, that won't read a UTF-8 character, as UTF-8 is a multibyte encoding.
If your *platform* has UTF-8 support, there is nothing special to do; just open your UTF-8 file as a wide character stream:

#include <fstream>
#include <iostream>
#include <locale>
int main()
{
    std::locale::global(std::locale("")); // activate user-preferred locale, in my case en_US.utf8
    std::wifstream wf("test.txt"); // test.txt contains utf-8 text
    for(wchar_t c; wf.get(c); )
        std::wcout << "Processed character " << c << '\n';
}
tested with GNU gcc 4.6.2 on linux

If your compiler has sufficient C++11 support, you can use the locale-independent Unicode facilities (not yet supported by GCC, but supported by Clang and Visual Studio):

#include <fstream>
#include <codecvt>
#include <locale> // std::wbuffer_convert is declared here
#include <iostream>
#include <clocale>
#ifdef _WIN32
#include <fcntl.h>
#include <io.h>
#endif
int main()
{
    std::ifstream f("test.txt");
    std::wbuffer_convert<std::codecvt_utf8<wchar_t>> conv(f.rdbuf());
    std::wistream wf(&conv);

#ifdef _WIN32
    _setmode(_fileno(stdout), _O_WTEXT);
#else
    std::setlocale(LC_ALL, "");
#endif 
    for(wchar_t c; wf.get(c); )
        std::wcout << "Processed character " << c << '\n';
}

(tested with clang++3.0 on linux and visual studio 2010 sp1 on windows 7)


Both examples were tested on a file that contains the bytes 7a c3 9f e6 b0 b4 f0 9d 84 8b, which are the UTF-8 encoding of the four characters zß水𝄋. (Windows's version of codecvt_utf8, as usual, fails miserably at the 𝄋, which is not representable in UCS-2.)
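
To see why the last character is the troublesome one, here is a small sketch that checks whether a code point fits in a single 16-bit unit and, if not, splits it into a UTF-16 surrogate pair. The four test characters decode to U+007A, U+00DF, U+6C34 and U+1D10B; only the last lies above U+FFFF. The function names fits_in_ucs2 and to_utf16 are made up for this illustration, and a C++11 compiler is assumed.

```cpp
#include <cassert>
#include <vector>

// True if the code point fits in one 16-bit unit (the UCS-2 range).
inline bool fits_in_ucs2(char32_t cp) { return cp <= 0xFFFF; }

// Encodes one code point as UTF-16 code units (hypothetical helper).
// Code points above U+FFFF become a high/low surrogate pair.
inline std::vector<char16_t> to_utf16(char32_t cp)
{
    assert(cp <= 0x10FFFF); // largest valid Unicode code point
    if (fits_in_ucs2(cp))
        return { static_cast<char16_t>(cp) };
    const char32_t v = cp - 0x10000; // 20 bits left to split
    return { static_cast<char16_t>(0xD800 + (v >> 10)),     // high surrogate
             static_cast<char16_t>(0xDC00 + (v & 0x3FF)) }; // low surrogate
}
```

For U+1D10B this yields the two units D834 DD0B, so on a platform where wchar_t is 16 bits wide, a loop that reads one wchar_t at a time cannot hand you 𝄋 as a single value.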