Hello!
This is my code:
|
#include <vector>
#include <iostream>
#include <fstream>
#include <locale>
#include <string>
#include <conio.h>
#include <Windows.h>
int main() {
    SetConsoleOutputCP(CP_UTF8);
    SetConsoleCP(CP_UTF8); // checked in cmd's properties after executing and the codepage is being correctly changed into UTF8 (65001)
    std::wifstream wif;
    wif.open("utf8.txt");
    std::wstring wstr;
    std::locale loc(""); // Polish
    wif.imbue(loc);
    // Read three whitespace-separated words from the file and echo each one.
    wif >> wstr;
    std::wcout << wstr << " ";
    wif >> wstr;
    std::wcout << wstr << " ";
    wif >> wstr;
    std::wcin.imbue(loc);
    std::wcout.imbue(loc);
    std::wcout << wstr << " ";
    // Read a line from the console and echo it back.
    getline(std::wcin, wstr);
    std::wcout << wstr << " ";
    std::wcin >> wstr;
}
|
However, the program behaves strangely (the content of the file is also present on the screenshot):
http://www.bankfotek.pl/image/2093166.jpeg
After typing in the same content through wcin, this is what happens:
http://www.bankfotek.pl/image/2093167.jpeg
Could someone explain why the read from the file stops after "zażó", and then mysteriously "eats" the first letter after using wcout?
By extension, how do I properly use UTF-8 when reading from a file?
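One way to read UTF-8 from a file without relying on any locale is to read the raw bytes and decode them by hand. The sketch below is only an illustration under that assumption: `read_file_bytes` and `decode_utf8` are hypothetical helper names, not part of the original program, and the decoder is deliberately minimal (it replaces malformed input with U+FFFD and does not reject overlong encodings).

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: read a whole file as raw bytes
// (returns an empty string if the file cannot be opened).
std::string read_file_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Hypothetical helper: decode UTF-8 bytes into code points, skipping a leading BOM.
// Minimal sketch: malformed or truncated sequences become U+FFFD;
// overlong encodings and surrogates are not rejected.
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    if (bytes.compare(0, 3, "\xEF\xBB\xBF") == 0) i = 3;  // skip the byte order mark
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;  // number of continuation bytes expected
        char32_t cp = 0;
        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }  // invalid lead byte

        bool ok = (i + extra < bytes.size());  // false if the sequence is truncated
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;   // not a continuation byte
            else cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok) { out.push_back(0xFFFD); ++i; continue; }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Calling `decode_utf8(read_file_bytes("utf8.txt"))` would then give one element per character regardless of the console code page. Printing the result to a Windows console is a separate step that this sketch does not cover.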
----------------------------EDIT------------
I tried changing this portion:
|
wif >> wstr;
std::wcout << wstr << " ";
wif >> wstr;
std::wcout << wstr << " ";
wif >> wstr;
|
into
getline(wif, wstr);
and it did correctly read the entire line; however, the problem with the first letter disappearing after using wcin persists.
Moreover, even when wcin is imbued with the proper locale, every character with a diacritic in the input string is converted to a blank space. So if I type
zażółć gęślą jaźń
and then use wcout to echo the input back, it reads:
za g l ja
Another interesting find is that deleting this portion
wif.imbue(loc)
results in the same "halfway-stopped" behaviour. I thought that setting the console codepage should suffice, so why do I need to imbue at all? Am I doing something wrong?
----------------------------EDIT------------
Again, I tried to investigate the issue. I built this loop:
|
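wchar_t ch;
unsigned i = 0;
// eof() is tested before each read, so the body also runs for the final, failed get().
while (wif.eof()==false) {
    wif.get(ch);
    std::wcout << ch;
    ++i;
}
i; // no-op statement; i holds the iteration count here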
|
After the loop, the variable i was at 30, not the expected 18 (the number of letters and whitespace characters, plus \0 at the end). What's wrong here? I suppose wchar_t is 2 bytes long and UTF-8 is 4 on this machine, and thus it treats 2 code points as separate chars? This doesn't explain the discrepancy though: 18*2 = 36.
And yes, although the program read new chars 30 times, it stopped showing them on the console after "zażó".
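One accounting that would produce exactly 30 is sketched below. It rests on two assumptions that the post does not confirm: that the file begins with a UTF-8 byte order mark (3 bytes), and that the stream hands over one wchar_t per byte. The sample text has 17 characters but occupies 26 bytes in UTF-8, the BOM would add 3, and a loop that tests eof() before reading runs once more after the last byte: 26 + 3 + 1 = 30. The names `kSample`, `kBom`, `count_code_points`, `iterations_with_eof_test` and `iterations_with_get_test` are made up for this illustration, and a narrow `std::istringstream` stands in for the file.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// The sample text "zażółć gęślą jaźń" spelled out as UTF-8 bytes, so the
// example does not depend on the encoding of this source file.
const std::string kSample =
    "za" "\xC5\xBC\xC3\xB3\xC5\x82\xC4\x87"   // ż ó ł ć
    " g" "\xC4\x99\xC5\x9B" "l" "\xC4\x85"    // ę ś l ą
    " ja" "\xC5\xBA\xC5\x84";                 // ź ń

// UTF-8 byte order mark (assumed to be present in the file).
const std::string kBom = "\xEF\xBB\xBF";

// Count code points: every byte that is not a continuation byte (10xxxxxx) starts one.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Mirrors the loop from the post: eof() is tested before the read,
// so the body also runs for the final get() that fails.
std::size_t iterations_with_eof_test(const std::string& bytes) {
    std::istringstream in(bytes);
    std::size_t i = 0;
    char ch;
    while (in.eof() == false) {
        in.get(ch);
        ++i;
    }
    return i;
}

// Variant that tests the read itself, so only successful reads are counted.
std::size_t iterations_with_get_test(const std::string& bytes) {
    std::istringstream in(bytes);
    std::size_t i = 0;
    char ch;
    while (in.get(ch)) ++i;
    return i;
}
```

Under these assumptions, `iterations_with_eof_test(kBom + kSample)` gives 30 while `iterations_with_get_test(kBom + kSample)` gives 29, so testing the result of `get` rather than `eof()` removes the extra iteration.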
----------------------------EDIT------------
With this code
|
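SetConsoleOutputCP(CP_UTF8);
SetConsoleCP(CP_UTF8);
std::locale loc("");
std::wifstream wif;
wif.imbue(loc); std::wcout.imbue(loc); std::wcin.imbue(loc);
wif.open("utf8.txt"); // same file as in the first listing
wchar_t ch;
// Echo the file one character at a time.
while (wif.eof()==false) {
    wif.get(ch);
    std::wcout << ch;
}
// Echo console input one character at a time (operator>> skips whitespace).
while (std::wcin >> ch) {
    std::wcout << ch;
}
std::wcout << std::endl;
std::wcin.ignore(); // pause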
|
I was able to properly read from a UTF-8-encoded file, but somewhere between processing user console input in wcin and showing it in wcout, the program fails to deal with diacritics again.
So "zażółć gęślą jaźń" becomes "za g l ja".
I'm also hesitant to use things like getline because I don't know where to imbue them. If I imbue std::wcin and then call getline like this
getline(std::wcin, var);
will the result be properly encoded?
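On the getline question: std::getline takes no locale of its own and reads through whatever stream it is given, so any conversion comes from the locale imbued on that stream. The sketch below illustrates this with a file stream rather than std::wcin, because console behaviour depends on the platform. It assumes `std::codecvt_utf8` from `<codecvt>` is available (it is deprecated since C++17 but still widely shipped), and `read_first_line_utf8` is a hypothetical helper name.

```cpp
#include <codecvt>   // std::codecvt_utf8 (deprecated since C++17)
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: read the first line of a UTF-8 file through an imbued wide stream.
std::wstring read_first_line_utf8(const std::string& path) {
    std::wifstream in;
    // Imbue before opening so the conversion facet is in place for the very first read.
    // std::consume_header asks the facet to skip a leading BOM, if there is one.
    in.imbue(std::locale(std::locale::classic(),
                         new std::codecvt_utf8<wchar_t, 0x10FFFF, std::consume_header>));
    in.open(path);
    std::wstring line;
    std::getline(in, line);  // no locale argument: getline uses the stream's locale
    return line;
}
```

Whether the same holds for std::wcin on a Windows console is a separate matter, since the bytes delivered by the console under a given code page are outside the stream's control.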
Sincerely.