For Windows and Visual Studio 2010, this code reads and displays a UTF-8 encoded file. If you're using the MinGW version of GCC, you might have a problem, as I don't think it fully implements locales (unlike the Linux version).
#include <iostream>
#include <fstream>
#include <string>
#include <locale>
#include <codecvt>
#include <cstdio> // for _fileno
#include <io.h> // for _setmode
#include <fcntl.h> // for _O_U16TEXT
using namespace std;

void dump_file(const wstring& filePath) {
    // A Windows console will only display Unicode special characters if
    // the translation mode is set to UTF-16
    int oldMode = _setmode(_fileno(stdout), _O_U16TEXT);

    // open the file as Unicode, so we can read into wstrings
    wifstream ifs(filePath);

    // imbue the file with a codecvt_utf8 facet which knows how to
    // convert from UTF-8 to UCS-2 (the 2-byte subset of UTF-16)
    // Note this is available in Visual C++ 2010 and later
    locale utf8_locale(locale(), new codecvt_utf8<wchar_t>);
    ifs.imbue(utf8_locale);

    // Skip the BOM (this gets translated from the UTF-8 to the
    // UTF-16 version, so it will be a single character.)
    wchar_t bom = L'\0';
    ifs.get(bom);

    // Read the file contents and write to wcout
    wstring line;
    while (getline(ifs, line)) {
        wcout << line << endl;
    }

    // put the translation mode back to normal
    _setmode(_fileno(stdout), oldMode);
    cout << endl;
}

int main() {
    wstring filePath = L"limerick.txt";
    dump_file(filePath);
    return 0;
}
Where limerick.txt is a UTF-8 text file containing
En limerick skal være på fem linjer, hvor første,
andre og femte linje har samme enderim og består
av tre verseføtter. Tredje og fjerde er kortere
med to verseføtter, og de deler enderim.
(which is also displayed correctly by the console.)
Nice! It works perfectly fine reading from the file now. My only problem now is writing this to a new file :P When I try that, it stops writing to the file as soon as it hits the first letter of the kind 'æ', 'ø', 'å' etc....
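A sketch of the matching write path, under the same assumption that the codecvt_utf8 facet is available (VC++ 2010 and later, or a recent GCC): imbue the wofstream before anything is written, so wide characters such as 'æ', 'ø' and 'å' get converted back to UTF-8 instead of failing in the stream's default narrow conversion -- that failure is exactly why output stops at the first such letter. The function name write_utf8 is just a placeholder of mine.

```cpp
#include <fstream>
#include <string>
#include <locale>
#include <codecvt> // deprecated since C++17 but still shipped by MSVC and GCC

// Write a wide string out as UTF-8. The facet must be imbued before the
// first write; otherwise the stream keeps its default conversion, which
// sets the fail state on the first non-ASCII character and writes nothing more.
void write_utf8(const std::string& path, const std::wstring& text) {
    std::wofstream ofs(path);
    std::locale utf8_locale(std::locale(), new std::codecvt_utf8<wchar_t>);
    ofs.imbue(utf8_locale);
    ofs << text;
}
```

The same idea works for wcout-style streaming: anything you can << into a wofstream gets converted by the imbued facet on its way to disk.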
No, I think I want to read a UTF-16 file and then write it out as UTF-16. I want the program to be able to handle every character in the document, including ones like ❤.
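If the file really is little-endian UTF-16, one way (again assuming the codecvt facets from VC++ 2010 / modern GCC) is to imbue with codecvt_utf16 instead, using consume_header so the FF FE byte order mark is skipped automatically rather than by hand as in the earlier code. A minimal sketch; read_utf16le is my own name for it:

```cpp
#include <fstream>
#include <iterator>
#include <locale>
#include <codecvt>
#include <string>

// Read a little-endian UTF-16 file into a wstring. Binary mode stops the
// runtime mangling any 0x0A/0x0D bytes, and consume_header makes the
// facet swallow the FF FE byte order mark for us.
std::wstring read_utf16le(const std::string& path) {
    std::wifstream ifs(path, std::ios::binary);
    std::locale utf16_locale(std::locale(),
        new std::codecvt_utf16<wchar_t, 0x10ffff,
            std::codecvt_mode(std::little_endian | std::consume_header)>);
    ifs.imbue(utf16_locale);
    return std::wstring((std::istreambuf_iterator<wchar_t>(ifs)),
                        std::istreambuf_iterator<wchar_t>());
}
```

A character like ❤ (U+2764) sits in the Basic Multilingual Plane, so it fits in a single UTF-16 code unit and survives this round trip; characters outside the BMP need surrogate-pair handling on 2-byte wchar_t platforms.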
Not sure what type of document I have, how do I figure it out?
If my previous code worked, your file is UTF-8 -- the 'Ã˜' and 'Ã¥' you saw are the UTF-8 bytes for 'Ø' and 'å' being displayed as if they were extended ASCII.
If you're using Windows, which I presume you are, open the text file with Notepad and then do "Save As". The encoding the file is using will be displayed in the combo box at the bottom of the dialog.
Alternatively, open the text file with a hex viewer:
- a normal Windows text file (extended ASCII) will use one byte per character, including 'Ø' and 'å'
- a UTF-8 file will use one byte per normal character but two for (e.g.) 'Ø' and 'å', and should begin with the Byte Order Mark (in hex) EF BB BF
- a little-endian UTF-16 file will use two bytes per character and should begin with the Byte Order Mark (in hex) FF FE
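Those hex checks can also be automated by sniffing the first few bytes for a BOM. A small sketch (the function name detect_encoding and its return labels are my own); note a UTF-8 file saved without a BOM will fall into the "unknown" bucket, so treat the result as a hint:

```cpp
#include <fstream>
#include <string>

// Classify a file by its byte order mark: EF BB BF for UTF-8,
// FF FE for little-endian UTF-16, FE FF for big-endian UTF-16.
// Files with no BOM are reported as "ANSI or unknown".
std::string detect_encoding(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    unsigned char b[3] = {0, 0, 0};
    in.read(reinterpret_cast<char*>(b), 3);
    std::streamsize n = in.gcount();
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return "UTF-8";
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)
        return "UTF-16 LE";
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)
        return "UTF-16 BE";
    return "ANSI or unknown";
}
```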