Special characters using fstream

Jul 24, 2013 at 7:39am
I'm having some problems with special Norwegian letters. The following code works:

#include <iostream>
#include <locale.h>
#include <fstream>
#include <string>

using namespace std;

int main(){
    setlocale(LC_ALL, "norwegian");
    cout << "æøå" << endl;
    return 0;
}


but when I try to read from a file using fstream, 'Ø' turns into 'Ã˜', 'å' turns into 'Ã¥', and so on.

How do I fix this?
Last edited on Jul 24, 2013 at 7:40am
Jul 24, 2013 at 8:06am
If in has type std::ifstream then try to use


in.imbue( std::locale() );

before reading all other data.

Jul 24, 2013 at 8:24am
Thank you for the response, but it did not work.
Jul 24, 2013 at 8:37am
Use

in.imbue( std::locale( "norwegian" ) );
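
One thing worth checking here: locale names are platform specific. "norwegian" is a name the Microsoft runtime understands, whereas glibc would expect something like "nb_NO.UTF-8", and the std::locale constructor throws std::runtime_error for a name it does not know. A minimal sketch for testing a name before imbuing (locale_exists is a made-up helper name, not a standard function):

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns true if the C++ runtime accepts this locale name.
bool locale_exists(const std::string& name) {
    try {
        std::locale loc(name.c_str());  // throws std::runtime_error if unknown
        (void)loc;                      // only constructed to test the name
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

Even with a valid name, a plain locale only helps if its character encoding matches the file's encoding, which is probably why imbuing alone may not fix the output.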
Jul 24, 2013 at 8:53am
That did not work either. Does it work for you? If so, can you give me some example code?
Jul 24, 2013 at 10:07am
What o/s are you using? And what compiler?

Also, I assume your file is UTF-8 encoded...

For Windows and Visual Studio 2010, this code reads and displays a UTF-8 encoded file. If you're using the MinGW version of GCC, you might have a problem, as I don't think it fully implements locales (unlike the Linux version).

Andy

#include <iostream>
#include <fstream>
#include <string>
#include <locale>
#include <codecvt>

#include <cstdio>  // for _fileno
#include <io.h>    // for _setmode
#include <fcntl.h> // for _O_U16TEXT

using namespace std;

void dump_file(const wstring& filePath) {
	// A Windows console will only display Unicode special characters if
	// the translation mode is set to UTF-16
	int oldMode = _setmode(_fileno(stdout), _O_U16TEXT);

	// open the file as Unicode, so we can read into wstrings
	wifstream ifs(filePath);

	// imbue the file with a codecvt_utf8 facet which knows how to
	// convert from UTF-8 to UCS2 (the 2-byte part of UTF-16)
	// Note this is available in Visual C++ 2010 and later
	locale utf8_locale(locale(), new codecvt_utf8<wchar_t>);
	ifs.imbue(utf8_locale); 

	// Skip the BOM if there is one (it gets translated from the UTF-8
	// to the UTF-16 version, so it is a single character, U+FEFF)
	wchar_t bom = L'\0';
	if (ifs.peek() == 0xFEFF)
		ifs.get(bom);

	// Read the file contents and write to wcout
	wstring line;
	while(getline(ifs, line)) {
		wcout << line << endl;
	}

	// put the translation mode back to normal
	_setmode(_fileno(stdout), oldMode);

	cout << endl;
}

int main() {
	wstring filePath = L"limerick.txt";
	dump_file(filePath);
	return 0;
}


Where limerick.txt is a UTF-8 text file containing

En limerick skal være på fem linjer, hvor første,
andre og femte linje har samme enderim og består
av tre verseføtter. Tredje og fjerde er kortere
med to verseføtter, og de deler enderim.

(which is also displayed correctly by the console.)
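
If <codecvt> is not available (it was also deprecated in C++17), one possible alternative is to read the file as raw bytes with a plain std::ifstream and decode the UTF-8 yourself. The following is only a sketch under that assumption; decode_utf8 is a hypothetical helper, and it does not reject overlong encodings:

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 bytes into Unicode code points (one char32_t each).
// Sketch only: malformed or truncated sequences become U+FFFD, and
// overlong encodings are not rejected.
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }  // stray byte

        if (i + extra >= bytes.size()) {  // sequence cut off at end of input
            out.push_back(0xFFFD);
            break;
        }
        bool ok = true;
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) { ok = false; break; }  // not 10xxxxxx
            cp = (cp << 6) | (b & 0x3F);
        }
        if (ok) { out.push_back(cp); i += extra + 1; }
        else    { out.push_back(0xFFFD); ++i; }
    }
    return out;
}
```

A leading BOM, if present, comes out as the first code point (U+FEFF) and can be dropped by the caller.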
Last edited on Jul 24, 2013 at 10:11am
Jul 24, 2013 at 10:50am
Nice! Reading from the file works perfectly now. My only problem now is writing this to a new file :P When I try that, it stops writing to the file as soon as it hits the first letter like 'æ', 'ø', 'å', etc.
Jul 24, 2013 at 10:52am
How are you trying to write to the file?

(Posting minimal but complete code would prob be most helpful here.)

Andy
Last edited on Jul 24, 2013 at 10:53am
Jul 24, 2013 at 10:54am
Never mind, I forgot to write
input.get(bom);
on the second place in my code. Silly me :)

Thanks for all help!
Jul 24, 2013 at 11:26am
:-)

I take it your o/p file ended up UTF-8, as you hoped??

Andy
Last edited on Jul 24, 2013 at 11:26am
Jul 24, 2013 at 2:00pm
Yes, but now I discovered a new bug. It won't read signs like "–", like it did before. What happened?

This is my code so far

#include <iostream>
#include <fstream>
#include <string>
#include <locale>
#include <codecvt>

#include <cstdio>
#include <io.h>
#include <fcntl.h>

using namespace std;

int main(){
	int oldMode = _setmode(_fileno(stdout), _O_U16TEXT);

	wifstream input(L"LaTeXHeader.txt");
	wofstream output(L"messagesConverted.txt");

	locale utf8_locale(locale(), new codecvt_utf8<wchar_t>);
	input.imbue(utf8_locale); 

	wchar_t bom = L'\0';
	input.get(bom);

	wstring line;
	
	while (!input.eof()){
		getline(input, line);
		output << line << endl;
	}
	input.close();	
	_setmode(_fileno(stdout), oldMode);
	input.open(L"messages.txt");

	input.get(bom);
	
	while (!input.eof()){
		getline(input, line);
		output << line << endl;
	}
	input.close();
	output.close();
	_setmode(_fileno(stdout), oldMode);
	
	return 0;
}
Last edited on Jul 24, 2013 at 2:11pm
Jul 24, 2013 at 3:23pm
Seems like I found a solution, without knowing what it was :P
Jul 24, 2013 at 3:37pm
I'll have a look a bit later...

But what are you hoping to do? Read a UTF-8 file in and then write it out as UTF-16 ??

Or what?

Andy
Last edited on Jul 24, 2013 at 3:38pm
Jul 24, 2013 at 3:57pm
No, I think I want to read a UTF-16 file and then write it out as UTF-16. I want the program to be able to handle all signs in the document, including signs like ❤.

Not sure what type of document I have, how do I figure it out?
Jul 24, 2013 at 4:09pm
If my previous code worked, your file is UTF-8. The 'Ã˜' and 'Ã¥' are what the two-byte UTF-8 encodings of 'Ø' and 'å' look like when each byte is displayed as a separate single-byte character.
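
To see where those pairs come from, here is a small sketch that encodes a single code point as UTF-8 (encode_utf8 is a hypothetical helper written for this post, not a standard function). 'Ø' is U+00D8, which becomes the bytes C3 98; in the Windows-1252 code page C3 is 'Ã' and 98 is '˜'. Likewise 'å' (U+00E5) becomes C3 A5, and A5 is '¥'.

```cpp
#include <string>

// Encode one Unicode code point as UTF-8 bytes.
// Assumes cp is a valid scalar value (<= 0x10FFFF, not a surrogate).
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx followed by three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```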

If you're using Windows, which I presume you are, open the text file with Notepad and then do "Save As". The encoding the file is using will be displayed in the combobox at the bottom of the dialog.

Alternatively, open the text file with a hex viewer:
- a normal Windows text file (extended ASCII) will use one byte per character, including 'Ø' and 'å'
- a UTF-8 file will use one byte per ASCII character but two for (e.g.) 'Ø' and 'å', and may begin with the Byte Order Mark (in hex) EF BB BF (the BOM is optional in UTF-8, so not every tool writes it)
- a little-endian UTF-16 file will use two bytes per character (four for characters outside the Basic Multilingual Plane) and should begin with the Byte Order Mark (in hex) FF FE
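
The BOM checks in the list above can also be done in code. This is a rough sketch; the Bom enum and detect_bom are names invented for illustration, and the caller is assumed to have read the first few bytes of the file in binary mode:

```cpp
#include <cstddef>
#include <string>

enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Inspect the first bytes of a file and report which BOM, if any, is present.
// Caveat: a UTF-32 little-endian BOM (FF FE 00 00) also starts with FF FE
// and would be reported as Utf16LE by this simple check.
Bom detect_bom(const std::string& head) {
    auto byte = [&head](std::size_t i) {
        return static_cast<unsigned char>(head[i]);
    };
    if (head.size() >= 3 && byte(0) == 0xEF && byte(1) == 0xBB && byte(2) == 0xBF)
        return Bom::Utf8;
    if (head.size() >= 2 && byte(0) == 0xFF && byte(1) == 0xFE)
        return Bom::Utf16LE;
    if (head.size() >= 2 && byte(0) == 0xFE && byte(1) == 0xFF)
        return Bom::Utf16BE;
    return Bom::None;  // no BOM: could be ANSI, or UTF-8 saved without a BOM
}
```

A result of Bom::None does not prove the file is ANSI, since UTF-8 files are often written without a BOM.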

Andy
Last edited on Jul 24, 2013 at 4:09pm
Jul 24, 2013 at 5:45pm
Right, my file is UTF-8. Is it then impossible to get signs like "❤" then?
Jul 24, 2013 at 7:45pm
Is it then impossible to get signs like "❤" then?

No, UTF-8 can encode them; UTF-16 can deal with them, too.
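
As a rough illustration of the UTF-16 side (encode_utf16 is a hypothetical helper, not a standard function): '❤' is U+2764, which fits in a single 16-bit unit, while characters above U+FFFF need a surrogate pair. That matters for the earlier code because, with a 16-bit wchar_t, codecvt_utf8<wchar_t> only covers the single-unit (UCS2) range; for characters beyond it, codecvt_utf8_utf16 from the same header would be the facet to look at.

```cpp
#include <string>

// Encode one Unicode code point as UTF-16 code units.
// Assumes cp is a valid scalar value (<= 0x10FFFF, not a surrogate).
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one 16-bit unit
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: split the 20 remaining bits across a
        // high surrogate and a low surrogate
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
    return out;
}
```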

Andy
Jul 24, 2013 at 8:12pm
Repaired version #1 -- which reads and writes UTF-8

The repair was to imbue the output, as well as the input, and write the BOM to the file.

Andy

#include <iostream>
#include <fstream>
#include <string>
#include <locale>
#include <codecvt>

#include <cstdio>
#include <io.h>
#include <fcntl.h>

using namespace std;

int main(){
	int oldMode = _setmode(_fileno(stdout), _O_U16TEXT);

	wifstream input(L"LaTeXHeader.txt");
	wofstream output(L"messagesConverted.txt");

	locale utf8_locale(locale(), new codecvt_utf8<wchar_t>);
	input.imbue(utf8_locale); 
	output.imbue(utf8_locale); // Also imbue output

	wchar_t bom = L'\0';
	input.get(bom);

	output << L'\xFEFF'; // write BOM

	wstring line;
	
	while (!input.eof()){
		getline(input, line);
		output << line << endl;
	}
	input.close();

	// don't reset mode till later
	//_setmode(_fileno(stdout), oldMode);

	input.open(L"messages.txt");
	input.get(bom);
	
	while (!input.eof()){
		getline(input, line);
		output << line << endl;
	}

	input.close();
	output.close();

	_setmode(_fileno(stdout), oldMode);

	return 0;
}
Last edited on Jul 24, 2013 at 8:12pm
Jul 24, 2013 at 8:28pm
And this version reads and writes Unicode (little-endian UTF-16)

Andy

#include <iostream>
#include <fstream>
#include <string>
//#include <locale>
//#include <codecvt>

#include <cstdio>
#include <io.h>
#include <fcntl.h>

using namespace std;

int main(){
	int oldMode = _setmode(_fileno(stdout), _O_U16TEXT);

	FILE* fp_in  = _wfopen(L"LaTeXHeader.txt", L"r,ccs=UNICODE");
	FILE* fp_out = _wfopen(L"messagesConverted.txt", L"w,ccs=UNICODE");

	wifstream input_1(fp_in);
	wofstream output(fp_out);

	// For UTF-16, don't imbue
	//locale utf8_locale(locale(), new codecvt_utf8<wchar_t>);
	//input.imbue(utf8_locale); 
	//output.imbue(utf8_locale); // Also imbue output

	// BOM handled automatically
	//wchar_t bom = L'\0';
	//input_1.get(bom);

	// BOM handled automatically
	//output << L'\xFEFF'; // write BOM

	wstring line;
	
	while (!input_1.eof()){
		getline(input_1, line);
		output << line << endl;
	}
	//input.close();
	fclose(fp_in);

	// don't reset mode till later
	//_setmode(_fileno(stdout), oldMode);

	//input.open(L"messages.txt");
	fp_in  = _wfopen(L"messages.txt", L"r,ccs=UNICODE");
	wifstream input_2(fp_in);
	// BOM handled automatically
	//input_2.get(bom);
	
	while (!input_2.eof()){
		getline(input_2, line);
		output << line << endl;
	}

	//input.close();
	fclose(fp_in);
	output.close();

	_setmode(_fileno(stdout), oldMode);

	return 0;
}
Jul 24, 2013 at 8:31pm
PS This

	wstring line;

	while (!input.eof()){
		getline(input, line);
		output << line << endl;
	}


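is better written as the following, because eof() only becomes true after a read has already failed, so the first form can write one extra (empty) line at the end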

	wstring line;
	
	while (getline(input, line)) {
		output << line << endl;
	}
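
The difference between the two loop forms can be seen with a small sketch. It uses narrow std::istringstream instead of the wide file streams above purely to keep it short, and the two function names are made up for this example. Each function counts how many times its loop body runs:

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Loop controlled by eof(): the body also runs for the final, failed read.
std::size_t count_with_eof_loop(const std::string& text) {
    std::istringstream in(text);
    std::string line;
    std::size_t n = 0;
    while (!in.eof()) {
        std::getline(in, line);  // result is not checked
        ++n;
    }
    return n;
}

// Loop controlled by getline(): the body runs only when a line was read.
std::size_t count_with_getline_loop(const std::string& text) {
    std::istringstream in(text);
    std::string line;
    std::size_t n = 0;
    while (std::getline(in, line)) {
        ++n;
    }
    return n;
}
```

For the input "a\nb\n" the eof() form runs three times and the getline() form twice, which is where the extra empty line in the output file comes from when the input ends with a newline.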


Andy