For Windows and Visual Studio 2010, this code reads and displays a UTF-8 encoded file. If you're using the MinGW version of GCC, you might have a problem, as I don't think it fully implements locales (unlike the Linux version).
#include <iostream>
#include <fstream>
#include <string>
#include <locale>
#include <codecvt>
#include <cstdio> // for _fileno
#include <io.h> // for _setmode
#include <fcntl.h> // for _O_U16TEXT
using namespace std;

void dump_file(const wstring& filePath) {
    // A Windows console will only display Unicode special characters if
    // the translation mode is set to UTF-16
    int oldMode = _setmode(_fileno(stdout), _O_U16TEXT);

    // open the file as Unicode, so we can read into wstrings
    wifstream ifs(filePath);

    // imbue the file with a codecvt_utf8 facet which knows how to
    // convert from UTF-8 to UCS-2 (the 2-byte subset of UTF-16)
    // Note this is available in Visual C++ 2010 and later
    locale utf8_locale(locale(), new codecvt_utf8<wchar_t>);
    ifs.imbue(utf8_locale);

    // Skip the BOM (this gets translated from the UTF-8 to the
    // UTF-16 version, so it will be a single character.)
    wchar_t bom = L'\0';
    ifs.get(bom);

    // Read the file contents and write to wcout
    wstring line;
    while (getline(ifs, line)) {
        wcout << line << endl;
    }

    // put the translation mode back to normal
    _setmode(_fileno(stdout), oldMode);
    cout << endl;
}

int main() {
    wstring filePath = L"limerick.txt";
    dump_file(filePath);
    return 0;
}
Where limerick.txt is a UTF-8 text file containing
En limerick skal være på fem linjer, hvor første,
andre og femte linje har samme enderim og består
av tre verseføtter. Tredje og fjerde er kortere
med to verseføtter, og de deler enderim.
(which is also displayed correctly by the console.)
Nice! It works perfectly fine reading from the file now. My only problem now is writing this to a new file :P When I try that, it stops writing to the file as soon as it hits the first letter of the kind 'æ', 'ø', 'å' etc....
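A sketch of the matching write path, under the same assumption that the codecvt_utf8 facet is available (VC++ 2010 and later, or a recent GCC): imbue the wofstream before anything is written, so wide characters such as 'æ', 'ø' and 'å' get converted back to UTF-8 instead of failing in the stream's default narrow conversion -- that failure is exactly why output stops at the first such letter. The function name write_utf8 is just a placeholder of mine.

```cpp
#include <fstream>
#include <string>
#include <locale>
#include <codecvt> // deprecated since C++17 but still shipped by MSVC and GCC

// Write a wide string out as UTF-8. The facet must be imbued before the
// first write; otherwise the stream keeps its default conversion, which
// sets the fail state on the first non-ASCII character and writes nothing more.
void write_utf8(const std::string& path, const std::wstring& text) {
    std::wofstream ofs(path);
    std::locale utf8_locale(std::locale(), new std::codecvt_utf8<wchar_t>);
    ofs.imbue(utf8_locale);
    ofs << text;
}
```

The same idea works for wcout-style streaming: anything you can << into a wofstream gets converted by the imbued facet on its way to disk.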
No, I think I want to read a UTF-16 file and then write it out as UTF-16. I want the program to be able to handle every character in the document, including ones like ❤.
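If the file really is little-endian UTF-16, one way (again assuming the codecvt facets from VC++ 2010 / modern GCC) is to imbue with codecvt_utf16 instead, using consume_header so the FF FE byte order mark is skipped automatically rather than by hand as in the earlier code. A minimal sketch; read_utf16le is my own name for it:

```cpp
#include <fstream>
#include <iterator>
#include <locale>
#include <codecvt>
#include <string>

// Read a little-endian UTF-16 file into a wstring. Binary mode stops the
// runtime mangling any 0x0A/0x0D bytes, and consume_header makes the
// facet swallow the FF FE byte order mark for us.
std::wstring read_utf16le(const std::string& path) {
    std::wifstream ifs(path, std::ios::binary);
    std::locale utf16_locale(std::locale(),
        new std::codecvt_utf16<wchar_t, 0x10ffff,
            std::codecvt_mode(std::little_endian | std::consume_header)>);
    ifs.imbue(utf16_locale);
    return std::wstring((std::istreambuf_iterator<wchar_t>(ifs)),
                        std::istreambuf_iterator<wchar_t>());
}
```

A character like ❤ (U+2764) sits in the Basic Multilingual Plane, so it fits in a single UTF-16 code unit and survives this round trip; characters outside the BMP need surrogate-pair handling on 2-byte wchar_t platforms.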
Not sure what type of document I have, how do I figure it out?
If my previous code worked, your file is UTF-8 -- the 'Ã˜' and 'Ã¥' you saw are the UTF-8 bytes for 'Ø' and 'å' being displayed as if they were extended ASCII.
If you're using Windows, which I presume you are, open the text file with Notepad and then do "Save As". The encoding the file is using will be displayed in the combo box at the bottom of the dialog.
Alternatively, open the text file with a hex viewer:
- a normal Windows text file (extended ASCII) will use one byte per character, including 'Ø' and 'å'
- a UTF-8 file will use one byte per normal character but two for (e.g.) 'Ø' and 'å', and should begin with the Byte Order Mark (in hex) EF BB BF
- a little-endian UTF-16 file will use two bytes per character and should begin with the Byte Order Mark (in hex) FF FE
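Those hex checks can also be automated by sniffing the first few bytes for a BOM. A small sketch (the function name detect_encoding and its return labels are my own); note a UTF-8 file saved without a BOM will fall into the "unknown" bucket, so treat the result as a hint:

```cpp
#include <fstream>
#include <string>

// Classify a file by its byte order mark: EF BB BF for UTF-8,
// FF FE for little-endian UTF-16, FE FF for big-endian UTF-16.
// Files with no BOM are reported as "ANSI or unknown".
std::string detect_encoding(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    unsigned char b[3] = {0, 0, 0};
    in.read(reinterpret_cast<char*>(b), 3);
    std::streamsize n = in.gcount();
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return "UTF-8";
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)
        return "UTF-16 LE";
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)
        return "UTF-16 BE";
    return "ANSI or unknown";
}
```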