wstring could not print ISO8859-2 characters on Linux

I cannot get ISO 8859-2 characters written to a file or printed through wcout. I am using the following command line on Ubuntu:
g++ -g -std=c++11 -finput-charset="ISO8859-2" -fwide-exec-charset="UTF-8"

#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::wstring inttstchar = L"ČčŽž";
    std::wcout << inttstchar << std::endl;
    std::wofstream myfile("testwstring.txt");
    myfile << inttstchar << std::endl;
    myfile.close();

    return 0;
}
If I take out -finput-charset="ISO8859-2" -fwide-exec-charset="UTF-8" and use L"AJAX" instead, everything works fine. What is wrong with my procedure? How do I print the ISO 8859-2 character set on Ubuntu?
Maybe I have to add some further g++ options?


jlb (4973)

Have you tried to just use the narrow streams and narrow string?

This seems to work for me:

#include <iostream>
#include <fstream>
#include <string>


int main()
{
    std::string inttstchar = "ČčŽž";
    std::cout << inttstchar << std::endl;
    std::ofstream myfile("testwstring.txt");
    myfile << inttstchar << std::endl;

    return 0;
}

Output: ČčŽž

Pereubu (11)

That works for me as well. The point is that I am testing wstring, wcout and wofstream. I do not want to call setlocale; everything should be set through compiler options.
Put simply: make wstring work under Linux. At least to have something written in a file.

Cubbi (4774)

I am testing wstring, wcout and wofstream. I do not want to call setlocale; everything should be set through compiler options.

There are no compiler options that would imbue your wofstream instance with the codecvt facet from the cs_CZ.iso88592 (or whatever) locale.

The input-charset option only controls what encoding the compiler assumes the source file is saved in (how did you even save it in 8859-2?); the wide-exec-charset option decides what to put in wchar_t (UTF-8 is not a sensible choice). Neither option writes code for you.

make wstring work under Linux. At least to have something written in a file.

#include <locale>
#include <clocale>
#include <fstream>
#include <iostream>
#include <string>
int main ()
{
    std::wstring inttstchar = L"ČčŽž";
    std::wcout.imbue(std::locale("en_US.utf8")); // this isn't 1994 anymore, use Unicode
    std::setlocale(LC_ALL, "en_US.utf8"); // since we did not sync_with_stdio(false)
    std::wcout << inttstchar << '\n';

    std::wofstream myfile("testwstring.txt");
    myfile.imbue(std::locale("en_US.utf8")); // "cs_CZ.iso88592" if you must
    myfile << inttstchar << '\n';
}
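
If the named locale "en_US.utf8" is not installed on the machine, std::locale("en_US.utf8") throws. For file output only, a possible alternative is to imbue the stream with a UTF-8 conversion facet directly. This is a minimal sketch, assuming a C++11 standard library with <codecvt> (deprecated since C++17 but still shipped by gcc); the function name write_utf8 is made up for illustration.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: writes `text` plus a newline to `path` as UTF-8
// without depending on any named locale being installed.
bool write_utf8(const std::string& path, const std::wstring& text)
{
    std::wofstream out(path, std::ios::binary);
    // The new locale takes ownership of the facet, so no delete is needed.
    out.imbue(std::locale(std::locale::classic(),
                          new std::codecvt_utf8<wchar_t>));
    out << text << L'\n';
    // flush() forces the conversion to run, so the stream state reflects
    // whether the bytes really went out.
    return static_cast<bool>(out.flush());
}
```

Calling write_utf8("testwstring.txt", L"ČčŽž") should leave the file holding the UTF-8 bytes of the string followed by a newline. This does nothing for std::wcout, which still needs the imbue/setlocale calls shown above.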


Pereubu (11)

The ISO8859-2 charset covers the special characters of eastern European languages. I put the text in Notepad++ and save it as UTF-8. The reason why I stick with that charset is:

I need to call wcslen(wtext) to get the number of characters. I tried to pass that charset as the option
-fexec-charset="ISO8859-2", but it would not work. I realise that I have to use:

-finput-charset="UTF-8". The reason why I did not use imbue with my local charset (en_US.utf8 is my LC_ALL locale) is that if I try to mix cout with wcout, whichever of the two is used second does not work as expected, unless freopen is used to reset the output first.

I want to ask why you are using <clocale>. Is it because your wcout setting has to be global? Why is that?

You wrote: "UTF-8 is not a sensible choice". As a matter of fact, I am trying to port a Win32 application, which uses something like UTF-16 as its wide-exec-charset, to Linux. I know that Linux should have UTF-32. Is it possible to use UTF-16 on Linux for wide characters?

Cubbi (4774)

save it as UTF-8

That means you are not using ISO 8859-2. There is no reason to bring it up.

To make things clear, the character sequence "ČčŽž" is encoded as follows:
Unicode: U+010C U+010D U+017D U+017E
UTF-8: 0xC4 0x8C 0xC4 0x8D 0xC5 0xBD 0xC5 0xBE
ISO 8859-2: 0xC8 0xE8 0xAE 0xBE
Windows-1257: 0xC8 0xE8 0xDE 0xFE
IBM CP-775: 0xB6 0xD1 0xCF 0xD8
etc. There was a multitude of code pages before UTF-8. Unless you have a file that holds 0xC8 0xE8 0xAE 0xBE as opposed to any of the other possible representations of this string, ISO 8859-2 is as irrelevant as CP-775.
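
One way to check which of these representations a string or file really holds is to dump its bytes. A minimal sketch; hex_bytes is a name made up for this example, not a library function.

```cpp
#include <string>

// Hypothetical helper: renders every byte of `s` as two uppercase hex
// digits, separated by single spaces.
std::string hex_bytes(const std::string& s)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : s) {
        if (!out.empty()) out += ' ';
        out += digits[c >> 4];    // high nibble
        out += digits[c & 0x0F];  // low nibble
    }
    return out;
}
```

Applied to a narrow "ČčŽž" literal, or to the contents of the file read with std::ifstream, a result of C4 8C C4 8D C5 BD C5 BE means the data is UTF-8, while C8 E8 AE BE would mean it really is ISO 8859-2.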

I need to include wcslen(wtext) to get number of characters

wcslen(L"ČčŽž") is 4, regardless of locale, OS, or compiler settings.
If you're trying to apply it to text you've read from that UTF-8 file, your options are:
1. read as-is with std::ifstream, then
1.1 C-style: setlocale(LC_ALL, "en_US.utf8") and mbstowcs(NULL, s.c_str(), s.size()) (Not available on Windows)
1.2 C++11-style: wstring_convert with codecvt_utf8 to make a wstring, then just call its member size() (fully portable, Linux and Windows alike)
2. C++98-style: open with std::wifstream, imbue that with a utf8 locale, and read into a wstring. Then use member size() (Not available on Windows)
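
Option 1.2 could look roughly like the sketch below. It assumes the input is valid UTF-8 (from_bytes throws std::range_error otherwise) and a 32-bit wchar_t as on Linux, so that one wchar_t holds one code point; count_code_points is a made-up name. Note that <codecvt> and wstring_convert were later deprecated in C++17, although gcc still provides them.

```cpp
#include <codecvt>
#include <cstddef>
#include <locale>
#include <string>

// Hypothetical helper: converts UTF-8 bytes to a wstring and returns the
// number of wide characters in it.
std::size_t count_code_points(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.from_bytes(utf8).size();
}
```

For the eight UTF-8 bytes of "ČčŽž" this returns 4, the same number wcslen gives for the wide literal.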

if I try to mix cout with wcout, whichever of the two is used second does not work as expected, unless freopen is used to reset the output first.

Yes, that's an unfortunate problem of the C I/O: once you write a narrow or a wide character to stdout, it is locked in that mode until freopen'd. And std::cout/std::wcout both use C's stdout by default. So pick a mode and stick with it: outside Windows, I very strongly prefer UTF-8 (std::cout, not std::wcout). Windows doesn't support it, so there you're stuck with std::wcout or WinAPI. Some implementations make cout and wcout work simultaneously if you std::ios::sync_with_stdio(false), but not gcc.
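
The locking can be observed with the standard fwide function, which reports a stream's orientation (negative for narrow, positive for wide, zero for not yet set). The sketch below uses a temporary file instead of stdout so it does not disturb the console; the struct and function names are made up for illustration.

```cpp
#include <cstdio>
#include <cwchar>

// Orientation values reported by fwide at three points in time.
struct Orientation {
    int initial;               // before any I/O
    int after_narrow;          // after one narrow write
    int after_switch_attempt;  // after asking for wide mode
};

// Hypothetical probe: shows that a stream becomes byte-oriented on its
// first narrow write and that fwide cannot change it afterwards.
Orientation probe_orientation()
{
    Orientation r{0, 0, 0};
    std::FILE* f = std::tmpfile();
    if (!f) return r;                           // no temp file available
    r.initial = std::fwide(f, 0);               // query only
    std::fputs("narrow\n", f);                  // first write sets the mode
    r.after_narrow = std::fwide(f, 0);          // query only
    r.after_switch_attempt = std::fwide(f, 1);  // request wide: ignored
    std::fclose(f);
    return r;
}
```

The first value should be zero and the other two negative: once the orientation is set, only reopening the stream resets it, which is why freopen helps.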

why are you using <clocale>?

That's another problem caused by std::cout/std::wcout both using C's stdout by default. You have to call std::setlocale to make the C I/O layer used by std::wcout work (when using gcc, at least; some other implementations make it work somehow). You don't need this if you're only using std::cout (such as when doing UTF-8 I/O) or if you're only using files and not the console. In gcc, setlocale is not needed to use wcout if you std::ios::sync_with_stdio(false); (but you still need wcout.imbue).

I am trying to port a Win32 application, which uses something like UTF-16 as its wide-exec-charset, to Linux. I know that Linux should have UTF-32

Windows does not use UTF-16 for its execution character set. It uses what used to be called "UCS2" (it was removed from Unicode standard), which refers to the 16-bit subset of Unicode. Any UCS2 code is also valid Unicode with the same meaning. This means any character Windows can handle, Linux can do as well: it's porting in the opposite direction that's hard.

In short, there is no need to change gcc's defaults in your case. Don't change input-charset or wide-exec-charset.


Pereubu (11)

Cubbi,
I would like to thank you for your excellent response. It was of great help.
Excellent! jlb is also very knowledgeable and helpful. Thank you again.

