Wifstream reading UTF-16 files

Hello,

I have created a small function that prints a file. It looks like this:

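#include <fstream>
#include <iostream>
#include <string>
using namespace std;

void PrintFile(wstring FileName){
	wstring line;
	// Opening a stream from a wide-character path is a VC++ extension
	wifstream readfile (FileName.c_str());

	// Test getline() itself so a failed read is never printed
	while(getline(readfile, line)){
		wcout << line << endl;
	}
}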


This works fine on normal text files; however, for some specific files the output has an extra blank space between every character. If a line contains "This is a line", the function prints "T h i s i s a l i n e".

I have looked at the files in a hex editor, and the files that cause this problem seem to be UTF-16 Unicode files; they all start with FF FE. However, googling "C++ How to read UTF-16 files" gives such a wide range of explanations that I end up more confused than confident about how to read these files. The BOM of the file is written out as "■", so it looks like the UTF-16 format is not recognised at all. If I save the file in another format such as UTF-8, the code works.
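For reference, FF FE is the byte order mark of little-endian UTF-16. In that encoding every ASCII character is followed by a 00 byte, and a stream that treats each byte as one character shows that byte as the extra blank. A minimal sketch of checking the first bytes of a file follows; the names DetectBom and Encoding are made up for illustration, and UTF-32 marks are deliberately not handled.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not from the thread: classify the leading bytes of a file.
enum Encoding { ENC_UNKNOWN, ENC_UTF8_BOM, ENC_UTF16_LE, ENC_UTF16_BE };

Encoding DetectBom(const std::string& bytes)
{
    // Compare as unsigned char so bytes >= 0x80 are not sign-extended.
    const unsigned char* p = reinterpret_cast<const unsigned char*>(bytes.data());
    const std::size_t n = bytes.size();

    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF) return ENC_UTF8_BOM;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) return ENC_UTF16_LE;
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) return ENC_UTF16_BE;
    return ENC_UNKNOWN; // no mark found; the encoding has to be known some other way
}
```

The bytes would need to come from a stream opened in binary mode, so that nothing is translated before the check.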

I have looked into codecvt_utf16, but unfortunately my project uses the vc90 toolset and it doesn't seem to be available there. Could someone please advise on how I should read these files properly?

Application will be Windows only, compiler VC++.

Thanks
closed account (o3hC5Di1)
Hi there,

According to this post:
http://stackoverflow.com/questions/10504044/correctly-reading-a-utf-16-text-file-into-a-string-without-external-libraries

UTF-16 files should be opened in binary mode; an example (by Cubbi, no less) is provided.

Hope that helps.

All the best,
NwN
Hi,

Thanks for your reply.
If I simply add ios::binary to the wifstream in the original code, there is no change in the output; there are still empty spaces between each character.

The solution from Cubbi in the link uses codecvt_utf16. I tried using this but I think this cannot be used with vc90 toolset? Please correct me if I'm mistaken on this point. If I try to use it right now I get the following compiler errors:
error C2061: syntax error : identifier 'codecvt_utf16'
error C2065: 'little_endian' : undeclared identifier
error C2059: syntax error : ')'

There was also another solution further down in that thread to read data into a stringstream and then convert it to a wstring. I tried to use that as well but there is no output at all when running it.
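One way the "read the bytes, then convert" idea can be made to work without codecvt_utf16 is to pair up the bytes by hand. This is only a sketch under stated assumptions: the helper names DecodeUtf16Le and ReadAllBytes are invented here, the input is assumed to be little-endian, and wchar_t is assumed to hold one UTF-16 code unit, as it does with VC++ on Windows.

```cpp
#include <cstddef>
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper: convert raw UTF-16LE bytes to a wstring.
// Surrogate pairs are copied through unchanged, which is only correct
// where wchar_t is a 16-bit UTF-16 code unit (the VC++ case).
std::wstring DecodeUtf16Le(const std::string& bytes)
{
    std::wstring out;
    std::size_t i = 0;

    // Skip the FF FE byte order mark if present.
    if (bytes.size() >= 2 &&
        static_cast<unsigned char>(bytes[0]) == 0xFF &&
        static_cast<unsigned char>(bytes[1]) == 0xFE)
        i = 2;

    // Combine each low/high byte pair into one code unit.
    // A trailing odd byte is ignored.
    for (; i + 1 < bytes.size(); i += 2) {
        const unsigned lo = static_cast<unsigned char>(bytes[i]);
        const unsigned hi = static_cast<unsigned char>(bytes[i + 1]);
        out.push_back(static_cast<wchar_t>(lo | (hi << 8)));
    }
    return out;
}

// Hypothetical helper: read a whole file as raw bytes (binary mode, no translation).
std::string ReadAllBytes(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    std::ostringstream buffer;
    buffer << in.rdbuf();
    return buffer.str();
}
```

Because the file is read in binary mode, each line of the decoded text still ends in "\r\n" on Windows, so the carriage return has to be stripped when the text is split into lines.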
"I think this cannot be used with vc90 toolset"

Correct, codecvt_utf16 is only listed for Visual Studio 2010 and 2012 on MSDN: http://msdn.microsoft.com/en-us/library/ee292208.aspx

As I mention in a comment in the linked post, if you're not looking for portability to non-Windows platforms, open it with _O_U16TEXT

To quote MSDN's http://msdn.microsoft.com/en-us/library/z0kc8e3z.aspx

MSDN wrote:
_O_U16TEXT
Open the file in Unicode UTF-16 mode.


See also the ever-popular blog post http://blogs.msdn.com/b/michkap/archive/2008/03/18/8306597.aspx
If I use functions like _wopen or _wsopen_s to open the file in _O_U16TEXT mode, is there an easy way to get each line of the file into a wstring like getline() does? I'm only finding ways to read the whole content of the file into a buffer.

I'm asking because the next step for the application would be to check each line for a specific word and, if it's present, print the line to the console.
I found the fgetws() function, which reads one line into a wchar_t array that I can then append to a wstring. Here is my code at the moment; any suggestions?

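#include <cstdio>
#include <iostream>
#include <string>
using namespace std;

// Reads one line of any length from File into Line.
// Returns false when nothing more could be read (end of file or error).
bool ReadOneLine(FILE *File, wstring &Line){

	wchar_t LineOfChars[512];
	Line.clear();

	// fgetws() returns NULL at end of file or on error; the buffer must not be used then
	while(fgetws(LineOfChars, 512, File) != NULL){
		Line.append(LineOfChars);
		// A line longer than the buffer arrives in pieces; stop once the newline is seen
		if(Line[Line.size() - 1] == L'\n')
			break;
	}

	return !Line.empty();
}

void PrintFile(const wstring &FileName){

	FILE *file = NULL;

	// _wfopen_s() returns a nonzero error code when the file cannot be opened
	if(_wfopen_s(&file, FileName.c_str(), L"r,ccs=UTF-16LE") != 0 || file == NULL){
		wcerr << L"Could not open " << FileName << endl;
		return;
	}

	wstring line;
	// Each line keeps the trailing newline that fgetws() stored
	while(ReadOneLine(file, line)){
		wcout << line;
	}

	fclose(file);
}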
closed account (o3hC5Di1)
Hi there,

If you have an array, perhaps you are able to loop through it and search for a word using the space characters as delimiters?

Alternatively, there are probably some libraries out there that let you use regex on C strings, or a way to convert a C string to a string object and use regex on that.

Hope that helps.

All the best,
NwN
I'm sorry, I can see that I was not very clear in my last post.

The code is printing the file without any spaces right now, thanks to the suggestions given here. Since I get each row into a wstring using ReadOneLine(), I can search the wstring for content using the same code I use for searching other files:

size_t found;
wstring FindString = L"My Search String";
found = line.find(FindString);

if(found != wstring::npos)
	wcout << "I found my string!";
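The same check can be wrapped in a function that works on any wide input stream, which makes it easy to try out without a file. This is a sketch only; the name FindMatchingLines is made up for illustration, and stripping the trailing carriage return is an assumption about input that was read without newline translation.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: collect every line of `input` that contains `needle`.
std::vector<std::wstring> FindMatchingLines(std::wistream& input, const std::wstring& needle)
{
    std::vector<std::wstring> matches;
    std::wstring line;

    while (std::getline(input, line)) {
        // Drop a trailing carriage return left over from Windows line endings.
        if (!line.empty() && line[line.size() - 1] == L'\r')
            line.erase(line.size() - 1);

        if (line.find(needle) != std::wstring::npos)
            matches.push_back(line);
    }
    return matches;
}
```

Passing a std::wistringstream built from already decoded text is enough to exercise it.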


So, from a functionality perspective I'm satisfied :).

However, since I haven't used _wfopen_s() or fgetws() before, I was mainly looking for suggestions on whether there are concepts I have overlooked or anything in this code that could cause problems later on. Basically, I'm checking whether the code is good to go.