Read Unicode file and convert to hex

Nov 22, 2011 at 2:14pm
Hi,

I'm trying to read a text file that contains Chinese characters (saved in Unicode format).

From there I want to convert each character into its hex equivalent, wrap each hex string in double angle brackets, and write it to another text file.

For example:
Text file one contains:
您
好
世
界

Text file two should thus read:
<<60A8>>
<<597D>>
<<4E16>>
<<754C>>

Here's the code I've got so far:
#include <iostream>
#include <string>
#include <stdio.h>
#include <fstream>
#include <algorithm>


using namespace std;

int main ()
{

	FILE *pfile;
	pfile = fopen ("myfile.txt","w");
	std::wifstream file(L"New Text Document - Copy.txt") ;
	std::wstring line;
	while(getline(file, line))
	{
		for(wstring::size_type n = 0; n < line.size();++n) //start from n=2 to get rid of feff (endian identifier)
		{
			//cout<<hex<<line[n];
			fprintf(pfile,"<<");
			fprintf(pfile,"%X",line[n]);
			fprintf(pfile,">>");
		}
	}
	fclose (pfile);

	return 0;
}


The problem with this is that it outputs the following to a text file:
<<FE>><<FF>><<60>><<A8>><<0>><<D>><<0>><<59>><<7D>><<0>><<D>><<0>><<4E>><<16>><<0>><<D>><<0>><<75>><<4C>>

I know FE and FF denote the endianness, and I believe the 0's and D's are null characters and carriage returns, which I would like to eliminate at some point. BUT my main concern is that the hex value for each Chinese character has been split into two parts, i.e. <<60A8>> has become <<60>><<A8>>.
Is there a way I can get the hex values to be written to file as I want them to be?

Thanks in advance,

AD
Nov 22, 2011 at 2:27pm
You're trying to read a binary file as text (UTF-16 and UCS-2 are binary formats, even if they are used to represent text), which will never work, of course.
You'll have to open the file as binary, and:
1. Load the BOM into a 16-bit variable to determine the endianness.
2. Load the file into an array of 16-bit values. If the file's endianness doesn't match the native endianness, swap the bytes in each 16-bit value: x=(x>>8)|(x<<8), where x has to be an unsigned 16-bit type for this to work properly.
3. The array is now a correct array of UTF-16 code units, which for characters in the Basic Multilingual Plane (such as the ones in your example) are the Unicode codepoints themselves, and you may process it as you like.
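The three steps above could be sketched roughly as follows. This is only an illustration under some assumptions: the function name utf16_bytes_to_hex_lines is made up for this example, the input is expected to start with a BOM, carriage returns and line feeds are skipped, and each value is printed as four hex digits. Characters outside the Basic Multilingual Plane would show up as two surrogate values rather than one codepoint.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper (not from the thread): turns the raw bytes of a
// UTF-16 file that starts with a BOM into lines such as "<<60A8>>".
std::string utf16_bytes_to_hex_lines(const std::vector<unsigned char>& bytes)
{
    // Step 2 (first half): pack byte pairs into 16-bit values, first byte in
    // the low half. This mimics what reading the file straight into a 16-bit
    // array would produce on a little-endian machine.
    std::vector<std::uint16_t> units;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2)
        units.push_back(static_cast<std::uint16_t>(bytes[i] | (bytes[i + 1] << 8)));

    std::string out;
    if (units.empty())
        return out;

    // Step 1: inspect the BOM. 0xFEFF means the packing order matches the
    // file; 0xFFFE means every value has its two bytes the wrong way round.
    bool swap_needed = false;
    if (units[0] == 0xFFFE)
        swap_needed = true;
    else if (units[0] != 0xFEFF)
        return out; // assumption: without a BOM this sketch simply gives up

    // Step 2 (second half) and step 3: swap if required, then process.
    for (std::size_t i = 1; i < units.size(); ++i) {
        std::uint16_t x = units[i];
        if (swap_needed)
            x = static_cast<std::uint16_t>((x >> 8) | (x << 8));
        if (x == 0x000D || x == 0x000A)
            continue; // skip carriage returns and line feeds
        char text[16];
        std::snprintf(text, sizeof text, "<<%04X>>\n", static_cast<unsigned>(x));
        out += text;
    }
    return out;
}
```

Given the bytes FE FF 60 A8 00 0D 00 0A 59 7D, this should return <<60A8>> and <<597D>> on separate lines; the resulting string can then be written to the output file in one go.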
Nov 22, 2011 at 3:11pm
Hi helios, sorry for my lack of knowledge, but I'm not able to follow you.
Last edited on Nov 22, 2011 at 3:45pm
Nov 22, 2011 at 3:33pm
Hi helios,

Thanks for the quick reply. I've done a little bit of research on opening files as binary, but I'm still a little confused as to how it works. Would you mind expanding on this a little bit?

Sorry, I'm fairly new to C++.

Thanks again,

AD
Nov 22, 2011 at 4:55pm
When you open a file as text, the runtime is free to transform the file contents, for example by translating newlines to a single convention. All implementations I know of limit themselves to that, but AFAIK the standard allows more.
When you open a file as binary, the runtime will give you the actual byte values stored in the file. Here's a short example to get you started:
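std::ifstream file(path, std::ios::binary);
// move the read cursor to the end
file.seekg(0, std::ios::end);
// get the file size (the cursor position at the end of the file)
size_t n = static_cast<size_t>(file.tellg());
// reset to the beginning
file.seekg(0);
// buffer receives the raw bytes; release it with delete[] when done
char *buffer = new char[n];
file.read(buffer, n);
file.close();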
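A possible variation on the snippet above, for anyone who would rather not manage the buffer with new and delete[] by hand. It is only a sketch: the function name read_all_bytes is made up for this example, and returning an empty vector on any failure is an assumption about how errors should be reported.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper (the name is not from the thread): returns every byte
// of the file exactly as stored, or an empty vector if it cannot be read.
std::vector<unsigned char> read_all_bytes(const std::string& path)
{
    // std::ios::binary stops the runtime from translating newlines.
    std::ifstream file(path, std::ios::binary);
    if (!file)
        return {};

    // Same size trick as above: seek to the end and ask for the position.
    file.seekg(0, std::ios::end);
    const std::streamoff size = file.tellg();
    file.seekg(0, std::ios::beg);
    if (size <= 0)
        return {};

    // The vector owns the memory, so nothing has to be freed manually.
    std::vector<unsigned char> bytes(static_cast<std::size_t>(size));
    file.read(reinterpret_cast<char*>(bytes.data()), size);
    if (!file)
        return {};
    return bytes;
}
```

Because the file is opened in binary mode, bytes such as 0D and 0A come back untouched, so the BOM and the 16-bit values can be examined exactly as they are stored on disk.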
Nov 22, 2011 at 5:17pm
Thanks helios
Nov 23, 2011 at 3:09pm
Thanks for that helios. I'll give it a go and then try implementing it into my code.
Here's to hoping I can get it working!