I'm trying to read a text file that contains Chinese characters (saved in Unicode format).
From there I want to convert each character into its hex equivalent, wrap each hex string in double angle brackets, and write the result to another text file.
For example:
Text file one contains:
您
好
世
界
Text file two should thus read:
<<60A8>>
<<597D>>
<<4E16>>
<<754C>>
The problem with this is that it outputs the following to a text file:
<<FE>><<FF>><<60>><<A8>><<0>><<D>><<0>><<59>><<7D>><<0>><<D>><<0>><<4E>><<16>><<0>><<D>><<0>><<75>><<4C>>
I know FE and FF denote the endianness, and I believe the 0's and D's are null characters and carriage returns, which I would like to eliminate at some point, BUT my main concern is that the hex value for each Chinese character has been split into two parts, i.e. <<60A8>> has become <<60>><<A8>>.
Is there a way I can get the hex values to be written to file as I want them to be?
You're trying to read a binary file as text (UTF-16 and UCS-2 are binary formats, even though they are used to represent text), which won't work.
You'll have to open the file as binary, and:
1. Load the BOM into a 16-bit variable to determine the endianness.
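2. Load the file into an array of 16-bit values. If the file's endianness doesn't match the native endianness, swap the bytes in each value: x=(x>>8)|(x<<8) (x has to be an unsigned 16-bit type for this to work properly).
3. The array is now a correct array of UTF-16 code units, and you may process it as you like. For characters in the Basic Multilingual Plane, such as the four in your example, each 16-bit value is the Unicode code point itself; characters outside it are stored as surrogate pairs.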
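The steps above can be sketched roughly as follows. This is only a minimal illustration, not the one true way to do it: decode_utf16_units is a made-up name, and instead of loading 16-bit values and swapping them, it assembles each value from its two bytes, which gives the same result whatever the native endianness is. It assumes a BOM is present; if there is none, it falls back to big-endian, which is an arbitrary choice.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: turn the raw bytes of a UTF-16 file into 16-bit values.
// Assumption: FE FF at the start means big-endian, FF FE means little-endian;
// with no BOM at all, big-endian is assumed and no bytes are skipped.
std::vector<std::uint16_t> decode_utf16_units(const std::vector<unsigned char>& bytes)
{
    bool big_endian = true;
    std::size_t pos = 0;
    if (bytes.size() >= 2) {
        if (bytes[0] == 0xFE && bytes[1] == 0xFF) {
            big_endian = true;
            pos = 2; // skip the BOM
        } else if (bytes[0] == 0xFF && bytes[1] == 0xFE) {
            big_endian = false;
            pos = 2; // skip the BOM
        }
    }

    std::vector<std::uint16_t> units;
    // Combine each pair of bytes; a trailing odd byte is ignored.
    for (; pos + 1 < bytes.size(); pos += 2) {
        unsigned hi = big_endian ? bytes[pos] : bytes[pos + 1];
        unsigned lo = big_endian ? bytes[pos + 1] : bytes[pos];
        units.push_back(static_cast<std::uint16_t>((hi << 8) | lo));
    }
    return units;
}
```

Given the bytes FE FF 60 A8, this yields the single value 0x60A8 rather than 0x60 and 0xA8 separately.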
Thanks for the quick reply. I've done a little bit of research on opening files as binary, but I'm still a little confused as to how it works. Would you mind expanding on this a little bit?
When you open a file as text, the runtime is free to transform the file contents, for example by translating newlines to a single consistent scheme. All implementations I know of limit themselves to that, but AFAIK the standard allows them to do more.
When you open a file as binary, the runtime will give you the actual byte values stored in the file. Here's a short example to get you started:
std::ifstream file(path,std::ios::binary);
//move the read cursor to the end
file.seekg(0,std::ios::end);
//get the file size
size_t n=file.tellg();
//reset to the beginning
file.seekg(0);
char *buffer=new char[n];
file.read(buffer,n);
file.close();
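Once the bytes in the buffer have been combined into 16-bit values as described in the steps earlier, producing the <<60A8>> style output is mostly a formatting job. Here is a rough sketch; format_units is a hypothetical name, and it assumes the BOM has already been dropped and that carriage returns (000D) and line feeds (000A) should simply be skipped.

```cpp
#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: write each 16-bit value as <<XXXX>> on its own line.
// Assumption: CR (0x000D) and LF (0x000A) are line breaks from the input
// file and are not wanted in the output.
std::string format_units(const std::vector<std::uint16_t>& units)
{
    std::ostringstream out;
    out << std::hex << std::uppercase << std::setfill('0');
    for (std::uint16_t u : units) {
        if (u == 0x000D || u == 0x000A)
            continue;
        // setw only applies to the next item, so it is set for every value.
        out << "<<" << std::setw(4) << static_cast<unsigned>(u) << ">>\n";
    }
    return out.str();
}
```

The returned string can then be written to the second file with an ordinary std::ofstream, since it only contains plain ASCII characters.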