I'm trying to read a text file that contains Chinese characters (saved in Unicode format).
From there I want to convert each character into its hex equivalent, wrap each hex string in double angle brackets, and write the result to another text file.
For example:
Text file one contains:
您
好
世
界
Text file two should thus read:
<<60A8>>
<<597D>>
<<4E16>>
<<754C>>
#include <iostream>
#include <string>
#include <stdio.h>
#include <fstream>
#include <algorithm>
using namespace std;
int main ()
{
    FILE *pfile;
    pfile = fopen ("myfile.txt","w");
    std::wifstream file(L"New Text Document - Copy.txt") ;
    std::wstring line;
    while(getline(file, line))
    {
        for(wstring::size_type n = 0; n < line.size();++n) //start from n=2 to get rid of feff (endian identifier)
        {
            //cout<<hex<<line[n];
            fprintf(pfile,"<<");
            fprintf(pfile,"%X",line[n]);
            fprintf(pfile,">>");
        }
    }
    fclose (pfile);
    return 0;
}
The problem with this is that it outputs the following to a text file:
<<FE>><<FF>><<60>><<A8>><<0>><<D>><<0>><<59>><<7D>><<0>><<D>><<0>><<4E>><<16>><<0>><<D>><<0>><<75>><<4C>>
I know FE and FF denote the endianness, and I believe the 0's and D's are null characters and carriage returns, which I would like to eliminate at some point. My main concern, though, is that the hex value for each Chinese character has been split into two parts, i.e. <<60A8>> has become <<60>><<A8>>.
Is there a way I can get the hex values to be written to file as I want them to be?
You're trying to read a binary file as text (UTF-16 and UCS-2 are binary formats, even though they are used to represent text), which won't work: the stream hands you the file one byte at a time, which is why every character comes out split in two.
You'll have to open the file as binary, and:
1. Load the BOM into a 16-bit variable to determine the endianness.
2. Load the file into an array of 16-bit values. If the file's endianness doesn't match the native endianness, swap the bytes in each character (x=(x>>8)|(x<<8) [x has to be unsigned for this to work properly]).
3. The array is now a correct array of 16-bit code units, which for characters like the ones in your example are the same as the Unicode codepoints, and you may process it as you like.
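The three steps above could be sketched roughly as follows. This is only an illustration, not a definitive implementation: `decode_utf16` and `to_bracketed_hex` are hypothetical names, the input is assumed to be the raw bytes of the file already loaded into a `std::vector<unsigned char>`, and input without a BOM is simply rejected. Instead of loading 16-bit values and swapping them afterwards, the sketch assembles each value from two bytes in the order the BOM indicates, which gives the same result on any machine. Skipping carriage returns and line feeds is an assumption based on the output the original post asks for.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: turns raw UTF-16 bytes (starting with a BOM)
// into 16-bit code units, independent of the machine's endianness.
std::vector<std::uint16_t> decode_utf16(const std::vector<unsigned char>& bytes)
{
    std::vector<std::uint16_t> units;
    if (bytes.size() < 2)
        return units;

    // Step 1: the BOM. FE FF means big-endian, FF FE means little-endian.
    bool big_endian;
    if (bytes[0] == 0xFE && bytes[1] == 0xFF)
        big_endian = true;
    else if (bytes[0] == 0xFF && bytes[1] == 0xFE)
        big_endian = false;
    else
        return units; // assumption: input without a BOM is rejected

    // Step 2: combine each pair of bytes into one 16-bit value.
    // A trailing odd byte, if any, is ignored.
    for (std::size_t i = 2; i + 1 < bytes.size(); i += 2) {
        const unsigned hi = big_endian ? bytes[i] : bytes[i + 1];
        const unsigned lo = big_endian ? bytes[i + 1] : bytes[i];
        units.push_back(static_cast<std::uint16_t>((hi << 8) | lo));
    }
    return units;
}

// Hypothetical helper: step 3, formats each code unit as <<XXXX>> on
// its own line, skipping carriage returns and line feeds.
std::string to_bracketed_hex(const std::vector<std::uint16_t>& units)
{
    std::string out;
    for (std::uint16_t u : units) {
        if (u == 0x000D || u == 0x000A)
            continue;
        char buf[16];
        std::snprintf(buf, sizeof buf, "<<%04X>>\n", static_cast<unsigned>(u));
        out += buf;
    }
    return out;
}
```

Calling `to_bracketed_hex(decode_utf16(bytes))` on the bytes of the example file should give the four `<<60A8>>`-style lines from the original post, which can then be written out with `fputs` or an `std::ofstream`.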
Thanks for the quick reply. I've done a little bit of research on opening files as binary, but I'm still a little confused as to how it works. Would you mind expanding on this a little bit?
When you open a file as text, the runtime is free to perform any sort of transformation on the file contents, such as translating newlines to a coherent scheme. All implementations I know of limit themselves to doing just that, but AFAIK that's not all the standard allows them to do.
When you open a file as binary, the runtime will give you the actual byte values stored in the file. Here's a short example to get you started:
std::ifstream file(path,std::ios::binary);
//move the read cursor to the end
file.seekg(0,std::ios::end);
//get the file size
size_t n=file.tellg();
//reset to the beginning
file.seekg(0);
char *buffer=new char[n];
file.read(buffer,n);
file.close();
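The same idea can be wrapped in a small function so the buffer doesn't have to be freed by hand. The following is a sketch under a few assumptions: `read_all_bytes` is a hypothetical name, the file is small enough to fit in memory, and a file that can't be opened simply yields an empty vector. The bytes are stored as `unsigned char` so that values such as 0xFE and 0xFF compare as expected when you look for the BOM.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: reads a whole file in binary mode and returns
// its raw bytes. Returns an empty vector if the file can't be opened.
std::vector<unsigned char> read_all_bytes(const std::string& path)
{
    std::vector<unsigned char> bytes;
    std::ifstream file(path, std::ios::binary);
    if (!file)
        return bytes;

    // Move the read cursor to the end to find the file size,
    // then go back to the beginning.
    file.seekg(0, std::ios::end);
    const std::streamoff size = file.tellg();
    file.seekg(0, std::ios::beg);
    if (size <= 0)
        return bytes;

    bytes.resize(static_cast<std::size_t>(size));
    file.read(reinterpret_cast<char*>(bytes.data()),
              static_cast<std::streamsize>(size));

    // Keep only what was actually read, in case the read came up short.
    bytes.resize(static_cast<std::size_t>(file.gcount()));
    return bytes;
}
```

The returned vector holds the file exactly as stored on disk, BOM included, so the bytes can then be paired into 16-bit values as described in the steps earlier in the thread.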