I'm trying to write a program that rewrites a Chinese dictionary into a form I can use with another program. It needs to parse the data character by character and act accordingly. If I do this:
#include <fstream>
#include <string>
#include <iostream>
int main () {
wchar_t c;
std::wfstream ufile;
ufile.open ("/initrd/mnt/dev_save/Storage/Downloads/cedict_ts.u8");
std::ofstream ofile;
ofile.open ("/initrd/mnt/dev_save/Storage/Downloads/cedictxml.u8");
c = ufile.get();
ofile << c;
return 0;
}
It just adds "-1" to the file rather than the Chinese character that is the first character in the dictionary. If I use ifstream instead of wfstream, it writes "229". The file is utf-8. What do I have to do to parse a file like this? Do I need an additional library?
For starters, don't use wide characters for file reading. UTF-8 is built from 8-bit code units (one to four bytes per character), whereas wchar_t is generally 16 or 32 bits.
Standard libs kind of suck hardcore for Unicode work. It really helps if you understand how UTF-8 works. I recommend reading the Wikipedia article on UTF-8, specifically the description section, which has a handy dandy chart.
If all you need to do is read UTF-8... here's a routine you can use. I think helios or someone posted something similar to this somewhere else on the forum, but it would take me longer to find it than it would to just rewrite it, so....
Thank you very much. I'll do the research you recommended when I have a chance.
I tried your routine, but I got a long list of compiler errors starting with
error: ambiguous overload for 'operator>>' in 's >> c'
And you're right about wchar_t and files: I tried assigning a Chinese character to a wchar_t as a constant and writing it to a file, and it just came out as a number. Any solutions?
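The number most likely appears because a narrow `std::ofstream` has no character overload of `operator<<` for `wchar_t`, so the value is converted to an integer and printed as digits. One way around it, sketched here under the assumption that the character is available as a Unicode code point, is to encode it to UTF-8 bytes by hand and write those bytes. The name `to_utf8` is hypothetical.

```cpp
#include <string>

// Encodes one Unicode code point as a UTF-8 byte string.
// Returns an empty string for surrogates and for values above U+10FFFF.
std::string to_utf8(unsigned long cp) {
    std::string out;
    if (cp < 0x80) {                                   // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                           // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                         // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not characters
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {                       // 4 bytes: 11110xxx + three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this, `ofile << to_utf8(0x4E2D);` writes the three bytes E4 B8 AD instead of a decimal number.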
Woo. After carefully reading the Wikipedia article, I realized what you were trying to do. I managed to use this routine
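int c;  // get() returns int, so EOF (-1) can be told apart from the byte 255
std::ifstream ufile;
ufile.open ("/initrd/mnt/dev_save/Storage/Downloads/cedict_ts.u8");
perror("w");
std::ofstream ofile;
ofile.open ("/initrd/mnt/dev_save/Storage/Downloads/cedictxml.u8");
perror("[");
// Read a byte, then check for EOF before using it.
while ((c = ufile.get()) != EOF) {
    unsigned char byte = static_cast<unsigned char>(c);
    if (byte <= 127) {  // ASCII only; bytes of 128 and up belong to multi-byte sequences
        std::cout << byte;
    }
}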
to display only ASCII characters. Using the unsigned char type seems to be enough; I just need to tell it not to compare bytes that are part of multi-byte sequences. I think I understand what I need to do now, thank you again for all your help.
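One way to keep the bytes of a sequence together, sketched here under the assumption that a line of the dictionary has already been read into a `std::string`, is to look at each lead byte, work out how long its sequence is, and treat that many bytes as a single unit. The names `utf8_length` and `split_utf8` are hypothetical, and invalid bytes are simply passed through one at a time.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Length in bytes of the sequence that starts with `lead`.
// ASCII and invalid lead bytes count as 1 so that scanning always advances.
std::size_t utf8_length(unsigned char lead) {
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;
}

// Splits UTF-8 text into one string per character, keeping each
// multi-byte sequence intact.
std::vector<std::string> split_utf8(const std::string& text) {
    std::vector<std::string> chars;
    std::size_t i = 0;
    while (i < text.size()) {
        std::size_t n = utf8_length(static_cast<unsigned char>(text[i]));
        if (i + n > text.size()) n = text.size() - i;  // truncated final sequence
        chars.push_back(text.substr(i, n));
        i += n;
    }
    return chars;
}
```

Each element with size 1 can then be compared against ASCII delimiters, while longer elements can be copied to the output unchanged.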