How can I parse a single character from a unicode text file?

I'm trying to write a program to rewrite a Chinese dictionary into a form I can use with another program. It needs to parse the data character by character and act accordingly. If I do this:
#include <fstream>
#include <string>
#include <iostream>

int main () {
	wchar_t c;
	std::wfstream ufile;
  	ufile.open ("/initrd/mnt/dev_save/Storage/Downloads/cedict_ts.u8");
  	
  	std::ofstream ofile;
	ofile.open ("/initrd/mnt/dev_save/Storage/Downloads/cedictxml.u8");
  	c = ufile.get();
  	ofile << c;
       return 0;
}
It just writes "-1" to the file rather than the Chinese character that is the first character in the dictionary. If I use ifstream instead of wfstream, it writes "229". The file is UTF-8. What do I have to do to parse a file like this? Do I need an additional library?


For starters, don't use wide characters for file reading. UTF-8 is built from 8-bit units (one character takes between one and four of them), whereas wchar_t is generally 16 or 32 bits.

Standard libs kind of suck hardcore for Unicode work. It really helps if you understand how UTF-8 works. I recommend reading the Wikipedia article, specifically the Description section, which has a handy dandy chart:

http://en.wikipedia.org/wiki/Utf-8#Description
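
To make that chart concrete, here is a minimal sketch of how the first byte of a sequence tells you how many bytes the whole character occupies. The name utf8_sequence_length is made up for this illustration, and the function looks only at the bit pattern of the first byte: it does not reject the handful of lead bytes (0xC0, 0xC1, 0xF5 to 0xF7) that can never occur in well-formed UTF-8.

```cpp
#include <cstddef>

// Hypothetical helper (not from the thread): number of bytes in a UTF-8
// sequence, judged only by the bit pattern of its first byte.
// Returns 0 for a byte that cannot start a sequence.
std::size_t utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80) return 1;  // 0xxxxxxx: plain ASCII
    if (lead < 0xC0) return 0;  // 10xxxxxx: continuation byte, not a lead
    if (lead < 0xE0) return 2;  // 110xxxxx
    if (lead < 0xF0) return 3;  // 1110xxxx
    if (lead < 0xF8) return 4;  // 11110xxx
    return 0;                   // 0xF8 and above never appear in UTF-8
}
```

For the byte 229 (0xE5) that ifstream produced in the question, this returns 3, so that byte is the start of a three-byte character rather than a character by itself.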

If all you need to do is read UTF-8... here's a routine you can use. I think helios or someone posted something similar to this somewhere else on the forum, but it would take me longer to find it than it would to just rewrite it, so....

wchar_t GetUTF8(const std::istream& s)
{
  char c = 0;
  wchar_t ret = '?';
  s >> c;

  if(c < 0x80)		// 1-byte code
    ret = c;
  else if(c < 0xC0)     // invalid
    ;
  else if(c < 0xE0)	// 2-byte code
  {
    ret =  (c & 0x1F) << 6;    s >> c;
    ret |= (c & 0x3F);
  }
  else if(c < 0xF0)     // 3-byte code
  {
    ret =  (c & 0x0F) << 12;   s >> c;
    ret |= (c & 0x3F) <<  6;   s >> c;
    ret |= (c & 0x3F);
  }
  else if(c < 0xF8)     // 4-byte code
  {
    // make sure wchar_t is large enough to hold it
    if(std::numeric_limits<wchar_t>::max() > 0xFFFF)
    {
      ret =  (c & 0x07) << 18;   s >> c;
      ret |= (c & 0x3F) << 12;   s >> c;
      ret |= (c & 0x3F) <<  6;   s >> c;
      ret |= (c & 0x3F);
    }
  }

  return ret;
}


EDIT:

Whether writing wchar_t's to a file actually works is another matter.
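
One way around the output problem is to skip wide streams altogether and turn each decoded value back into UTF-8 bytes by hand, then write those bytes to an ordinary std::ofstream. The following is only a sketch: encode_utf8 is a hypothetical helper name, and it assumes a C++11 compiler so that char32_t is available.

```cpp
#include <string>

// Hypothetical helper (not from the thread): encodes one Unicode code point
// as UTF-8. Returns an empty string for values that are not valid code
// points (surrogates, or anything above 0x10FFFF).
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {                                   // 1-byte code
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                           // 2-byte code
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {                         // 3-byte code
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not characters
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {                       // 4-byte code
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this, something like ofile << encode_utf8(cp); writes the character itself instead of its numeric value, because the stream only ever sees plain bytes.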
Thank you very much. I'll do the research you recommended when I have a chance.

I tried your routine, but I got a long list of compiler errors starting with
error: ambiguous overload for 'operator>>' in 's >> c'


And you're right about wchar_t and files: I tried assigning a Chinese character to a wchar_t as a constant and writing it to a file, and it just came out as a number. Any solutions?
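
The ambiguous-overload error most likely comes from the const std::istream& parameter in GetUTF8: extracting from a stream modifies it, so operator>> cannot be applied to a const stream. The routine has two further problems that only show up at run time. With a plain char, which is signed on many platforms, bytes of 0x80 and above compare as negative and every byte lands in the 1-byte branch; and s >> c skips whitespace, so spaces and newlines would be lost. Below is a sketch of one way to fix all three. The names get_utf8 and decode_all are invented for this example, char32_t (C++11) is assumed instead of wchar_t so that 4-byte codes always fit, and malformed input is reported as U+FFFD. It does not check for overlong encodings or surrogates.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical replacement for GetUTF8. Reads one code point into `out` and
// returns false once the stream has no more bytes. A malformed or truncated
// sequence yields U+FFFD (the Unicode replacement character).
bool get_utf8(std::istream& s, char32_t& out)
{
    const int eof = std::char_traits<char>::eof();
    const int first = s.get();            // get() does not skip whitespace
    if (first == eof) return false;

    const unsigned char lead = static_cast<unsigned char>(first);
    int extra;                            // continuation bytes still to read
    if      (lead < 0x80) { out = lead;        extra = 0; }
    else if (lead < 0xC0) { out = 0xFFFD; return true; }   // stray continuation byte
    else if (lead < 0xE0) { out = lead & 0x1F; extra = 1; }
    else if (lead < 0xF0) { out = lead & 0x0F; extra = 2; }
    else if (lead < 0xF8) { out = lead & 0x07; extra = 3; }
    else                  { out = 0xFFFD; return true; }   // never valid in UTF-8

    for (int i = 0; i < extra; ++i) {
        const int next = s.peek();        // look before consuming
        if (next == eof || (next & 0xC0) != 0x80) { out = 0xFFFD; return true; }
        s.get();
        out = (out << 6) | (next & 0x3F); // append the 6 payload bits
    }
    return true;
}

// Convenience wrapper for experimenting: decodes a whole byte string.
std::u32string decode_all(const std::string& bytes)
{
    std::istringstream in(bytes);
    std::u32string result;
    char32_t cp;
    while (get_utf8(in, cp)) result += cp;
    return result;
}
```

For a real file, the same get_utf8 can be called in a loop on a std::ifstream, preferably one opened with std::ios::binary so that no newline translation gets in the way.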
Woo. After carefully reading the Wikipedia article, I realized what you were trying to do. I managed to use this routine
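  std::ifstream ufile;
  ufile.open ("/initrd/mnt/dev_save/Storage/Downloads/cedict_ts.u8");
  if (!ufile) perror("cedict_ts.u8");      // report only when the open failed

  std::ofstream ofile;
  ofile.open ("/initrd/mnt/dev_save/Storage/Downloads/cedictxml.u8");
  if (!ofile) perror("cedictxml.u8");

  // get() returns an int so that EOF can be told apart from a real byte;
  // an unsigned char can never compare equal to EOF.
  int ch;
  while ((ch = ufile.get()) != EOF) {
    unsigned char c = static_cast<unsigned char>(ch);
    if (c <= 127) {                         // single-byte (ASCII) code
      std::cout << c;
    }
  }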
to display only ASCII characters. Using the unsigned char type for the bytes seems to be enough; I just need to tell it not to compare bytes that are part of multi-byte sequences. I think I understand what I need to do now. Thank you again for all your help.
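
As a rough illustration of telling sequence bytes apart from ASCII: every byte inside a multi-byte character has its top bit set, and the bytes after the first one all have the form 10xxxxxx. The sketch below keeps the ASCII bytes and counts each multi-byte character once, at its lead byte. AsciiSplit and split_ascii are names invented for this example, and the input is assumed to be well-formed UTF-8.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical result type: the ASCII bytes that were kept, plus the number
// of non-ASCII characters that were stepped over.
struct AsciiSplit {
    std::string ascii;
    std::size_t non_ascii = 0;
};

AsciiSplit split_ascii(std::istream& in)
{
    AsciiSplit result;
    int ch;  // int, so end-of-file stays distinct from the byte value 255
    while ((ch = in.get()) != std::char_traits<char>::eof()) {
        const unsigned char b = static_cast<unsigned char>(ch);
        if (b < 0x80) {
            result.ascii += static_cast<char>(b);  // single-byte code: keep it
        } else if ((b & 0xC0) != 0x80) {
            ++result.non_ascii;                    // lead byte: a new character starts
        }
        // otherwise: continuation byte (10xxxxxx) of a character already counted
    }
    return result;
}

// Convenience overload for trying the function on an in-memory string.
AsciiSplit split_ascii(const std::string& bytes)
{
    std::istringstream in(bytes);
    return split_ascii(in);
}
```

The same loop works on a std::ifstream; only the in-memory overload is specific to this example.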