Hi,
So, I have a UTF-8 text file containing Cyrillic text. I want to copy it to another file and print it to the console. Copying works fine, but the output on the screen is all wrong symbols. Here's the code
Each Unicode character corresponds to one wchar_t in the wstring.
Almost. wchar_t isn't wide enough to hold codepoints above U+FFFF, so in that case you'd need two wchar_ts (a surrogate pair). Though granted, those codepoints are very, very rarely used.
From what I've seen... WinAPI treats wchar_t as UTF-16. The C++ standard library, however, is ambivalent/clueless as to the existence of Unicode. I remember trying to help someone else output Unicode to the console and it was a huge pain. Ultimately, the easiest way to do it was with WinAPI calls (i.e., not using cout/wcout at all).
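It doesn't take much, either. Something like this should do it (a rough sketch; PrintWide is just a name I made up, and note that WriteConsoleW only works on a real console handle, not on output redirected to a file):

#include <windows.h>
#include <string>

//write a UTF-16 string straight to the console with WriteConsoleW,
//bypassing cout/wcout entirely
void PrintWide(const std::wstring &s){
    DWORD written=0;
    WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE),s.c_str(),
        static_cast<DWORD>(s.size()),&written,NULL);
}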
EDIT:
Also -- reading UTF-8 isn't difficult. It's not something you need a whole library for.
I can throw together a function for you when I get home from work, but my break time is almost up so I don't have time now.
- it doesn't look for a null termination character, it just reads until EOF or some other file error
- it has minimal error checking
- I didn't actually test it
- it assumes wchar_t is 16 bits wide
- it decodes to UTF-16 surrogate pairs for codepoints above U+FFFF (4-byte sequences); see the sketch after this list
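To give an idea of what that looks like, here's a rough sketch with those properties (ReadUTF8 is just a name I picked, and it's as untested as promised):

#include <cstdio>
#include <string>

//reads UTF-8 from an already-open file until EOF (no null terminator
//expected), appending UTF-16 to out; assumes wchar_t is 16 bits wide
void ReadUTF8(std::wstring &out,std::FILE *f){
    int b;
    while ((b=std::fgetc(f))!=EOF){
        unsigned long cp; //decoded codepoint
        int extra;        //continuation bytes that follow
        if (!(b&0x80))          { cp=b;      extra=0; } //0xxxxxxx
        else if ((b&0xE0)==0xC0){ cp=b&0x1F; extra=1; } //110xxxxx
        else if ((b&0xF0)==0xE0){ cp=b&0x0F; extra=2; } //1110xxxx
        else if ((b&0xF8)==0xF0){ cp=b&0x07; extra=3; } //11110xxx
        else continue; //stray continuation byte -- minimal error checking
        while (extra--){
            if ((b=std::fgetc(f))==EOF)
                return; //truncated sequence: bail out
            cp=(cp<<6)|(b&0x3F);
        }
        if (cp<=0xFFFF) //fits in a single wchar_t
            out+=static_cast<wchar_t>(cp);
        else{ //encode as a UTF-16 surrogate pair
            cp-=0x10000;
            out+=static_cast<wchar_t>(0xD800|(cp>>10));
            out+=static_cast<wchar_t>(0xDC00|(cp&0x3FF));
        }
    }
}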
I like mine better, but I will steal your loops.
Contrary to what I was expecting, this version and the last one are equally fast (tested with 50 million characters).
#include <cwchar> //for WCHAR_MAX

typedef unsigned char uchar;
typedef unsigned long ulong;

//decodes srcl bytes of UTF-8 from src into dst; dst must have room
//for up to srcl wchar_ts
void UTF8_WC(wchar_t *dst,const uchar *src,ulong srcl){
    for (ulong a=0;a<srcl;a++){
        uchar byte=*src++;
        wchar_t c=0;
        if (!(byte&0x80)) //single-byte (ASCII) character
            c=byte;
        else{
            //count the leading 1 bits to get the sequence length
            ulong size=0,
                mask=0x80;
            c=byte;
            for (;c&mask;mask>>=1)
                size++;
            size--; //continuation bytes that follow
            c&=mask-1; //keep only the payload bits of the lead byte
#if WCHAR_MAX==0xFFFF //<-- I don't have much trust in this directive. Neither should you.
            if (size>2){ //codepoint above U+FFFF won't fit in 16 bits
                src+=size; //skip the continuation bytes to stay in sync
                a+=size;
                size=0;
                c='?';
            }
#endif
            for (;size;size--,a++){
                c<<=6;
                c|=*src++&0x3F; //append 6 payload bits per continuation byte
            }
        }
        *dst++=c;
    }
}
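One note if you use it: the output never has more characters than the input has bytes, so sizing dst to srcl wchar_ts is always enough. Something like:

#include <vector>

std::vector<wchar_t> buf(srcl); //srcl characters is always enough room
UTF8_WC(&buf[0],src,srcl);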
I don't know which one looks nicer. What do you think?