Reading & Printing Unicode / UTF-8 Mac

Forum

Forum
Beginners
Reading & Printing Unicode / UTF-8 Mac

Reading & Printing Unicode / UTF-8 Mac

Hi all, long-time-lurker but 1st-time-poster here, I recently took a dive back into Unicode while trying to research old computer graphics, and quickly found myself in terminal compiling g++ again. It has been a few years since my last foray into this area, and while i think technically wchar_t already existed back then, i don't remember finding the right tutorials, and consequently all i ended up doing was generating plane_0 from hex from some sample code found on stackoverflow(maybe i can make another post about that some time):

http://stackoverflow.com/questions/148403/utf8-to-from-wide-char-conversion-in-stl/148766#148766

http://stackoverflow.com/questions/29474088/write-utf8-representation-of-unicode-to-file

http://stackoverflow.com/questions/19968705/unsigned-integer-as-utf-8-value

...my current interests though involve old ansi graphics, although at the moment I'm sort of experimenting with different fonts and recreating the retro style using unicode(with varying success). At this point I'm not sure if i won't change direction and try to find some way to use the old .ans encodings themselves, but for now it has gotten me playing around with wstrings for the moment...

Basically I'm trying to parse some random unicode snippets i found online(wikipedia), and generate UTF8 versions of old code-pages. i found this thread very helpful:

http://www.cplusplus.com/forum/beginner/107125/

Unfortunatly it was for windows(I didn't see one as good for mac/unix), and while i do have a Win7 laptop, I am currently working mostly on my Mac(10.11).

I got a bunch of compile errors before I finally realized that you don't need <io.h> or <fcntl.h> on mac.

here's some of my compile errors:


fatal error: 'io.h' file not found
#include <io.h>

error: use of undeclared identifier '_O_U16TEXT'
        _setmode(fileno(stdout), _O_U16TEXT);

error: use of undeclared identifier '_O_U8TEXT'
        _setmode(fileno(stdout), _O_U8TEXT);

error: use of undeclared identifier '_setmode'
        _setmode(fileno(stdout), 0x00040000);

error: no member named '_setmode' in namespace 'std'
        int oldMode = std::_setmode(std::_fileno(stdout), std::_O_U8TEXT);

error: no member named '_fileno' in namespace 'std'
        int oldMode = _setmode(std::_fileno(stdout), _O_U8TEXT);

error: no member named '_setmode' in namespace 'std'
        std::_setmode(fileno(stdout), 0x00040000);

anyway, it was a bit confusing until i realized the two libs were system specific(and not needed). played around with it and now it works!

here's my functional code(it basically just finds '\t' and grabs the next char):

// testUTF.cpp
//
// g++ -o quicktest testUTF.cpp
// ./quicktest infile.txt

#include <stdio.h> //<cstdio>
#include <stdlib.h>
#include <iostream>
#include <fstream>
#include <string>

#include <locale>
#include <codecvt>

int main(int argc, char* argv[]){
    if(argc < 2){
        std::wcout << "no input filename.txt given\n";
        return 0;
    }else{
        std::wifstream datain;
        std::wofstream dataout;
        datain.open(argv[1]);
        std::string outfilename = "out_";
        outfilename += argv[1];
        dataout.open(outfilename);
        
        std::locale my_locale(std::locale(), new std::codecvt_utf8<wchar_t>);
        datain.imbue(my_locale);
        dataout.imbue(my_locale); // Also imbue output
        std::wcout.imbue(my_locale); // Also wcout

        wchar_t bom = L'\0';
        datain.get(bom);
        dataout << L'\xFEFF'; // write BOM
        
        int counter = 0;
        std::wstring outstring;// = (wchar_t) ''; // "";
        while(datain.good() && !datain.eof()){
            std::wstring line;
            getline(datain, line);  //get line
            for (int i=0; i<line.length(); i++) {  //scan line
                if (line[i] == '\t') {//find tab
                    outstring += line[i+1];
                    counter++;
                    if (counter % 16 == 0) {
                        outstring += '\n';//"\n";
                    }
                }
            }
        }
        dataout << outstring;
        //std::cout << outstring;
        std::wcout << outstring;// << std::endl;
        datain.close();
        dataout.close();
    }
    return 0;
}

...and my parsed code-page(CP437):


N☺☻♥♦♣♠•◘○◙♂♀♪♫☼
►◄↕‼¶§▬↨↑↓→←∟↔▲▼
 !"#$%&'()*+,-./
0123456789:;<=>?
@ABCDEFGHIJKLMNO
PQRSTUVWXYZ[\]^_
`abcdefghijklmno
pqrstuvwxyz{|}~⌂
ÇüéâäàåçêëèïîìÄÅ
ÉæÆôöòûùÿÖÜ¢£¥₧ƒ
áíóúñÑªº¿⌐¬½¼¡«»
░▒▓│┤╡╢╖╕╣║╗╝╜╛┐
└┴┬├─┼╞╟╚╔╩╦╠═╬╧
╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀
αßΓπΣσµτΦΘΩδ∞φε∩
≡±≥≤⌠⌡÷≈°∙·√ⁿ²■

...so now you have a jump start on printing Unicode(UTF-8) to file on Mac.

Last edited on

TLoke (10)

My apologies, the BOM in the above code is for UTF-16, although for what it's worth, TextEdit didn't seem to know the difference...

also getting the BOM from my infile was causing me to miss the first character of my ifstream, and I'm not sure what the point was since i already imbued, so at this point chosen to leave it out.

I'm still having some trouble with writing the UTF-8 BOM(0xEF,0xBB,0xBF) to my outfile though. I've tried a number of solutions and so far i either get "ï»¿" written at the top of my file or it fails to write entirely. I read around and apparently a BOM is not recommended? although it is supposed to be allowed under UTF-8?

anyway, for now on my mac everything seems to work without a BOM. here is another UTF-8 parser I'm working on:

// g++ -o getUTF8 getUTF8.cpp
// ./getUTF8 filename.txt
// ./getUTF8

#include <cstdio>   // (stdio.h)
#include <cstdlib>  // (stdlib.h)
#include <iostream>
#include <fstream>
#include <string>

#include <locale>
#include <codecvt>

int main(int argc, char* argv[]){
    if(argc < 2){
        std::wcout << "no input filename.txt given\n";
        return 0;
    }else{
        std::wifstream datain;
        std::wofstream dataout;
        datain.open(argv[1]);
        std::string outfilename = "charlist_";
        outfilename += argv[1];
        dataout.open(outfilename);
        
        std::locale my_locale(std::locale(), new std::codecvt_utf8<wchar_t>);
        datain.imbue(my_locale);
        dataout.imbue(my_locale); // Also imbue output
        std::wcout.imbue(my_locale); // Also wcout
        
        //dataout << L'\xFEFF'; // write BOM: utf16? ...is this correct? or necessary?
        //dataout << L'\xEF' << L'\xBB' << L'\xBF'; // ï»¿ visibly written at top
        //dataout << 0xEF << 0xBB << 0xBF; // 239187191 visibly written at top
        //dataout << L'\xEFBBBF'; // resulting file is entirely blank
        
        int counter = 0;
        
        wchar_t charlist[256] = { };
        int numunique = 0;
        
        while(datain.good() && !datain.eof()){
            std::wstring line;
            getline(datain, line);
            for (int i=0; i<line.length(); i++) {  //scan line
                bool isnew = true;
                for (int j=0; j<numunique; j++) {
                    if (line[i] == charlist[j]) {
                        isnew = false;
                    }
                }
                if(isnew){
                    charlist[numunique] = line[i];
                    numunique++;
                    if (numunique > 255){
                        std::wcout << "to many unique characters\n";
                        return 0;
                    }
                }
            }
        }
        for (int i=0; i<numunique; i++) {
            dataout << charlist[i];
            std::wcout << charlist[i];
            if ((i+1) % 16 == 0) {
                dataout << '\n';
                std::wcout << '\n';
            }
        }
        dataout << "\ndetails:\nchar\tvalue\thex\n";
        std::wcout << "\ndetails:\nchar\tvalue\thex\n";
        for (int i=0; i<numunique; i++) {
            int charval = (long) charlist[i];
            dataout << charlist[i] << '\t' << std::dec << charval 
		<< "\t0x" << std::hex << charval;
            std::wcout << charlist[i] << '\t' << std::dec << charval 
		<< "\t0x" << std::hex << charval;
            dataout << '\n';
            std::wcout << '\n';
        }
        std::wcout << '\n';
        datain.close();
        dataout.close();
    }
    return 0;
}

Last edited on

TLoke (10)

...and here's a text sample i made for testing:

▓▒▓▒░▒░ ░               ⌐¬⌐¬⌐¬⌐¬⌐¬ ═╗
█▓▒▓▒░ ░ ░         ≤ >   ⌐¬⌐¬⌐¬⌐¬⌐¬ ║
▓█▓▒▓▒░▒░ ░ ░               ⌐¬⌐¬⌐¬ ╔╝
█▓█▓▒▓▒▓▒░▒░▒░         ▼▲    ⌐¬⌐¬ ╔╝
██▓█▓█▓▒▓▒▓▒▓▒░         ▼▲▼~      ╚╗
█▓█▓█▓█▓█▓█▓█▓▒░       ▼ ▼▲        ╚╗
▓█▓▒▓▒▓▒▓█▓▒▓▒░○▐▀▌○      ▼         ╚╗
█▓▒▓▒░▒░▒▓▒░▒░▄██▄█◙▄          ◥◤◢◣◥◤║
▓▒▓▒░▒░ ░▒░ ░ ∙█▌■▐█∙      ⇉ ⇊ ◢◣◥◤◢◣║
▒▓▒░ ░ ░ ░    ▀██▀██▀      ⇈ ⇇ ◥◤◢◣◥◤║
░▒░ ░          ○▐▄▌○   ЯR   ┌┐ ◢◣◥◤◢◣║
▒░ ░    ┌─┐                ─┤╘╦╗   ╒═╝
░ ♥ ♠┌───┐┘  ▄▄▄▄   ┌─┐┌┐   ├─╢║   │
▒░♣ ♦│²√i│  ▐▄▌▐▄▌  │├┼┤│  ┼┤ ║╟───┘
░▒░ ┌└───┘ ▬█■▐▌■█▬ │  │├┐ ┼│╒╣║
▒░▒░└─┘     ▐▀▌▐▀▌  └──┘└┴┬┴┼┘╠╬╗┌┬┐
▓▒░▒░ ░      ▀▀▀▀    ╔═╤═══╧═╧═╝╚╬╛│
▒░ ░ ░               ║ └──┐   ╔══╝ │
░ ░ ░ ░        Σ     ╚════╧═══╝    ╘╗
            ᔧ      o  ṓ    ╔════════╝
⎲  깅 킸 令           ╔════╝
⎳         丯 穴      ║
    ꃿ   ꤰꤻꤼ           ╚═╗
a     ﲄ   ﾼﾨ     b     ║
  ｼ                 ╔═══╝
 ⺋ ㅏ 䢝        ╔═══╝
   ䷝     ══════╝

...and the result:

▓▒░ ⌐¬═╗█≤>║╔╝▼▲
~╚○▐▀▌▄◙◥◤◢◣∙■⇉⇊
⇈⇇ЯR┌┐─┤╘╦╒♥♠┘├╢
│♣♦²√i┼╟└▬╣┴┬╠╬╤
╧╛Σᔧoṓ⎲깅킸令⎳丯穴ꃿꤰꤻ
ꤼaﲄﾼﾨbｼ⺋ㅏ䢝䷝
details:
char	value	hex
▓	9619	0x2593
▒	9618	0x2592
░	9617	0x2591
 	32	0x20
⌐	8976	0x2310
¬	172	0xac
═	9552	0x2550
╗	9559	0x2557
█	9608	0x2588
≤	8804	0x2264
>	62	0x3e
║	9553	0x2551
╔	9556	0x2554
╝	9565	0x255d
▼	9660	0x25bc
▲	9650	0x25b2
~	126	0x7e
╚	9562	0x255a
○	9675	0x25cb
▐	9616	0x2590
▀	9600	0x2580
▌	9612	0x258c
▄	9604	0x2584
◙	9689	0x25d9
◥	9701	0x25e5
◤	9700	0x25e4
◢	9698	0x25e2
◣	9699	0x25e3
∙	8729	0x2219
■	9632	0x25a0
⇉	8649	0x21c9
⇊	8650	0x21ca
⇈	8648	0x21c8
⇇	8647	0x21c7
Я	1071	0x42f
R	82	0x52
┌	9484	0x250c
┐	9488	0x2510
─	9472	0x2500
┤	9508	0x2524
╘	9560	0x2558
╦	9574	0x2566
╒	9554	0x2552
♥	9829	0x2665
♠	9824	0x2660
┘	9496	0x2518
├	9500	0x251c
╢	9570	0x2562
│	9474	0x2502
♣	9827	0x2663
♦	9830	0x2666
²	178	0xb2
√	8730	0x221a
i	105	0x69
┼	9532	0x253c
╟	9567	0x255f
└	9492	0x2514
▬	9644	0x25ac
╣	9571	0x2563
┴	9524	0x2534
┬	9516	0x252c
╠	9568	0x2560
╬	9580	0x256c
╤	9572	0x2564
╧	9575	0x2567
╛	9563	0x255b
Σ	931	0x3a3
ᔧ	5415	0x1527
o	111	0x6f
ṓ	7763	0x1e53
⎲	9138	0x23b2
깅	44613	0xae45
킸	53432	0xd0b8
令	20196	0x4ee4
⎳	9139	0x23b3
丯	20015	0x4e2f
穴	31348	0x7a74
ꃿ	41215	0xa0ff
ꤰ	43312	0xa930
ꤻ	43323	0xa93b
ꤼ	43324	0xa93c
a	97	0x61
ﲄ	64644	0xfc84
ﾼ	65468	0xffbc
ﾨ	65448	0xffa8
b	98	0x62
ｼ	65404	0xff7c
⺋	11915	0x2e8b
ㅏ	12623	0x314f
䢝	18589	0x489d
䷝	19933	0x4ddd

I'm still a little unsure about my method to grab the decimal and hex values for each character, and somewhat puzzled as to why my hex values don't match up with the xxd terminal output...?:

0000000: e296 93e2 9692 e296 93e2 9692 e296 91e2  ................
0000010: 9692 e296 9120 e296 9120 2020 2020 2020  ..... ...       
0000020: 2020 2020 2020 2020 e28c 90c2 ace2 8c90          ........
0000030: c2ac e28c 90c2 ace2 8c90 c2ac e28c 90c2  ................
0000040: ac20 e295 90e2 9597 0ae2 9688 e296 93e2  . ..............
0000050: 9692 e296 93e2 9692 e296 9120 e296 9120  ........... ... 
0000060: e296 9120 2020 2020 2020 2020 e289 a420  ...         ... 
0000070: 3e20 2020 e28c 90c2 ace2 8c90 c2ac e28c  >   ............
0000080: 90c2 ace2 8c90 c2ac e28c 90c2 ac20 e295  ............. ..
0000090: 910a e296 93e2 9688 e296 93e2 9692 e296  ................
00000a0: 93e2 9692 e296 91e2 9692 e296 9120 e296  ............. ..
00000b0: 9120 e296 9120 2020 2020 2020 2020 2020  . ...           
00000c0: 2020 2020 e28c 90c2 ace2 8c90 c2ac e28c      ............
00000d0: 90c2 ac20 e295 94e2 959d 0ae2 9688 e296  ... ............
00000e0: 93e2 9688 e296 93e2 9692 e296 93e2 9692  ................
00000f0: e296 93e2 9692 e296 91e2 9692 e296 91e2  ................
0000100: 9692 e296 9120 2020 2020 2020 2020 e296  .....         ..
0000110: bce2 96b2 2020 2020 e28c 90c2 ace2 8c90  ....    ........
0000120: c2ac 20e2 9594 e295 9d0a e296 88e2 9688  .. .............
0000130: e296 93e2 9688 e296 93e2 9688 e296 93e2  ................
0000140: 9692 e296 93e2 9692 e296 93e2 9692 e296  ................
0000150: 93e2 9692 e296 9120 2020 2020 2020 2020  .......         
0000160: e296 bce2 96b2 e296 bc7e 2020 2020 2020  .........~      
0000170: e295 9ae2 9597 0ae2 9688 e296 93e2 9688  ................
0000180: e296 93e2 9688 e296 93e2 9688 e296 93e2  ................
0000190: 9688 e296 93e2 9688 e296 93e2 9688 e296  ................
00001a0: 93e2 9692 e296 9120 2020 2020 2020 e296  .......       ..
...
..
.

Last edited on

coder777 (8439)

i either get "ï»¿" written at the top of my file or it fails to write entirely

What does that mean? The output of an invalid char will not be valid.

I read around and apparently a BOM is not recommended? although it is supposed to be allowed under UTF-8?

For UTF-8 a BOM (byte order mark) does not sever any purpose. See:

https://en.wikipedia.org/wiki/Byte_order_mark

For "ï»¿": Take a look at 'Bytes as CP1252 characters'. It's indeed the UTF-8 BOM.

I'm still a little unsure about my method to grab the decimal and hex values for each character, and somewhat puzzled as to why my hex values don't match up with the xxd terminal output...?:

It looks like you have a problem with negative values. e296 -> -7530

TLoke (10)

...still working on this. I decided to switch to hexdump(more readable):

0000000 e2 96 93 e2 96 92 e2 96 93 e2 96 92 e2 96 91 e2
0000010 96 92 e2 96 91 20 e2 96 91 20 20 20 20 20 20 20
0000020 20 20 20 20 20 20 20 20 e2 8c 90 c2 ac e2 8c 90
0000030 c2 ac e2 8c 90 c2 ac e2 8c 90 c2 ac e2 8c 90 c2
0000040 ac 20 e2 95 90 e2 95 97 0a e2 96 88 e2 96 93 e2
0000050 96 92 e2 96 93 e2 96 92 e2 96 91 20 e2 96 91 20
0000060 e2 96 91 20 20 20 20 20 20 20 20 20 e2 89 a4 20
0000070 3e 20 20 20 e2 8c 90 c2 ac e2 8c 90 c2 ac e2 8c
0000080 90 c2 ac e2 8c 90 c2 ac e2 8c 90 c2 ac 20 e2 95
0000090 91 0a e2 96 93 e2 96 88 e2 96 93 e2 96 92 e2 96
00000a0 93 e2 96 92 e2 96 91 e2 96 92 e2 96 91 20 e2 96
00000b0 91 20 e2 96 91 20 20 20 20 20 20 20 20 20 20 20
00000c0 20 20 20 20 e2 8c 90 c2 ac e2 8c 90 c2 ac e2 8c
00000d0 90 c2 ac 20 e2 95 94 e2 95 9d 0a e2 96 88 e2 96
00000e0 93 e2 96 88 e2 96 93 e2 96 92 e2 96 93 e2 96 92
00000f0 e2 96 93 e2 96 92 e2 96 91 e2 96 92 e2 96 91 e2
0000100 96 92 e2 96 91 20 20 20 20 20 20 20 20 20 e2 96
0000110 bc e2 96 b2 20 20 20 20 e2 8c 90 c2 ac e2 8c 90
0000120 c2 ac 20 e2 95 94 e2 95 9d 0a e2 96 88 e2 96 88
0000130 e2 96 93 e2 96 88 e2 96 93 e2 96 88 e2 96 93 e2
0000140 96 92 e2 96 93 e2 96 92 e2 96 93 e2 96 92 e2 96
0000150 93 e2 96 92 e2 96 91 20 20 20 20 20 20 20 20 20
...
..
.

...still doesn't match my Unicode (hex)values. i checked online and it appears I am printing the correct Unicode value for each corresponding character, although it's kind of confusing that none of the hexdump matches -- with the exception of ascii(and i'm guessing the rest of CP1252).

Upon further investigation perhaps this makes sense, and I did notice a pattern: the following codes "e2 97", "e2 96", "e2 95", "e2 94" and also "e2 8c", "e2 88", "e2 87" all repeat a lot throughout the dump, each time followed by a specific character code. These codes seem to correspond to the second byte of each of my Unicode characters, and for single-byte characters they match exactly.

...after doing a little more digging around(https://en.wikipedia.org/wiki/UTF-8), it appears that the contents of hexdump are in fact UTF-8 whereas the "hex-values" I'm printing in my script are actually Unicode code-points. i think it's worth noting I'm only printing wchars from plane_0 atm, so just dealing with char-codes 2 bytes and shorter. all my single byte chars appear unchanged, although all the code-points corresponding to 2-byte utf-8 are now encoded as such(including leading byte 0xE#).

...so is it not possible to easily print the UTF-8 hex value via wcout? do i need write a little function to translate the code-points myself? ...fwiw, I think that may be within my ability ;)

Also, I'd like to bring up one funny thing which sort of threw me for a loop:
That is, when i write the UTF-16 BOM(dataout << L'\xFEFF';) into my ostream, it appears to actually encode the UTF-8 BOM(0000: EF BB BF) into my file. ...unexpected...

It's worth noting that both files appear outwardly identical whether or not i include the BOM, though hexdump exposed the difference:

here is the first line of hexdump from my output including L'\xFEFF' BOM:

0000000 ef bb bf e2 96 93 e2 96 92 e2 96 91 20 e2 8c 90

and here is with no BOM:

0000000 e2 96 93 e2 96 92 e2 96 91 20 e2 8c 90 c2 ac e2

and if i try dataout << L'\xEF' << L'\xBB' << L'\xBF'; for my BOM, i then get "ï»¿" visible at the top of my file, and my hex dump looks like this:

0000000 c3 af c2 bb c2 bf e2 96 93 e2 96 92 e2 96 91 20

...so at this point I've deduced that there's some "smart stuff" going on behind the compiler. I'm just confused as to why i have to write the UTF-16 BOM in my code in order to get the UTF-8 BOM in my file. I'm sure there is an explanation...

Anyway, is dataout << L'\xFEFF'; the preferred way to write the UTF-8 BOM using a wofstream?

Last edited on

coder777 (8439)

...so is it not possible to easily print the UTF-8 hex value via wcout?

You should not print UTF-8 with wcout. Rather use cout. Since wcout expects two byte characters and UTF-8 is only one byte.

That is, when i write the UTF-16 BOM(dataout << L'\xFEFF';) into my ostream, it appears to actually encode the UTF-8 BOM(0000: EF BB BF) into my file.

Can't imagine that this happens. Something else must be wrong.

Those hexdumps seems not to make any sense. Again: UTF-8 and wofstream don't match. Use ofstream instead.

TLoke (10)

Are you some sort of joker? first "negative numbers" and now this. I'm just trying to figure things out and i don't need someone repeatedly giving me wrong answers.

Just to clarify, everyone should know that UTF-8 is not just 1 byte long: https://tools.ietf.org/html/rfc3629

...and about the BOM, I'm not making anything up. You can check yourself. I strongly expect you will get the same result...

...btw, there is nothing wrong with my hexdump. It is definitely UTF-8. here check out these links
(remember to word search "UTF-8"):
https://www.fileformat.info/info/unicode/char/2593/index.htm
https://www.fileformat.info/info/unicode/char/2592/index.htm
https://www.fileformat.info/info/unicode/char/2591/index.htm

mbozzi (3910)

UTF-8 is not just 1 byte long

UTF-8 code units are one byte long. A single code point is encoded using a variable length sequence of code units; the resulting byte string needs to be written using ofstream. No wide characters are involved.

Last edited on

TLoke (10)

interesting, so you are saying I can do it byte by byte using cout. what about reading my file though? or parsing (multibyte)code-points without using wchar_t? It just sounds more difficult to do that way... that said, I was looking for an easy way to output UTF-8 code-points in hex, whereas my current code keeps giving me Unicode.

Either way, I'd just like to note, my code works and already prints UTF-8, as demonstrated.

...and i noticed i have been reported in retaliation... so now what? It's just that i take exception to someone repeatedly posting bad information on my post, which i have spent my valuable time working on, and who keeps denying what I have clearly demonstrated to be true. test the code if you don't believe me. It should compile and run the same for all unix/linux systems(or is this not a true statement?).

TLoke (10)

...and sorry if this is posted in the wrong forum, but the thread I wanted to reply to in the first place was also posted in the beginner forum, although it dealt with Windows(and was archived). So if someone want's to test my code in Windows, all they need to do is make the appropriate changes found in the other thread, which i linked in my first post at the top. thanks!

Last edited on

mbozzi (3910)

I was looking for an easy way to output UTF-8 code-points in hex

UTF-8 code points don't exist. instead, there are Unicode code points which can be encoded using one of the Unicode Transmission Formats, of which UTF-8 is an example.

On Unix-like systems, Mac OS X included, wchar_t is generally four bytes wide. A 4-byte wchar_t (vs. a 2-byte wchar_t) is generally more useful for Unicode transmission because it preserves a one-to-one correspondence of UTF-32 code units to Unicode scalar values. UTF-32 is sometimes used internally, partially because this correspondence can improve the time complexity of certain text processing algorithms. Internal use of UTF-32 typically implies some amount of round-trip format conversion to/from UTF-8, because it's ubiquitous as an interchange format.

A Unicode scalar value is a code point number, excepting surrogates. So when you convert the UTF-8 representation of ▓ to UTF-32, you get a 4-byte number with the expected value 0x2593. This is accomplished by reading from a stream imbued with my_locale. Correspondingly, dataout performs the round-trip conversion.

when i write the UTF-16 BOM into my ostream, it appears to actually encode the UTF-8 BOM

Other than the fact that characters are wider than you think, this is explained by the round-trip conversion.

Last edited on

TLoke (10)

...0x2593 is only 2 bytes... sorry, the leading zeros must be hidden[beginner here]. I just want to reiterate that my program is writing UTF-8 successfully (like i asked it to).

...So I'm assuming the hex values i'm looking at are UTF-32 then? It's funny how everything else is getting converted properly before printing though...

...and about the BOM, i just want to note that i'm typing in the Big Endian(i think) UTF-16 BOM, although my out-file has a UTF-8 BOM.

Last edited on

Topic archived. No new replies allowed.

C++

Forum

Reading & Printing Unicode / UTF-8 Mac