...still working on this. I decided to switch to hexdump (more readable):
0000000 e2 96 93 e2 96 92 e2 96 93 e2 96 92 e2 96 91 e2
0000010 96 92 e2 96 91 20 e2 96 91 20 20 20 20 20 20 20
0000020 20 20 20 20 20 20 20 20 e2 8c 90 c2 ac e2 8c 90
0000030 c2 ac e2 8c 90 c2 ac e2 8c 90 c2 ac e2 8c 90 c2
0000040 ac 20 e2 95 90 e2 95 97 0a e2 96 88 e2 96 93 e2
0000050 96 92 e2 96 93 e2 96 92 e2 96 91 20 e2 96 91 20
0000060 e2 96 91 20 20 20 20 20 20 20 20 20 e2 89 a4 20
0000070 3e 20 20 20 e2 8c 90 c2 ac e2 8c 90 c2 ac e2 8c
0000080 90 c2 ac e2 8c 90 c2 ac e2 8c 90 c2 ac 20 e2 95
0000090 91 0a e2 96 93 e2 96 88 e2 96 93 e2 96 92 e2 96
00000a0 93 e2 96 92 e2 96 91 e2 96 92 e2 96 91 20 e2 96
00000b0 91 20 e2 96 91 20 20 20 20 20 20 20 20 20 20 20
00000c0 20 20 20 20 e2 8c 90 c2 ac e2 8c 90 c2 ac e2 8c
00000d0 90 c2 ac 20 e2 95 94 e2 95 9d 0a e2 96 88 e2 96
00000e0 93 e2 96 88 e2 96 93 e2 96 92 e2 96 93 e2 96 92
00000f0 e2 96 93 e2 96 92 e2 96 91 e2 96 92 e2 96 91 e2
0000100 96 92 e2 96 91 20 20 20 20 20 20 20 20 20 e2 96
0000110 bc e2 96 b2 20 20 20 20 e2 8c 90 c2 ac e2 8c 90
0000120 c2 ac 20 e2 95 94 e2 95 9d 0a e2 96 88 e2 96 88
0000130 e2 96 93 e2 96 88 e2 96 93 e2 96 88 e2 96 93 e2
0000140 96 92 e2 96 93 e2 96 92 e2 96 93 e2 96 92 e2 96
0000150 93 e2 96 92 e2 96 91 20 20 20 20 20 20 20 20 20
...
...still doesn't match my Unicode (hex) values. I checked online and it appears I am printing the correct Unicode value for each corresponding character, although it's kind of confusing that none of the hexdump matches -- with the exception of ASCII (and I'm guessing the rest of CP1252).
Upon further investigation perhaps this makes sense, and I did notice a pattern: the codes "e2 97", "e2 96", "e2 95", "e2 94" and also "e2 8c", "e2 88", "e2 87" all repeat a lot throughout the dump, each time followed by one more byte. That trailing byte seems to correspond to the second (low) byte of each of my Unicode characters, and my single-byte characters match exactly.
...after doing a little more digging around (https://en.wikipedia.org/wiki/UTF-8), it appears that the contents of hexdump are in fact UTF-8, whereas the "hex values" I'm printing in my script are actually Unicode code points. I think it's worth noting I'm only printing wchars from Plane 0 at the moment, so I'm just dealing with code points two bytes and shorter. All my single-byte chars appear unchanged, but every code point above U+07FF now comes out as a three-byte UTF-8 sequence (with leading byte 0xE#), exactly as the table on that page describes. Here's one worked out by hand:
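Take U+2593 (the ▓ from my art): it falls in the U+0800..U+FFFF row of that table, so it gets the three-byte layout 1110xxxx 10xxxxxx 10xxxxxx:

code point U+2593 = 0010 0101 1001 0011   (16 bits)
fill 4 + 6 + 6    = 1110[0010] 10[010110] 10[010011]
                  = e2 96 93

...which is exactly the first three bytes of the dump above.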
...so is it not possible to easily print the UTF-8 hex values via wcout? Do I need to write a little function to translate the code points myself? ...fwiw, I think that may be within my ability ;)
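For the record, here's a rough sketch of the kind of translation function I had in mind (the name and details are mine, it's BMP-only per my plane-0 caveat, and it doesn't reject the D800-DFFF surrogate range):

#include <string>

// Sketch: encode one BMP code point (<= U+FFFF) as UTF-8,
// following the bit layouts from the Wikipedia page above.
std::string to_utf8(wchar_t cp)
{
    std::string out;
    if (cp < 0x80) {                 // 1 byte:  0xxxxxxx (ASCII passes through)
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                         // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// e.g. to_utf8(L'\x2593') returns "\xe2\x96\x93" -- matching the dump

The resulting bytes could then be printed in hex through a plain cout rather than wcout.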
Also, I'd like to bring up one funny thing which sort of threw me for a loop:
That is, when I write the UTF-16 BOM (dataout << L'\xFEFF';) into my ostream, it appears to actually encode the UTF-8 BOM (0000: EF BB BF) into my file. ...unexpected...
It's worth noting that both files appear outwardly identical whether or not I include the BOM, though hexdump exposed the difference:
Here is the first line of hexdump from my output including the L'\xFEFF' BOM:
0000000 ef bb bf e2 96 93 e2 96 92 e2 96 91 20 e2 8c 90 |
and here it is with no BOM:
0000000 e2 96 93 e2 96 92 e2 96 91 20 e2 8c 90 c2 ac e2 |
And if I try dataout << L'\xEF' << L'\xBB' << L'\xBF'; for my BOM, I then get "ï»¿" visible at the top of my file, and my hexdump looks like this:
0000000 c3 af c2 bb c2 bf e2 96 93 e2 96 92 e2 96 91 20 |
...so at this point I've deduced that there's some "smart stuff" going on behind the scenes. I'm just confused as to why I have to write the UTF-16 BOM in my code in order to get the UTF-8 BOM in my file. I'm sure there is an explanation...
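In case it helps, here's a minimal sketch of what I suspect is going on (the explicit imbue is my assumption -- on my setup an equivalent facet may already come from the default locale): the wide stream runs every wchar_t through its locale's codecvt facet, which converts code points to UTF-8 bytes on the way out.

#include <fstream>
#include <locale>
#include <codecvt>   // deprecated since C++17, but fine for illustration

int main()
{
    std::wofstream dataout("bom_test.txt");
    // Convert wide characters to UTF-8 on output:
    dataout.imbue(std::locale(dataout.getloc(),
                              new std::codecvt_utf8<wchar_t>));

    dataout << L'\xFEFF';                     // one code point, U+FEFF
    // -> file bytes: ef bb bf (the UTF-8 BOM)

    dataout << L'\xEF' << L'\xBB' << L'\xBF'; // three code points: U+00EF U+00BB U+00BF
    // -> file bytes: c3 af c2 bb c2 bf ("ï»¿")
}

If that's right, then L'\xFEFF' in my source isn't really "the UTF-16 BOM" at all -- it's just the code point U+FEFF, and the facet encodes it into whatever the target encoding is, which would explain the bytes above.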
Anyway, is dataout << L'\xFEFF'; the preferred way to write the UTF-8 BOM using a wofstream?