This is a weird subject that is fraught with misunderstanding. Please forgive me for not directly answering your question; instead I'll give you a broader picture of the problem, since it appears to me that you do not fully understand it yet.
People tend to think that if they change all their character types to wchar_t, then their program will magically support Unicode. That's not quite how it works. wchar_t is not Unicode; it's just a wide character type. This is a common misconception, fueled by the way WinAPI uses the somewhat poorly named 'UNICODE' macro to switch the TCHAR typedef between wchar_t and char.
So let's start by defining what Unicode actually is:
Unicode is a system which maps glyphs to a unique numerical identifier (aka, a "code point"). That's it. You can think of it as a giant lookup table... where you give it a code point, and it gives you back a glyph -- or vice versa.
Examples:
glyph = Unicode codepoint
--------------------
a = U+0061
ɻ = U+027B
স = U+09B8
𠀱 = U+20031
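If it helps to see that "a code point is just a number" in actual code, here's a tiny sketch (my own, not anything from your program) that stores the code points above in char32_t variables -- a type big enough to hold any code point -- and prints their values:

#include <cstdio>

int main()
{
    // Each of these is literally just a number -- the code point.
    char32_t a   = U'a';            // U+0061
    char32_t r   = U'\u027B';       // ɻ
    char32_t s   = U'\u09B8';       // স
    char32_t cjk = U'\U00020031';   // 𠀱 -- above U+FFFF, so it needs the 8-digit \U form

    std::printf("U+%04X U+%04X U+%04X U+%04X\n",
                (unsigned)a, (unsigned)r, (unsigned)s, (unsigned)cjk);
    // prints: U+0061 U+027B U+09B8 U+20031
}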
There is conceptually no limit to the number of code points that could exist, though in practice Unicode stops at U+10FFFF, so any legal codepoint fits comfortably in a 32-bit integer. However... a 16-bit integer (like a wchar_t on Windows) is too small, as some codepoints are above U+FFFF (even though those are comparatively rarely used).
The next thing you have to understand is the encoding. Just saying "I have Unicode text" isn't specific enough... Unicode characters can be represented several different ways. Some of the most common are:
UTF-8, where each codepoint is represented by 1 to 4 single-byte units
UTF-16, where each codepoint is represented by 1 or 2 two-byte units
UTF-32, where each codepoint is represented by exactly 1 four-byte unit.
To get even more specific... UTF-16 and UTF-32 need to have their endianness specified... since on disk, a multi-byte value can be represented in either big or little endian. So you could say that the encodings are really:
UTF-8
UTF-16LE (little endian)
UTF-16BE (big endian)
UTF-32LE
UTF-32BE
UTF-8 doesn't need to concern itself with endianness because its units are only 1 byte wide.
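As a quick illustration of what "LE" and "BE" mean on disk (my own sketch, not anything from a library), here is how one 16-bit UTF-16 unit gets split into bytes in each order:

#include <cstdio>

int main()
{
    unsigned unit = 0x0061;   // one UTF-16 code unit (the letter 'a')

    // Little endian: least significant byte first.
    unsigned char le[2] = { (unsigned char)(unit & 0xFF), (unsigned char)(unit >> 8) };
    // Big endian: most significant byte first.
    unsigned char be[2] = { (unsigned char)(unit >> 8), (unsigned char)(unit & 0xFF) };

    std::printf("LE: %02X %02X   BE: %02X %02X\n", le[0], le[1], be[0], be[1]);
    // prints: LE: 61 00   BE: 00 61
}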
Let's take a look at a simple example of how each encoding represents each codepoint. Let's start with an easy one... U+0061 (a):
UTF-8: 61 (0x61)
UTF-16LE: 61 00 (0x0061)
UTF-16BE: 00 61 (0x0061)
UTF-32LE: 61 00 00 00 (0x00000061)
UTF-32BE: 00 00 00 61 (0x00000061)
Pretty simple. This codepoint can be represented in 1 unit. The only difference between the encodings is how many bytes there are per unit, and the order in which those bytes are written.
Now let's look at a more complex one.... U+027B (ɻ)
UTF-8: C9 BB (0xC9, 0xBB)
UTF-16LE: 7B 02 (0x027B)
UTF-16BE: 02 7B (0x027B)
UTF-32LE: 7B 02 00 00 (0x0000027B)
UTF-32BE: 00 00 02 7B (0x0000027B)
This codepoint is too large to fit in a single byte, so UTF-8 must use 2 units to represent it. UTF-16 and UTF-32, on the other hand, still handle it in a single unit.
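If you want to see where those UTF-8 bytes come from, here's a rough sketch of the packing rules (my own code, with no error handling for invalid code points). Feeding it U+027B produces the C9 BB above, and U+20031 produces the F0 A0 80 B1 you'll see in the next example:

#include <string>

// Rough sketch: encode one code point (0..0x10FFFF) as UTF-8 and append it to 'out'.
void append_utf8(std::string& out, char32_t cp)
{
    if (cp <= 0x7F) {                       // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {              // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}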
Now a big bad boy... U+20031 (𠀱)
UTF-8: F0 A0 80 B1 (0xF0, 0xA0, 0x80, 0xB1)
UTF-16LE: 40 D8 31 DC (0xD840, 0xDC31)
UTF-16BE: D8 40 DC 31 (0xD840, 0xDC31)
UTF-32LE: 31 00 02 00 (0x00020031)
UTF-32BE: 00 02 00 31 (0x00020031)
As you can see, this is too big to fit in a single 16-bit unit... so UTF-16 has to spread it out across two of them (a "surrogate pair"). Meanwhile, UTF-8 takes 4 units to represent it.
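The arithmetic behind a surrogate pair is simple enough to show in a few lines (again, my own sketch):

#include <cstdio>

int main()
{
    unsigned cp = 0x20031;              // a code point above U+FFFF
    unsigned v  = cp - 0x10000;         // leaves a 20-bit value
    unsigned hi = 0xD800 + (v >> 10);   // high surrogate: top 10 bits
    unsigned lo = 0xDC00 + (v & 0x3FF); // low surrogate: bottom 10 bits

    std::printf("%04X %04X\n", hi, lo); // prints: D840 DC31
}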
So what does this mean for your problem?
A few things.
#1 - You need to decide on how you want your data encoded. UTF-8 is common because it is compact for English text, but can be clunky to work with if you are doing text editing, since codepoints are variable length. UTF-32 is the opposite: easy to work with because everything is fixed length, but it uses a lot of space.
#2 - You don't need to use wchar_t's to support Unicode if you don't want to. You can do it with normal chars. As long as all relevant code treats your string data as if it were UTF-8 encoded, you'll be fine (there's a sketch of this approach a bit further down).
#3 - The only time (afaik) that you need to use wchar_t's for Unicode text is when communicating with WinAPI... as it will treat wide strings as UTF-16 encoded strings, but will not treat char strings as UTF-8.
#4 - Just because WinAPI treats wchar_t strings as UTF-16 does not mean other libraries (like STL's wofstream) do. In fact... if memory serves, wofstream will actually try to 'narrow' the string you pass it before using it.
So if wide characters are not necessarily Unicode... and if wofstream narrows the strings you give it... then what good is wofstream?
Good question. I still don't know. But I know wofstream is weird and problematic enough that I avoid it entirely.
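To make #2 and #3 concrete, here's a rough sketch of the approach I'm describing: keep the text as UTF-8 in a plain std::string, write it with a plain ofstream, and only convert to UTF-16 (wchar_t) at the moment you actually hand it to WinAPI. The file name and the message box are just placeholders:

#include <fstream>
#include <string>
#include <windows.h>   // only needed for the WinAPI boundary

int main()
{
    // "aɻ𠀱" stored as raw UTF-8 bytes in an ordinary std::string
    // (the byte values come straight from the tables above).
    std::string text = "a\xC9\xBB\xF0\xA0\x80\xB1";

    // Writing it to disk needs nothing special -- it's just bytes.
    std::ofstream file("out.txt", std::ios::binary);
    file.write(text.data(), (std::streamsize)text.size());

    // Only at the WinAPI boundary do we convert UTF-8 -> UTF-16.
    int len = MultiByteToWideChar(CP_UTF8, 0, text.data(), (int)text.size(), nullptr, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, text.data(), (int)text.size(), &wide[0], len);

    MessageBoxW(nullptr, wide.c_str(), L"UTF-16 only at the boundary", MB_OK);
}

Everything except the last three lines is portable; the conversion is the only Windows-specific part.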
So as for your actual questions:
Now my question is, should I just be writing out the Character field via std::wofstream and std::ofstream separately and just finish off with the remaining fields with std::ofstream or write the whole object out in one write?
I would get rid of wofstream entirely. It does not help you at all in this endeavour.
Does wchar_t really matter with numeric data (i.e. float, int, double)?
wchar_t is a character type. When you pass your struct to ofstream::write, you are giving it a pointer to binary data and saying "treat this data as an array of bytes". It'll just blindly and faithfully write those bytes to disk.
When you give it to wofstream::write, you are saying "treat this data as an array of wchar_ts"... which are larger than a byte. Which makes things weirder.
Either way, you are not dealing with string data, so the cast is somewhat erroneous. Though ofstream (the non-wide one) will be more faithful, since it won't mess with the data.
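Here's what that looks like in practice. I'm guessing at your struct's layout (the Character buffer size and the other field names are made up), but the point is that one binary write of the whole object just copies sizeof(Record) raw bytes:

#include <fstream>

// Hypothetical layout -- only 'Character' comes from your description,
// the rest is made up for illustration.
struct Record
{
    char  Character[32];   // UTF-8 text in a fixed-size buffer
    int   Count;
    float Value;
};

int main()
{
    Record r = { "a\xC9\xBB", 2, 1.5f };   // "aɻ" as UTF-8 bytes

    std::ofstream file("records.bin", std::ios::binary);
    // One write for the whole object: the stream blindly copies sizeof(Record) bytes.
    file.write(reinterpret_cast<const char*>(&r), sizeof r);
}

Keep in mind the compiler may insert padding between the fields, so the exact file layout is compiler-specific; a hex editor (as suggested below) will show you exactly what ended up on disk.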
I went on and on about Unicode in this post... but if you want to know more about binary files... I strongly recommend you get a Hex Editor (a good free one is HxD) and actually look at the files you are creating to see if they match what you expect. I've also written some articles on writing binary files. Links below.
http://www.cplusplus.com/articles/DzywvCM9/
http://www.cplusplus.com/articles/oyhv0pDG/