"Using the multi-byte character set"
As far as I know, that setting does absolutely nothing apart from #defining the UNICODE and/or _UNICODE macros for the preprocessor. And defining UNICODE only changes what TCHARs expand to. But you're probably not using TCHARs anyway, because you're smart, and TCHARs are an obsolete mess best avoided. So I would not worry about this setting.
"Now, out of all of that, if I want to use Unicode, what variable types/typedefs will I need to use in place of each of those? wchar_t was mentioned in place of char, but what about the string? Will I still be able to use those, or will something else be needed?"
Without getting too technical about what Unicode is vs. encoding formats, I will say that you definitely will want to use Unicode. You have two general options:
1) Use UTF-16, where each "character" is 2 bytes (usually). On Windows this means using wchar_t and std::wstring for chars/strings.
2) Use UTF-8, where each "character" is 1 byte minimum, but characters outside the basic ASCII set are represented by multiple bytes. For example, the ą character mentioned before is represented as 2 bytes (chars) in sequence: 0xC4, 0x85.
Each has its pros and cons. The biggest downside to UTF-8 is that it can get confusing if you're going to be working on individual characters like you are. For example, if you want to replace 'a' with 'ą', you will actually increase the size of the string, because 'a' is represented in 1 char, whereas 'ą' needs 2 chars. Some characters even need 3 or 4 chars.
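A minimal sketch of that growth, writing the UTF-8 bytes out by hand with escapes (so the source file's own encoding can't interfere):

#include <iostream>
#include <string>

int main()
{
    std::string a = "a";              // 1 character, 1 byte
    std::string aogonek = "\xC4\x85"; // 'ą' in UTF-8: 1 character, 2 bytes

    std::cout << a.size() << "\n";       // prints 1
    std::cout << aogonek.size() << "\n"; // prints 2
}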
With UTF-16, pretty much everything (aside from rarely used glyphs outside the Basic Multilingual Plane, which take two wchar_ts as a surrogate pair) can be represented in a single 16-bit wchar_t. Also, WinAPI functions naturally accept wchar_t strings and interpret them as UTF-16.
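For comparison, the same characters in UTF-16 (a sketch using the \u escape so the file's encoding doesn't matter):

#include <iostream>
#include <string>

int main()
{
    std::wstring w = L"a\u0105";   // 'a' followed by 'ą' (U+0105)

    std::cout << w.size() << "\n"; // prints 2: one wchar_t per character
    if (w[1] == 0x0105)            // comparing a single character just works
        std::cout << "match\n";
}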
So it sounds like, for you, UTF-16 (wchar_t, wstring) would be easier to work with.
-----------
If you want to use UTF-16, you must use wchar_t's, wstrings, and "wide" literals, i.e.:

L"foo"; // <- this; the 'L' prefix makes it wide (wchar_t)
"foo";  // vs. this, which is narrow (char)
And you must call the 'W' version of WinAPI functions to indicate you have UTF-16 strings, e.g. SetWindowTextW instead of SetWindowText. You'll want the 'W' version of any structs, too, such as OPENFILENAMEW, etc.
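A minimal sketch of that pattern; the hwnd here is assumed to be a valid window handle you got from CreateWindowExW or wherever:

#include <windows.h>

void SetTitle(HWND hwnd)
{
    // Explicit 'W' call + wide literal: UTF-16 end to end, and it works
    // the same whether or not the UNICODE macro is defined.
    SetWindowTextW(hwnd, L"My window \u0105");
}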
--------------
If you want to use UTF-8, you'll have some extra work, because I don't think you can give UTF-8 strings to WinAPI. I thought you could if you set the codepage to UTF-8, but after checking MSDN, I don't see any way to do that (except for the console).
This means you'll have to expand UTF-8 to UTF-16 with MultiByteToWideChar (http://msdn.microsoft.com/en-us/library/windows/desktop/dd319072%28v=vs.85%29.aspx), then pass the UTF-16 string to WinAPI as described above. In light of this, using UTF-16 for everything seems more and more like the way to go.
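If you do end up needing the conversion, it's the usual two-call pattern. A sketch (the helper name Utf8ToUtf16 is just something I made up):

#include <windows.h>
#include <string>

std::wstring Utf8ToUtf16(const std::string& utf8)
{
    if (utf8.empty())
        return std::wstring();

    // First call with no output buffer asks how many wchar_ts are needed.
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                  (int)utf8.size(), NULL, 0);
    if (len == 0)
        return std::wstring(); // invalid UTF-8 input

    std::wstring wide(len, L'\0');
    // Second call performs the actual conversion into our buffer.
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                        (int)utf8.size(), &wide[0], len);
    return wide;
}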
---------
I'm not sure this will work even with wide strings, as it depends greatly on how your text editor saves the .cpp file, and how the compiler decides to interpret that string.
C++11 did add some Unicode support: the u8/u/U prefixes for UTF-8/UTF-16/UTF-32 string literals, plus the char16_t and char32_t types (and u'...' / U'...' character literals to go with them). Even so, a non-ASCII character typed directly into the source is still subject to how your editor encodes the file; the \u escapes sidestep that.
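For what it's worth, a quick sketch of those C++11 literals (needs a C++11-capable compiler):

#include <string>

int main()
{
    const char*     u8s = u8"a\u0105"; // UTF-8, regardless of file encoding
                                       // (note: in C++20 this becomes char8_t)
    const char16_t* u16 = u"a\u0105";  // UTF-16
    const char32_t* u32 = U"a\u0105";  // UTF-32
    char16_t        c   = u'\u0105';   // single UTF-16 code unit

    (void)u8s; (void)u16; (void)u32; (void)c; // silence unused warnings
}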
The only way I know of to make this foolproof is to use the raw U+ code point. For example, 'Ǽ' is designated U+01FC (take a look in the Windows CharMap program and it'll display all that stuff), which means that instead of doing this:
if( pInput[i] == 'Ǽ' ) // <- which probably won't work
You could do this:
if( pInput[i] == 0x01FC ) // which will definitely work as long as pInput is UTF-16
Of course that's not as easy to read... or write....
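One way to claw some readability back (just a naming convention, nothing required): give the code point a name once, then compare against that. Continuing the pInput fragment above:

const wchar_t AE_ACUTE = 0x01FC; // 'Ǽ' (U+01FC), looked up in CharMap

if( pInput[i] == AE_ACUTE ) // reads better, behaves identically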