I am still working on my project, which reads some old data from old DOS files. The data stored there is, naturally, char*. Once I read in my character array, how do I assign it to a wstring, since my application is UNICODE?
Here is my current solution:
const wchar_t* Class::Function(char* pName)
{
    // I verify the pointer and such first, then do the below
    this->_Name.assign(pName, pName + strlen(pName));
    return this->_Name.c_str();
}
1) How is the source data encoded? (UTF-8 seems unlikely if these are old DOS files... but I guess it depends on HOW old.)
2) Do you care about preserving anything beyond the ASCII set?
3) How is the destination data to be encoded (I would assume UTF-16)? Given that you said "UNICODE", I'm assuming this is on Windows and you want UTF-16.
WinAPI provides a function (MultiByteToWideChar) which can convert pretty much any codepage, as well as UTF-8, to UTF-16.
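For example, a minimal sketch of that call (the codepage is an assumption on my part; old DOS files are often codepage 437, but use whatever yours were actually written in):

#include <windows.h>
#include <string>

// Convert a null-terminated narrow string in the given codepage to UTF-16.
// 437 (DOS Latin US) is only a guess at the source encoding.
std::wstring ToUtf16(const char* src, UINT codepage = 437)
{
    // First call asks for the required length, including the null terminator.
    int len = MultiByteToWideChar(codepage, 0, src, -1, nullptr, 0);
    if (len <= 0)
        return std::wstring();
    std::wstring result(len, L'\0');
    MultiByteToWideChar(codepage, 0, src, -1, &result[0], len);
    result.resize(len - 1); // drop the terminating null from the wstring
    return result;
}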
Or... if you don't care and all you care about is the basic ASCII set... it's a straight 1:1 copy (just converting a 1 byte character to a 2+ byte character):
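Something like this (a minimal sketch; the byte-by-byte widening is only correct for 7-bit ASCII input):

#include <cstring>
#include <string>

// Widen a plain-ASCII char string into a wstring, one byte per character.
// Bytes >= 0x80 would need a real codepage conversion instead.
std::wstring WidenAscii(const char* src)
{
    return std::wstring(src, src + std::strlen(src));
}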
Yes, I am on Windows using wchar_t, which is 16-bit, but I will also be using this on Linux, where wchar_t is 32-bit. The good news? I am only READING the data on both systems, so there will be no new writing at this point. If I decide to write data and share it, I will write it as 16-bit for Windows and easily read that into the Linux equivalent.
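If it comes to that, the Linux-side read could look something like this (a sketch only, assuming the shared data is 16-bit code units that stay within the Basic Multilingual Plane, which holds for ASCII content; FromUtf16 is a name made up for illustration):

#include <cstdint>
#include <string>
#include <vector>

// Widen 16-bit code units into Linux's 32-bit wchar_t string.
// Assumes no surrogate pairs, i.e. every unit is a whole character.
std::wstring FromUtf16(const std::vector<std::uint16_t>& units)
{
    return std::wstring(units.begin(), units.end());
}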
Now, it is old DOS, not UTF-8. I am assuming old ASCII only, and I do not plan on converting it to any other languages at this point. That would be a project in itself, so the ASCII set is fine with me. I can jump any conversion hurdles down the road.
Finally, it looks as though my call to the "assign" method is correct? I am asking because this will be a library, and as such I will not be able to test it until the program which uses this library has also reached a certain stage of development.
Finally, it looks as though my call to the "assign" method is correct?
Yes.
Though I would question your use of wchar_t in this library, as chars are typically much easier to work with. Case in point... you just mentioned that wchar_t is different sizes on different platforms... which makes it more difficult to write portable code.
Yes, but everything is UNICODE now. Plus it will allow me to target other languages and such in the future if I need to. Just because it is easy doesn't mean it is right, after all. I honestly don't know anybody not coding UNICODE apps anymore, and I enjoy learning as I go with the new UNICODE stuff.
I know you can use UTF-8 with a char array, but I have never come across a UTF-8 file that wasn't using some other form of storage. Most of the time it is wchar_t. This particular file was out about the same time Windows 95 was released, so no worries of UTF-8 there. If it was a Linux system, maybe. Linux always seems to be ahead of the game.
I know you can use UTF-8 with a char array, but I have never come across a UTF-8 file that wasn't using some other form of storage.
Different strokes I guess. I've found that UTF-8 is more common pretty much everywhere. Especially in places that had to be 'upgraded' while still maintaining backward compatibility (switching from ANSI to UTF-8 is much easier than switching to UTF-16).
The POSIX API is an example... none of it, from what I've seen, uses UTF-16. Nor does any kind of binary file with embedded comments (zip, png, etc.). In fact, the only place I can think of where I've seen UTF-16 in widespread use is in WinAPI.
But whatever... it's your code and you can do what you want. =) Don't let me bully you.
I just wanted to answer Cubbi. I have to read the file into a char array because the file is binary. It contains 3D object data as well as ANSI names for said objects.
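In case it helps, that pattern could be sketched like this (the helper names and the offset are placeholders for illustration; the real layout depends on the file format):

#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Slurp an entire binary file into a byte buffer.
std::vector<char> ReadAll(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    return std::vector<char>(std::istreambuf_iterator<char>(in),
                             std::istreambuf_iterator<char>{});
}

// Pull a null-terminated ANSI name out of the buffer at a known offset,
// then widen it (ASCII assumption, as above).
std::wstring NameAt(const std::vector<char>& buf, std::size_t offset)
{
    std::string name(&buf[offset]); // reads up to the embedded '\0'
    return std::wstring(name.begin(), name.end());
}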