funny string to wchar_t conversion

Hi to all mighty rulers of the scary and strange world of C++. I need to complete a task exceeding my powers as a C++ warrior, so I am asking for your precious help.
I have the following issue to solve:

I have got a (char*) string like this: "010C00610075002000730076011B007400650021"
It is the hexadecimal representation of a UTF-16BE encoded string (in the real world it means "Čau světe!", but that is not important).
The issue is how to convert this string to wchar_t*. I need a platform-independent solution (or at least one that compiles under both Linux and Windows).

Any hint, piece of code or any kind of help from you, kings and mighty rulers of the C++ world, will be appreciated and paid back by writing "Thanks" under your post. ;)
There isn't a single bit of code that will convert the string correctly for both Windows and Linux as wchar_t is UTF-16LE on Windows and UTF-32 (don't know if endianness is important here?) on Linux.

In the Windows case, it should just be a case of swapping the byte pairs for the string to become a wchar_t, but I've not actually tried this. And I don't know how to convert from UTF-16 to UTF-32, apart from by using a library.
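
For what it's worth, going from UTF-16 to UTF-32 without a library is not much code. Below is a minimal sketch, assuming the code units are already in native byte order; the function name Utf16ToUtf32 is made up for illustration, and replacing unpaired surrogates with U+FFFD is just one possible policy.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-16 code units (native byte order) into UTF-32 code points.
// A valid high/low surrogate pair becomes one code point above U+FFFF;
// an unpaired surrogate is replaced with U+FFFD (an assumed policy).
std::u32string Utf16ToUtf32(const std::u16string& in)
{
    std::u32string out;
    out.reserve(in.size());

    for (std::size_t i = 0; i < in.size(); ++i)
    {
        char32_t unit = in[i];
        bool is_high = (unit >= 0xD800 && unit <= 0xDBFF);
        bool has_low = (i + 1 < in.size())
                    && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF;

        if (is_high && has_low)
        {
            char32_t low = in[i + 1];
            out.push_back(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00));
            ++i;  // the low surrogate has been consumed as well
        }
        else if (unit >= 0xD800 && unit <= 0xDFFF)
        {
            out.push_back(0xFFFD);  // unpaired surrogate
        }
        else
        {
            out.push_back(unit);    // ordinary BMP code unit
        }
    }
    return out;
}
```

On Linux the resulting values can be copied one by one into a wchar_t string, since wchar_t is 32 bits wide there.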

One commonly used string format/encoding conversion library is libiconv. It's from the Linux world, but there is a WIN32 port.
http://www.gnu.org/s/libiconv/
http://gnuwin32.sourceforge.net/packages/libiconv.htm
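
A rough sketch of what the iconv route might look like, assuming a POSIX-style iconv (as shipped with glibc) and raw UTF-16BE bytes as input rather than hex text. The wrapper name Utf16BeToUtf8 is hypothetical. Some platforms declare the input pointer as const char** or need -liconv at link time, so treat this as a starting point only.

```cpp
#include <cstddef>
#include <iconv.h>
#include <stdexcept>
#include <string>

// Convert raw UTF-16BE bytes to UTF-8 using iconv.
std::string Utf16BeToUtf8(const std::string& bytes)
{
    iconv_t cd = iconv_open("UTF-8", "UTF-16BE");  // (to, from)
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    // A 2-byte UTF-16 unit needs at most 3 bytes of UTF-8, so 2x is enough.
    std::string out(bytes.size() * 2 + 4, '\0');

    char* in_ptr = const_cast<char*>(bytes.data());
    std::size_t in_left = bytes.size();
    char* out_ptr = &out[0];
    std::size_t out_left = out.size();

    std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (std::size_t)-1)
        throw std::runtime_error("iconv conversion failed");

    out.resize(out.size() - out_left);  // trim the unused tail
    return out;
}
```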

As you are converting from UTF-16 anyway, you might want to consider using UTF-8 on Linux, as that appears to be the preferred encoding these days.

But Windows internal APIs are still all UTF-16, as far as I know, so I'd probably use UTF-16LE for that platform.

Do you actually have to use wchar_t in your code? Or could you work with a typedef?

If you don't care about surrogate pairs (codepoints over U+FFFF), this is remarkably easy. All you have to do is mind the endianness.

Here's a simple way to do it with strings:

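#include <string>

// Assumes 'original' points at raw UTF-16BE bytes (not hex text)
// and that the text contains no surrogate pairs.
std::wstring Convert(const char* original, unsigned size)
{
  // 'size' is the number of UTF-16BE code units (i.e. 'original' is actually 2*size chars)
  std::wstring out;
  out.reserve(size);

  for(unsigned i = 0; i < size; ++i, original += 2)
  {
    // High byte first (big-endian); the masks discard any sign extension from plain char.
    out.push_back( ((original[0] << 8) & 0xFF00) | (original[1] & 0x00FF) );
  }
  return out;
}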
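
Note that the string in the question is hex text rather than raw bytes, so the digits have to be decoded first. A minimal sketch of that step, assuming exactly four hex digits per UTF-16BE code unit and no surrogate pairs (ConvertHex and HexValue are names made up for illustration):

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Numeric value of a single hex digit; throws on anything else.
static unsigned HexValue(char c)
{
    if (c >= '0' && c <= '9') return static_cast<unsigned>(c - '0');
    if (c >= 'a' && c <= 'f') return static_cast<unsigned>(c - 'a' + 10);
    if (c >= 'A' && c <= 'F') return static_cast<unsigned>(c - 'A' + 10);
    throw std::invalid_argument("not a hex digit");
}

// Convert hex text such as "010C0061" (UTF-16BE, 4 digits per code unit)
// into a wide string. Code points over U+FFFF are not handled.
std::wstring ConvertHex(const std::string& hex)
{
    if (hex.size() % 4 != 0)
        throw std::invalid_argument("length must be a multiple of 4");

    std::wstring out;
    out.reserve(hex.size() / 4);

    for (std::size_t i = 0; i < hex.size(); i += 4)
    {
        unsigned unit = 0;
        for (std::size_t j = 0; j < 4; ++j)
            unit = (unit << 4) | HexValue(hex[i + j]);  // most significant digit first
        out.push_back(static_cast<wchar_t>(unit));
    }
    return out;
}
```

Because the digits are read most significant first, the big-endian order of the source is handled by the parsing itself and no byte swapping is needed on either platform.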
Thank you guys for answers.

2Dish:
If you don't care about surrogate pairs...

I don't know what characters are represented by codepoints over U+FFFF, so I don't know if I care about them. I need to correctly represent at least all Central European characters (like ščřžáéąęłżźöü... and some others). Do you know if they have codepoints under U+FFFF?

2andywestken:
One commonly used string format/encoding conversion library is libiconv.

I will definitely think about using the library.

Do you actually have to use wchar_t in your code?

No, I don't. I can probably rewrite parts of the code to use another string type. But after the conversion I need to write this hex UTF-16BE string to a text file (I use wofstream, so wchar_t seemed good for it).
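
One option for the file-writing part, sketched under assumptions: wofstream output depends on the stream's locale, so an alternative is to encode the wide string as UTF-8 by hand and write the bytes with a plain ofstream. ToUtf8 and SaveUtf8 are hypothetical names, and each wchar_t is treated as one whole code point, which holds for text without surrogate pairs.

```cpp
#include <fstream>
#include <string>

// Encode a wide string as UTF-8, treating each wchar_t as one code point.
std::string ToUtf8(const std::wstring& text)
{
    std::string out;
    for (wchar_t wc : text)
    {
        unsigned long cp = static_cast<unsigned long>(wc);
        if (cp < 0x80)
        {
            out.push_back(static_cast<char>(cp));
        }
        else if (cp < 0x800)
        {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
        else if (cp < 0x10000)
        {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
        else
        {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}

// Write the text to 'path' as UTF-8 bytes; returns false if the write failed.
bool SaveUtf8(const char* path, const std::wstring& text)
{
    std::ofstream file(path, std::ios::binary);  // binary: no newline translation
    const std::string bytes = ToUtf8(text);
    file.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    return file.good();
}
```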
I don't know what characters are represented by codepoints over U+FFFF, so I don't know if I care about them. I need to correctly represent at least all Central European characters (like ščřžáéąęłżźöü... and some others). Do you know if they have codepoints under U+FFFF?


Yes. All Latin characters are under U+FFFF, in what Unicode calls the Basic Multilingual Plane. The codepoints over U+FFFF are all pretty obscure: mostly historic scripts, rare CJK ideographs, and symbols such as emoji.
Great, thanks a lot.
For now I will try Dish's solution, and if I ever need other characters too or have any problem with it, then I will use the library andywestken posted.