how do i convert a unicode number into a char

I want to convert a Unicode number into a char, like you can with an ASCII number. For example, I want to turn Unicode 931 into Σ.
Is there a function in the standard library to do this?

Peter87 (11256)

The char data type is typically one byte (8 bits), which is too small to store all possible Unicode characters. Most programs use UTF-8 (or UTF-16) to store Unicode characters. In UTF-8, each character takes 1, 2, 3 or 4 bytes. ASCII characters are represented the same way in UTF-8 as in ASCII and take only 1 byte. Characters such as Å, Ä and Ö require 2 bytes, Chinese characters often require 3 bytes, and many emoji require 4 bytes. It all depends on how large the Unicode number is.
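
To make the size ranges concrete, here is a minimal sketch that maps a Unicode number to the number of bytes UTF-8 needs for it. The function name utf8_length is made up for illustration, and returning 0 for invalid values is just one possible convention.

```cpp
#include <cstddef>

// Hypothetical helper: number of bytes UTF-8 needs for one Unicode number.
// Returns 0 for values that cannot be encoded (surrogates, or > 0x10FFFF).
std::size_t utf8_length(char32_t cp)
{
    if (cp <= 0x7F) return 1;                    // ASCII range
    if (cp <= 0x7FF) return 2;                   // e.g. Å, Ä, Ö, Σ
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  // UTF-16 surrogates are not characters
    if (cp <= 0xFFFF) return 3;                  // e.g. most Chinese characters
    if (cp <= 0x10FFFF) return 4;                // e.g. many emoji
    return 0;                                    // beyond the Unicode range
}
```

For example, utf8_length(931) gives 2 for Σ, and utf8_length(0x1F600) gives 4 for an emoji.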

Unfortunately, I don't think there are any standard C++ functions to convert a Unicode number to UTF-8, but it's not very difficult to write such a function yourself if you're familiar with low-level bit manipulation. Wikipedia has a good explanation of how the encoding works: https://en.wikipedia.org/wiki/UTF-8#Encoding (look at the table; it's very helpful for understanding how it works).
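
A hand-written conversion following that table could look roughly like the sketch below. The name encode_utf8 is made up for illustration, and returning an empty string for values that cannot be encoded is an assumption about error handling, not the only reasonable choice.

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode number as a UTF-8 byte string.
// Returns an empty string for surrogates and values above 0x10FFFF.
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp <= 0x7F) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return out; // surrogates are not valid characters
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this, std::cout << encode_utf8(931); writes the two bytes 0xCE 0xA3, which a UTF-8 terminal shows as Σ.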

Once you have converted it to UTF-8, you still need to display it. This is usually not a problem, because UTF-8 is typically the default encoding (except on Windows), so just outputting it with printf or cout usually works.

std::cout << "Σ";

This works fine for me on Linux because "Σ" is already UTF-8 encoded and the terminal knows how to display it.

I'm sure there must be non-standard libraries that can do the conversion you want, but I don't know any, so unfortunately I cannot recommend anything. Most GUI libraries (and other libraries that are able to draw text to a window) have no problem displaying text encoded in UTF-8, as long as the font used supports those characters.

kigar64551 (842)

Each character in the Unicode standard is unambiguously identified by a so-called code point. That's what you call a "Unicode number". There are many different ways in which those code points can be encoded as bytes. Probably the most widely used scheme is UTF-8, but there are also UTF-16 and UTF-32, as well as others. Meanwhile, in the C standard library, the type char is really just a "byte" type. The C standard library is mostly agnostic to the character encoding in use. It's not even guaranteed that char-based strings are Unicode at all; it's quite possible that they use something like Latin-1.
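
As a small illustration of how one code point turns into a different number of code units depending on the encoding, the sketch below counts the units the compiler produces for prefixed string literals. The variable names are made up for illustration. It assumes a C++11 or later compiler.

```cpp
#include <cstddef>
#include <string>

// Code units needed for U+03A3 (Σ, decimal 931) in each encoding.
// sizeof on a string literal includes the null terminator, hence the "- 1".
const std::size_t sigma_utf8_units  = sizeof(u8"\u03A3") - 1;           // 8-bit units
const std::size_t sigma_utf16_units = std::u16string(u"\u03A3").size(); // 16-bit units
const std::size_t sigma_utf32_units = std::u32string(U"\u03A3").size(); // 32-bit units

// The same for U+1F600 (an emoji), which lies outside the 16-bit range.
const std::size_t emoji_utf8_units  = sizeof(u8"\U0001F600") - 1;
const std::size_t emoji_utf16_units = std::u16string(u"\U0001F600").size();
const std::size_t emoji_utf32_units = std::u32string(U"\U0001F600").size();
```

Σ takes 2, 1 and 1 units respectively, while the emoji takes 4, 2 (a surrogate pair) and 1.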

On the Windows platform, you can use these functions to convert between character encodings:
https://docs.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-widechartomultibyte
https://docs.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-multibytetowidechar

On Linux you can use libiconv:
https://www.gnu.org/software/libiconv/
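
A rough sketch of the iconv route, assuming the POSIX iconv interface as shipped with glibc (where the input pointer is a char**; some other implementations declare it as const char**). The function name code_point_to_utf8_iconv and the empty-string error convention are made up for illustration.

```cpp
#include <iconv.h>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: converts one code point to UTF-8 using iconv.
// Returns an empty string if the conversion fails.
std::string code_point_to_utf8_iconv(char32_t code_point)
{
    // Choose the UTF-32 variant that matches this machine's byte order.
    const std::uint16_t probe = 1;
    const bool little_endian =
        *reinterpret_cast<const unsigned char*>(&probe) == 1;

    iconv_t cd = iconv_open("UTF-8", little_endian ? "UTF-32LE" : "UTF-32BE");
    if (cd == (iconv_t)-1)
        return std::string();

    char out[8] = {};
    char* in_ptr = reinterpret_cast<char*>(&code_point);
    char* out_ptr = out;
    std::size_t in_left = sizeof(code_point); // bytes of input remaining
    std::size_t out_left = sizeof(out);       // bytes of output space remaining

    const std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == static_cast<std::size_t>(-1))
        return std::string();

    // iconv advanced out_ptr; the bytes written are the space it used up.
    return std::string(out, sizeof(out) - out_left);
}
```

On glibc this needs no extra linker flag; with a separately installed libiconv you may have to link with -liconv.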

__________

Example:

#include <windows.h>
#include <stdio.h>
#include <io.h>

int main()
{
    DWORD bytes_written;
    const wchar_t string_utf16[] = { 931, 0 }; // <-- UTF-16 string (don't forget the null terminator!)
    char buffer[256]; // <-- buffer for UTF-8 string

    // convert UTF-16 to UTF-8; because the input length is -1, the returned
    // byte count includes the terminating null byte
    int length = WideCharToMultiByte(CP_UTF8, 0, string_utf16, -1, buffer, 256, NULL, NULL);
    if (length <= 0)
    {
        /* handle error! */
        return 1;
    }

    // dump UTF-8 bytes, including the terminator (for informational purposes)
    for (int i = 0; i < length; ++i)
    {
        printf("0x%02X ", (BYTE)buffer[i]);
    }
    puts("");

    // print UTF-8 string to the console (without the terminating null byte)
    SetConsoleOutputCP(CP_UTF8);
    WriteConsoleA((HANDLE)_get_osfhandle(_fileno(stdout)), buffer, length - 1, &bytes_written, NULL);
    return 0;
}

Output:
https://i.imgur.com/5AiY2rg.png
