How do I convert a Unicode number into a char?

Jun 5, 2022 at 5:47am
I want to convert a Unicode number into a char, the same way you can with an ASCII number. For example, I want to turn Unicode 931 into Σ.
Is there a function in the standard library to do this?
Jun 5, 2022 at 7:04am
The char data type is one byte (typically 8 bits) and is too small to store all possible Unicode characters. Most programs use UTF-8 (or UTF-16) to store Unicode characters. In UTF-8, each character takes 1, 2, 3 or 4 bytes. The ASCII characters are represented the same way in UTF-8 and only take up 1 byte. Characters such as Å, Ä and Ö require 2 bytes, Chinese characters often require 3 bytes, and many emoji characters require 4 bytes. It all depends on how large the Unicode number is.

Unfortunately, I don't think there are any standard C++ functions to convert a Unicode number to UTF-8, but it's not extremely difficult to write such a function yourself if you're familiar with low-level bit manipulation. Wikipedia has a good explanation of how the encoding works: https://en.wikipedia.org/wiki/UTF-8#Encoding (look at the table, it's very good for understanding how it works).
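
For what it's worth, here is a minimal sketch of such a function, following the bit layout in the table from the Wikipedia article. The name encode_utf8 is made up for illustration, and returning an empty string for invalid input (surrogates, values above 0x10FFFF) is just one possible design choice.

```cpp
#include <string>

// Hypothetical helper: encode a single Unicode code point as UTF-8.
// Returns an empty string if cp is not a valid Unicode scalar value.
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp <= 0x7F) {
        // 1 byte: 0xxxxxxx (plain ASCII)
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        // Surrogates (0xD800..0xDFFF) are reserved for UTF-16 and are not characters.
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return out;
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this, encode_utf8(931) gives the two bytes 0xCE 0xA3, which is Σ in UTF-8, so std::cout << encode_utf8(931) should show Σ on a terminal that expects UTF-8.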

When you have converted it to UTF-8 you are left with the problem of displaying it. This is usually not a problem because UTF-8 is typically the default encoding (except on Windows) so just outputting it with printf or cout usually works.

 
std::cout << "Σ";

This works fine for me on Linux because the source file is saved as UTF-8, so "Σ" is already UTF-8 encoded, and the terminal knows how to display it.

I'm sure there must be non-standard libraries that can do the conversion you want, but I don't know any, so unfortunately I cannot recommend one. Most GUI libraries (and other libraries that are able to draw text to a window) have no problem displaying text encoded in UTF-8, as long as the font used supports those characters.
Last edited on Jun 5, 2022 at 7:26am
Jun 5, 2022 at 2:44pm
Each character in the Unicode standard is unambiguously identified by a so-called code point. That's what you call a "Unicode number". There are many different ways to encode those code points as bytes. Probably the most widely used scheme is UTF-8, but there are also UTF-16 and UTF-32, as well as others. Meanwhile, in the C standard library, the type char is really just a "byte" type. The C standard library is mostly agnostic to the character encoding. It's not even guaranteed that char-based strings are Unicode at all; it's quite possible they use something like Latin-1.
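
To make the difference between a code point and its encodings a bit more concrete, here is a small sketch that counts how many code units the same code point needs in UTF-8, UTF-16 and UTF-32. It relies on the standard u8, u and U string literal prefixes; the helper name code_units is made up for this example.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// not counting the terminating zero.
template <class CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N])
{
    return N - 1;
}

// U+03A3 (Σ, decimal 931)
static_assert(code_units(u8"\u03A3") == 2, "UTF-8: two 8-bit code units");
static_assert(code_units(u"\u03A3") == 1, "UTF-16: one 16-bit code unit");
static_assert(code_units(U"\u03A3") == 1, "UTF-32: one 32-bit code unit");

// U+1F600 (an emoji outside the Basic Multilingual Plane)
static_assert(code_units(u8"\U0001F600") == 4, "UTF-8: four 8-bit code units");
static_assert(code_units(u"\U0001F600") == 2, "UTF-16: a surrogate pair");
static_assert(code_units(U"\U0001F600") == 1, "UTF-32: one 32-bit code unit");
```

The code point is the same in every case; only the byte representation changes with the encoding.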

On the Windows platform, you can use these functions to convert between character encodings:
https://docs.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-widechartomultibyte
https://docs.microsoft.com/en-us/windows/win32/api/stringapiset/nf-stringapiset-multibytetowidechar

On Linux you can use libiconv:
https://www.gnu.org/software/libiconv/
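
A rough sketch of what that could look like with the iconv interface (iconv_open, iconv, iconv_close). The helper name codepoint_to_utf8_iconv is made up, and this assumes the glibc-style signature where the input pointer is a char**; on Linux with glibc no extra library is needed, while other systems may need to link against libiconv.

```cpp
#include <iconv.h>
#include <string>

// Hypothetical helper: convert one code point to UTF-8 using iconv.
// Returns an empty string if the conversion fails.
std::string codepoint_to_utf8_iconv(char32_t cp)
{
    // Spell the code point out as little-endian UTF-32 bytes, so the
    // result does not depend on the byte order of the host.
    char in[4] = {
        static_cast<char>(cp & 0xFF),
        static_cast<char>((cp >> 8) & 0xFF),
        static_cast<char>((cp >> 16) & 0xFF),
        static_cast<char>((cp >> 24) & 0xFF)
    };
    char out[8] = {};

    // Note the argument order: target encoding first, then source encoding.
    iconv_t cd = iconv_open("UTF-8", "UTF-32LE");
    if (cd == (iconv_t)-1)
        return std::string();

    char* in_ptr = in;
    char* out_ptr = out;
    size_t in_left = sizeof in;
    size_t out_left = sizeof out;
    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return std::string();

    // iconv decrements out_left by the number of bytes it wrote.
    return std::string(out, sizeof out - out_left);
}
```

Calling codepoint_to_utf8_iconv(931) should give the same two bytes, 0xCE 0xA3, as encoding the code point by hand.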

__________

Example:
#include <windows.h>
#include <stdio.h>
#include <io.h>

int main()
{
    DWORD bytes_written;
    const wchar_t string_utf16[] = { 931, 0 }; // <-- UTF-16 string (don't forget the NUL terminator!)
    char buffer[256]; // <-- buffer for UTF-8 string

    // convert UTF-16 to UTF-8; the input length is -1, so the returned length includes the NUL terminator
    int length = WideCharToMultiByte(CP_UTF8, 0, string_utf16, -1, buffer, sizeof(buffer), NULL, NULL);
    if (length <= 0)
    {
        fprintf(stderr, "WideCharToMultiByte failed (error %lu)\n", GetLastError());
        return 1;
    }

    // dump UTF-8 bytes, including the terminating NUL (for informational purposes)
    for (int i = 0; i < length; ++i)
    {
        printf("0x%02X ", (BYTE)buffer[i]);
    }
    puts("");

    // print UTF-8 string to the console (length - 1 skips the NUL terminator)
    SetConsoleOutputCP(CP_UTF8);
    WriteConsoleA((HANDLE)_get_osfhandle(_fileno(stdout)), buffer, length - 1, &bytes_written, NULL);
}

Output:
https://i.imgur.com/5AiY2rg.png
Last edited on Jun 5, 2022 at 3:24pm