About standard strings in Linux?

I'm new to Linux development; so far I've been mostly writing Win32 applications, where `char` is `ANSI` and `wchar_t` is Unicode.

But I see that under Linux the default is `UTF-8`, and I also hear that `wchar_t` is not used.

So my question is: is `char` in Linux the same thing as `char` in Windows?
I mean, the same size and used for ANSI strings?

Or is `char` implicitly UTF-8 in Linux?
If not, then how exactly do you write applications in Linux to handle `UTF-8` strings?
Do I need to use char16_t in Linux?

---

Bonus question: if I need to use char16_t, then what methods exist to convert from ANSI to UTF-8 and vice versa?
Take a look at this:

https://www.man7.org/linux/man-pages/man7/UTF-8.7.html

Yes: char is used for UTF-8. Note that UTF-8 is a multi-byte encoding, i.e. one character may occupy up to 4 bytes within the string.
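For example (a minimal sketch, assuming the compiler's execution character set is UTF-8, which is the usual default for GCC/Clang on Linux):

#include <cstring>
#include <iostream>

int main()
{
    // One "character" can occupy 1 to 4 bytes in UTF-8:
    // "A" is 1 byte, "ä" is 2 bytes, "€" is 3 bytes, "🍌" is 4 bytes.
    std::cout << std::strlen("A") << ' '
              << std::strlen("ä") << ' '
              << std::strlen("€") << ' '
              << std::strlen("🍌") << '\n';   // prints: 1 2 3 4
}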

To cope with these multi-byte sequences you may use a library like this:

https://github.com/nemtrif/utfcpp

To convert UTF-8 to other Unicode formats you may use this:

https://en.cppreference.com/w/cpp/locale/codecvt_utf8

'ANSI' does not really exist outside the Windows world...
So my question is: is `char` in Linux the same thing as `char` in Windows?

char is the same size, yes: 8 bits.

I mean, the same size and used for ANSI strings?

By ANSI you mean the same as ASCII?

UTF-8 is compatible with 7-bit ASCII in the sense that ASCII text is valid UTF-8, so a program that assumes text is UTF-8 will automatically be able to handle ASCII.

The opposite, passing a UTF-8 string as input to a program that assumes ASCII, can still often work, as long as the program doesn't reject char values that use more than 7 bits or try to display the string. Counting the characters, and categorizing characters into whitespace, letters, symbols, etc., are things that could go wrong.
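For example, here is a minimal sketch (assuming the default "C" locale) of the kind of thing that can go wrong when byte-oriented code meets UTF-8:

#include <cctype>
#include <cstring>
#include <iostream>

int main()
{
    const char text[] = "Åland";             // 'Å' takes 2 bytes in UTF-8

    std::cout << std::strlen(text) << '\n';  // 6 bytes, even though there are 5 characters

    // Byte-wise classification only understands ASCII in the "C" locale, so the
    // two bytes of 'Å' are not counted as a letter. Note the cast to unsigned char:
    // passing a negative char value to isalpha() is undefined behaviour.
    int letters = 0;
    for (const char* p = text; *p != '\0'; ++p) {
        if (std::isalpha(static_cast<unsigned char>(*p))) {
            ++letters;
        }
    }
    std::cout << letters << '\n';            // 4, not 5
}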

Or is `char` implicitly UTF-8 in Linux?

You could obviously use it to store encodings other than UTF-8, even on Linux (some libraries might have special functions for different encodings), but in general, when dealing with the file system or printing something to the console etc., it will be treated as UTF-8.

This typically "just works" on Linux:
 
std::cout << "Hallå! Trevligt att träffas!\n";
You could even print Chinese characters and such, but to display them properly the font that is used also needs to support them. I think this is generally less of a problem nowadays than it used to be, unless you use custom fonts.

how exactly do you write applications in Linux to handle `UTF-8` strings?

Usually you don't need to do anything special. It's the default. Just be aware that one char does not necessarily mean one character; strlen and std::string::length give you the number of char values (not characters).

std::string str = "ÅÄÖ";
std::cout << str.length() << "\n"; // prints 6 
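
If you need the number of code points rather than bytes, one common approach is to skip the UTF-8 continuation bytes (a sketch that assumes the string is valid UTF-8):

#include <iostream>
#include <string>

// Count Unicode code points in a UTF-8 string by skipping continuation bytes
// (those of the form 10xxxxxx). This counts code points, not user-perceived
// characters (combining marks etc. are counted separately).
std::size_t count_code_points(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

int main()
{
    std::string str = "ÅÄÖ";
    std::cout << str.length() << "\n";           // 6 (bytes)
    std::cout << count_code_points(str) << "\n"; // 3 (code points)
}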


Do I need to use char16_t in Linux?

No, not unless you want to use UTF-16 for some reason, but I don't think there is much support for that in the standard/system libraries.
coder777 wrote:
To convert utf-8 to other UNICODE format you may use this:
https://en.cppreference.com/w/cpp/locale/codecvt_utf8

Note that the codecvt functions are deprecated and are planned to get removed in C++26.

Unfortunately, standard library support for Unicode is still not great. The fact that the standard library wants us to use char8_t for UTF-8 while the rest of the real world uses char is also a pain.
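
A small illustration of that friction (assuming a C++20 compiler and a UTF-8 execution character set):

#include <cstdio>
#include <string>

int main()
{
    // Since C++20, a u8"" literal has type const char8_t[], which no longer
    // converts implicitly to const char*, so passing it to char-based APIs
    // needs an explicit cast (the underlying bytes are the same).
    const char8_t* u8s = u8"Trevligt att träffas!";
    std::puts(reinterpret_cast<const char*>(u8s));

    // Whereas a plain narrow literal in a UTF-8 build just works:
    std::string s = "Trevligt att träffas!";
    std::puts(s.c_str());
}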
In general, the encoding of a char, and thus the encoding of a std::string, which is just a wrapper for a sequence of chars, is simply undefined. It stores whatever "bytes" you put into it! Most string functions are agnostic of a specific encoding. For example, strlen() (or std::string::length()) just counts the number of bytes before the first NUL (0x00) byte. Therefore, for multi-byte character encodings, such as UTF-8, strlen() returns the length in bytes, rather than computing the actual number of encoded characters.

Now, things get interesting (messy) when you receive strings from an "external" source, or when you pass strings to an "external" destination. That's because, at this point, you need to agree on a specific encoding with the "external" entity. One such situation is when you read strings from a file, or when you write strings to a file. Here you need to know which encoding is stored in the file, or which encoding will be expected by whoever is going to read your file. Another important situation is when you call OS functions that deal with strings!

On Windows, the Win32 API has two variants of functions that deal with strings: one "ANSI" (char*) variant and one "Unicode" (wchar_t*) variant. The "ANSI" variants of the Win32 API functions expect or return strings in whatever multi-byte character encoding (ANSI Codepage) happens to be configured on your system. It's usually something like Windows-1252 (Latin-1) on systems in the "Western" world, but could be something entirely different, even UTF-8. Note that support for UTF-8 in the "ANSI" APIs is a relatively new invention in Windows! Meanwhile, the "Unicode" Win32 API functions expect or return Unicode strings, always using the UTF-16 (UCS-2) encoding.
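
For reference, converting between the two worlds on Windows is typically done with WideCharToMultiByte()/MultiByteToWideChar(); a minimal Windows-only sketch of the UTF-16 → UTF-8 direction:

#ifdef _WIN32
#include <windows.h>
#include <string>

// Sketch: convert a UTF-16 (wchar_t) string to UTF-8 with the Win32 API.
// The first call queries the required buffer size, the second one converts.
std::string narrow_utf8(const std::wstring &wide)
{
    if (wide.empty()) {
        return std::string();
    }
    const int len = WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), (int)wide.size(),
                                        NULL, 0, NULL, NULL);
    std::string out((size_t)len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), (int)wide.size(),
                        &out[0], len, NULL, NULL);
    return out;
}
#endif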

In Linux, OS (kernel) functions generally use the char* type for passing around strings. But the Linux kernel developers are very reluctant to assume or enforce any specific character encoding. So, in Linux, most, if not all, OS (kernel) functions that deal with strings in some way are again agnostic of a particular character encoding! For example, in Linux, a file name is simply defined as a sequence of non-NUL bytes. The Linux kernel therefore leaves it up to applications or the specific file-system implementation to deal with the details... 🙄
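
A quick illustration of this "just bytes" behaviour (hypothetical file name, for demonstration only):

#include <fstream>

int main()
{
    // The kernel accepts any non-NUL bytes as a file name. The name below
    // contains the single byte 0xE9 ('é' in Latin-1), which is not valid
    // UTF-8 -- tools running under a UTF-8 locale will display it as a
    // replacement or escaped character, but as far as the kernel is
    // concerned the file name is perfectly legal.
    std::ofstream("caf\xE9.txt") << "hello\n";
}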

To make things even more complicated, many locale-aware programs or libraries use the so-called "locale" to deal with text input/output. It is configured with environment variables, like LANGUAGE, LC_xxx and LANG. The "locale" covers a bunch of other things, such as the formatting of numbers and the time/date format to be used, but it also includes the character set. Most commonly, UTF-8 is used these days.

https://www.gnu.org/software/gettext/manual/html_node/Locale-Environment-Variables.html
https://www.baeldung.com/linux/locale-environment-variables
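
A tiny sketch of how a program picks this up at startup (the exact locale name printed depends on your environment):

#include <clocale>
#include <cstdio>

int main()
{
    // A C/C++ program starts in the plain "C" locale; calling setlocale with
    // an empty string adopts whatever the environment (LANG / LC_*) selects.
    std::setlocale(LC_ALL, "");

    // Prints something like "en_US.UTF-8" on a typical desktop system.
    std::printf("LC_CTYPE locale: %s\n", std::setlocale(LC_CTYPE, NULL));
}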


You could obviously use it to store encodings other than UTF-8, even on Linux (some libraries might have special functions for different encodings), but in general, when dealing with the file system or printing something to the console etc., it will be treated as UTF-8.

Not necessarily. As pointed out above, the Linux kernel and the syscalls that it provides are agnostic of a particular character encoding as much as possible. Meanwhile, most locale-aware programs or libraries, including the terminal emulator, will probably use or assume the character encoding that is indicated by the active "locale" – most commonly UTF-8 these days, but this cannot be relied upon.
Thank you guys for the valuable input. I get it, so I can simply use std::string to handle UTF-8, which is great.

coder777 wrote:
To convert utf-8 to other UNICODE format you may use this:

https://en.cppreference.com/w/cpp/locale/codecvt_utf8


Peter87 wrote:
The codecvt functions are deprecated and are planned to get officially removed from the standard in C++26.


For converting from one Unicode string format to another I've been using std::c8rtomb, std::c32rtomb and std::c16rtomb; these are not deprecated.

Do you think they're good alternatives to the deprecated APIs from the <codecvt> header and to external libraries like the one coder777 suggested above?

I'm a fan of using as few external libraries as possible; that's why I'm asking this.

I already have functions built around these standard functions, but I haven't really needed them yet and my tests are limited, so I'm not aware of any pitfalls in using them compared to some well-tested library.
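
Something along these lines (a simplified sketch, not my exact code; note that c16rtomb converts to the active locale's multibyte encoding, so it only yields UTF-8 when a UTF-8 locale is in effect):

#include <climits>
#include <clocale>
#include <cstdio>
#include <cuchar>
#include <string>

int main()
{
    std::setlocale(LC_ALL, "");  // pick up the environment's (hopefully UTF-8) locale

    const char16_t in[] = u"Grüße";   // UTF-16 input, no surrogate pairs here
    std::string out;
    std::mbstate_t state{};
    char buf[MB_LEN_MAX];

    for (const char16_t* p = in; *p != u'\0'; ++p) {
        // c16rtomb() writes the multibyte sequence for each UTF-16 code unit;
        // for a high surrogate it just stores state and writes nothing until
        // the low surrogate arrives.
        const std::size_t n = std::c16rtomb(buf, *p, &state);
        if (n == static_cast<std::size_t>(-1)) {
            std::fputs("conversion failed\n", stderr);
            return 1;
        }
        out.append(buf, n);
    }

    std::printf("%s (%zu bytes)\n", out.c_str(), out.size());
}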
@kigar64551

Sorry, I missed your post because we posted at around the same time...
What you said is very useful info for a beginner like me; read and understood, thanks!
On Linux, iconv() gives you better control over the conversion:

It's not technically an "external" library, because it is provided by glibc (the C runtime library), which you'll use anyway 😏

#include <stdint.h>
#include <stdio.h>
#include <iconv.h>
#include <uchar.h>
#include <string.h>

static int utf16_to_utf8(char *const out_buff, const size_t out_capacity, const char16_t *const in_buff, const size_t in_length)
{
        if ((!out_buff) || (!out_capacity) || (!in_buff) || (!in_length)) {
                if (out_buff && out_capacity) {
                         *out_buff = '\0';
                }
                return 0;
        }

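        /* Create a conversion descriptor for converting from UTF-16LE to UTF-8 */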
        iconv_t utf16_utf8 = iconv_open("UTF-8", "UTF-16LE");
        if (utf16_utf8 == (iconv_t)-1) {
                *out_buff = '\0';
                return 1;
        }

        size_t in_bytes = in_length * sizeof(char16_t), out_bytes = out_capacity;
        char *in_addr = (char*)in_buff, *out_addr = out_buff;
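        /* iconv() advances the in/out pointers and decrements the remaining byte counts as it converts */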
        size_t result = iconv(utf16_utf8, &in_addr, &in_bytes, &out_addr, &out_bytes);

        iconv_close(utf16_utf8);

        if (result) {
                *out_buff = '\0';
                return 1;
        }

        /* Make sure output is NUL-terminated! */
        if (out_bytes > 0U) {
                if ((out_addr == out_buff) || (out_addr[-1] != '\0')) {
                        *out_addr = '\0';
                }
        } else {
                out_buff[out_capacity - 1U] = '\0';
        }

        return 0;
}

int main()
{
        static const char16_t utf16_data[] = { 0x0053, 0x0063, 0x0068, 0x00F6, 0x0070, 0x0066, 0x0067, 0x0065, 0x0066, 0x00E4, 0x00DF, 0x0020, 0xD83C, 0xDF4C, 0x0000 };
        char utf8_buffer[256U];

        if (utf16_to_utf8(utf8_buffer, sizeof(utf8_buffer), utf16_data, sizeof(utf16_data)/sizeof(char16_t))) {
                fputs("Failed to convert!\n", stderr);
                return -1;
        }

        for (size_t i = 0U; i < sizeof(utf8_buffer); ++i) {
                printf((i > 0U) ? ", 0x%02X" : "0x%02X", (unsigned char)utf8_buffer[i]); /* cast avoids sign-extension of bytes >= 0x80 */
                if (!utf8_buffer[i]) {
                        break;
                }
        }

        printf("\n\"%s\"\n", utf8_buffer);

        return 0;
}
$ ./a.out
0x53, 0x63, 0x68, 0xC3, 0xB6, 0x70, 0x66, 0x67, 0x65, 0x66, 0xC3, 0xA4, 0xC3, 0x9F, 0x20, 0xF0, 0x9F, 0x8D, 0x8C, 0x00
"Schöpfgefäß 🍌"
@kigar64551
I'm very grateful for your sample code and mention of iconv!
This is similar to WideCharToMultiByte and MultiByteToWideChar on Windows, so I'm already copying and adapting the code with #ifdef _WIN32 / #elif defined(__linux__) to handle both cases, yeah!

Thanks!