char, unicode char

Hello,
I need to know how many bytes a Mongolian Cyrillic character takes.
To do this, I wrote the following code.
    string d = "сайн";
    // step through the string two bytes at a time
    for (size_t i = 0; i < d.length(); i += 2)
    {
        cout << d.substr(i, 2) << endl;
    }

Output is:
с
а
й
н
So according to the output, I think one Mongolian Cyrillic char is 2 bytes in size.
But when I try this code, I get the error "Initializer string for array is too long".
 
char a[2]="й";

So does this mean that the size of a Mongolian Cyrillic char is not 2 bytes?
that means you cannot assign a c string to a char array.

Either so:
char a[2]={'й'};
or so:
const char *a="й";

Usually you need to prefix a character or string literal with L in order to get the wide (Unicode) variant.
coder777 wrote:
that means you cannot assign a c string to a char array.

??????

-> http://ideone.com/7ng81B
One extra byte is needed for the null terminator.

If you prepend L, you have to work with wide characters (wchar_t). You can also have Unicode with plain char if you use an encoding like UTF-8. Whether string literals are UTF-8 encoded by default is implementation-defined. In C++11 you can request UTF-8 encoding by prepending u8: u8"сайн"
ok: that means you cannot assign the c string to the char array.

Because the c string has a terminating 0 (-> 3 chars) and the array allows only 2 chars.
does it mean that size of mongolian cyrillic char is not 2 bytes?

The size of a char is always one byte. But, when you use non-ASCII characters in a string literal (between quotes), in a program, something else is stored.

In your case, it sounds like you're using something like Linux, so you're getting UTF-8, where each of these characters indeed happens to be encoded as a two-byte sequence:

#include <iostream>
#include <string>

int main()
{
    std::string d1 = "сайн";
    std::string d2 = "\xd1\x81\xd0\xb0\xd0\xb9\xd0\xbd";
    if(d1 == d2)
        std::cout << "Your development environment uses UTF-8\n";
    else
        std::cout << "Your development environment uses something else\n";
}
online demo: http://ideone.com/6gOZcS

(And yes, as already mentioned, you forgot the null terminator: char a[3]="й"; is equivalent to char a[3]={'\xd0', '\xb9', '\0'}; under UTF-8.)