As far as I know, C++ doesn't even define what the compiler should do if the source contains international characters (for the purposes of this explanation we'll define "international characters" as those that are neither a Latin alphabet letter, an Arabic numeral, nor a punctuation mark). So something like [code]std::cout << std::string("学霸").size();[/code] could print pretty much any value, simply because [code]"学霸"[/code] doesn't have a well-defined binary translation.
You should avoid writing non-ASCII characters in source files unless you clearly understand what you're doing.
* What encoding are you using for the source? E.g. UTF-8? UTF-16? Something more esoteric?
* What compiler are you using?
* What does that particular compiler do when it encounters these characters in the given encoding?
* Do you care if other people try to compile your code using a different compiler that might do something entirely different?
1. Encoding: I'm on Ubuntu 14.04. If I'm not wrong, the default encoding is UTF-8.
2. Compiler: g++. I use it from Konsole. g++ -std=c++0x foo.cpp -o foo
3. Result: It compiles the program.
4. Other computers: I'd love to see what the program returns on other machines.
Note that in C++11 (and later) there are special UTF-8 string literals that you can use by placing u8 in front of the quoted strings, e.g. u8"ĵurnalo".
I don't think there is a built-in function for calculating the length of a string encoded in UTF-8, but it's not that hard to write one yourself. If you take a look at the specification for UTF-8 (https://en.wikipedia.org/wiki/Utf8#Description) you'll see that every byte of a multi-byte character except the first starts with the bit pattern 10. That means you can loop through the string and count the bytes that don't start with the bit pattern 10 to get the length of the string.
But yes, C++11's wstring_convert, as in JLBorges's answer, is both more convenient and more portable (mblen and mbsrtowcs rely on an OS facility; std::codecvt_utf8 does not).