As far as I know, C++ doesn't even define what the compiler should do if the source contains international characters (for the purposes of this explanation we'll define "international characters" as those that are neither a Latin alphabet letter, an Arabic numeral, nor a punctuation mark). So something like [code]std::cout << std::string("学霸").size();[/code] could print pretty much any value, simply because [code]"学霸"[/code] doesn't have a well-defined binary translation.
You should avoid writing non-ASCII characters in source files unless you clearly understand what you're doing.
* What encoding are you using for the source? E.g. UTF-8? UTF-16? Something more esoteric?
* What compiler are you using?
* What does that particular compiler do when it encounters these characters in the given encoding?
* Do you care if other people try to compile your code using a different compiler that might do something entirely different?
1. Encoding: I'm on Ubuntu 14.04. If I'm not wrong, the default encoding is UTF-8.
2. Compiler: g++. I use it from Konsole. g++ -std=c++0x foo.cpp -o foo
3. Result: It compiles the program.
4. Other computers: I'd love to see what the program returns on other machines.
Note that in C++11 (and later) there are special UTF-8 string literals that you can use by placing u8 in front of the quoted strings, e.g. u8"ĵurnalo".
I don't think there is a built-in function for calculating the length of a string encoded in UTF-8, but it's not that hard to write one yourself. If you take a look at the specification for UTF-8 (https://en.wikipedia.org/wiki/Utf8#Description) you'll see that every byte of a multi-byte character except the first starts with the bit pattern 10. That means you can loop through the string and count the bytes that don't start with the bit pattern 10 to get the length of the string.
But yes, C++11's wstring_convert, as in JLBorges's answer, is both more convenient and more portable (mblen and mbsrtowcs rely on an OS facility; std::codecvt_utf8 does not).