Is it a good practice to avoid strlen

I was looking for some tips and tricks in C++ and saw that you can avoid strlen or length() by testing array[i] in the for loop condition. But is it really good practice to use the array like that? I mean, the loop will exit when it reaches the end of the array, but we never explicitly check whether the end has been reached.

And my second question would be: do you know good websites for learning good C++ practice? For example, how and when to use templates, lambdas or structs.

ne555 (10692)

C-strings are null-terminated.
Other arrays are not.

keskiverto (10366)

We can only guess what you mean with the "array[i] in the for loop". Speculation does not lead to enlightenment.

The functions strlen() and std::char_traits<char>::length() return the length of a null-terminated character sequence.

The implementation could be like:

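#include <cstddef>

// Illustrative only; the real strlen is declared in <cstring> and returns std::size_t.
std::size_t my_strlen( const char* s ) {
  std::size_t count = 0;
  while ( *s ) {  // stop at the terminating '\0'
    ++count;
    ++s;
  }
  return count;
}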

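If you iterate over a C-string:

// needs <cstring> and <iostream>
const char* data = "Hello";

std::size_t len = std::strlen( data );     // first pass: find the terminator
for ( std::size_t i = 0; i < len; ++i ) {  // second pass: do the work
  std::cout << data[i] << '\n';
}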

Then you loop over the C-string twice: once to compute length and again to do your work.

Would it make sense to merge the loops? Not necessarily. In the code above it should be very clear that our loop iterates through a C-string, because we see the 'strlen'.
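
For reference, the merged form that the question seems to describe might look like the sketch below. The helper name one_per_line is made up for illustration, and the loop is only valid for a null-terminated string; on any other array it would run off the end.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: walks a null-terminated string once, testing s[i]
// in the loop condition instead of calling strlen first. Each character is
// collected on its own line, like the printing loop above.
std::string one_per_line(const char* s) {
    std::string out;
    for (std::size_t i = 0; s[i] != '\0'; ++i) {
        out += s[i];
        out += '\n';
    }
    return out;
}
```

Calling one_per_line("Hello") gives the same text the two-pass loop prints, with a single pass over the data.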

Good C++ practice? Do not use C-strings. Use std::string.

std::string data = "Hello";
for ( char c : data ) {
  std::cout << c << '\n';
}

jonnin (11341)

When in C, or forced by some requirement to use C strings, you should use strlen. When using std::string, you should use length(). When dealing with raw bytes put into a std::string (better not to do this; use a vector of bytes) or arrays of raw bytes, which are not strings, you should not use strlen: it stops at the first zero byte and will not do what you want.
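
A small sketch of the raw-bytes problem, with made-up helper names (sample_bytes, length_as_c_string). It assumes the bytes are copied into a std::string only so that strlen has a terminator to find.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Three raw bytes with a zero in the middle; this is data, not text.
inline std::vector<unsigned char> sample_bytes() {
    return {0x41, 0x00, 0x42};
}

// What strlen reports if the bytes are (wrongly) treated as a C string.
// The copy into std::string guarantees a trailing '\0' for strlen to find.
std::size_t length_as_c_string(const std::vector<unsigned char>& bytes) {
    std::string copy(bytes.begin(), bytes.end());
    return std::strlen(copy.c_str());  // stops at the first zero byte
}
```

Here the vector's size() is 3, while strlen sees only the one byte in front of the embedded zero.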

Niccolo (720)

I have one addendum to @jonnin's point - use strnlen instead.

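strlen will happily read past the end of the buffer (undefined behavior, often a crash) if given a "C string" that is not zero terminated. strnlen, if given the correct length of the buffer supplied, will stop before that happens. It is considered safer. Note that strnlen comes from POSIX rather than the ISO C or C++ standards, so check that your platform provides it.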
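
Since strnlen is not guaranteed everywhere, here is a portable sketch of the same idea built on std::memchr. The name bounded_length is hypothetical, and the caller is assumed to pass the true size of the buffer.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for strnlen: returns the number of chars before the
// first '\0', or max_len if no terminator is found within max_len bytes.
std::size_t bounded_length(const char* buf, std::size_t max_len) {
    const void* end = std::memchr(buf, '\0', max_len);
    if (end == nullptr) {
        return max_len;  // no terminator inside the buffer
    }
    return static_cast<std::size_t>(static_cast<const char*>(end) - buf);
}
```

For a buffer that holds "abc" plus a terminator this returns 3; for four bytes with no terminator it returns 4 instead of reading further.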

That said, strnlen has wide character counterparts (such as wcsnlen), which means Unicode. This brings up the nasty part of strings - they are not all the same thing. Unicode is a character set meant to cover the world's languages (and their various symbols), and it can be stored in several encodings. The C library's string functions were created on the assumption of one byte per character, essentially ASCII, where English was the only expected language. Some extension was made to accommodate a few of the European languages (and their additional letters), but that was not sufficient for the whole world. Eventually other solutions were proposed, among them Unicode, which in its earliest versions used a fixed 16 bit character.

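That wasn't the end of it. 16 bit characters offer only 65,536 values, which turned out to be too few. Unicode's code space now runs up to U+10FFFF (about 1.1 million code points), so UTF-16 needs two 16 bit units (a surrogate pair) for anything beyond the first 65,536. There is a fixed-width 32 bit form, UTF-32, which is what wchar_t holds on Linux and macOS, but it is very wasteful for mostly-ASCII text.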

The Internet basically settled the argument with UTF-8, officially one of the Unicode encodings (but not what every platform expects when configured for "unicode" - the Windows API uses 16 bit units, while wchar_t on Linux and macOS is 32 bits).

This gets annoying and dizzying. UTF-8 is a de facto world standard, primarily because of the Internet. The programming confusion with it comes from the fact that there is no 1 to 1 correspondence between characters and bytes. The ASCII (English) set fits into UTF-8 identically. An ASCII string is entirely compliant with UTF-8. However, when other languages are used, and their characters fall outside the 7 bit range, UTF-8 moves to two bytes. Beyond that it extends to three bytes, and then four, which is the maximum the current standard (RFC 3629) allows.
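
To make the variable width concrete: under RFC 3629 a code point takes between one and four bytes in UTF-8, depending on its value. The sketch below encodes just that size rule; utf8_encoded_size is a made-up name, and it does no actual encoding.

```cpp
#include <cstddef>

// Number of bytes UTF-8 needs for one code point (RFC 3629 ranges).
// Returns 0 for values that are not valid Unicode scalar values.
std::size_t utf8_encoded_size(char32_t cp) {
    if (cp < 0x80) return 1;                     // ASCII
    if (cp < 0x800) return 2;                    // e.g. Latin letters with accents
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  // surrogates are not characters
    if (cp < 0x10000) return 3;                  // rest of the 16 bit range
    if (cp <= 0x10FFFF) return 4;                // everything above it
    return 0;                                    // outside the Unicode code space
}
```

So 'A' needs one byte, U+00E9 (é) two, U+20AC (the euro sign) three, and U+1F600 (an emoji) four.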

UTF-16 has its own trap: a lot of code assumes every character fits in one 16 bit unit, which breaks on anything that needs a surrogate pair. The Internet "settled" on UTF-8 as more languages came online, which coincided with various nations joining the Internet. UTF-8 is ASCII compatible and has no byte-order issues, which made it the path of least resistance. The large majority of web pages are UTF-8 as a result.

That may seem like a lot more than you care to know, but stop and think.

That very attitude, by programmers, for decades is why so much code that handles 16 bit 'unicode' (the wide character API on Windows, for example) is buggy for languages beyond simple European languages. Decades ago it didn't matter much, because most people who had computers (back before the 80's) were Americans, the British and some French or German people. The Russians did their own thing for years, or worked in English.

In the modern era, if you so much as touch the web, you will encounter UTF-8. You're probably using it in your editor for code without realizing it. Limited to English, you will never realize it is a problem. In the connected world you will discover it becomes one.

As such, you may as well learn now, so you don't have to unlearn and then relearn later.

Avoid C style strings for text. They have no built-in awareness of the world's languages. Plain byte comparison (strcmp) does not sort non-English text correctly. If you feed the typical C functions non-ASCII strings, you are going to get surprises at some point. UTF-8 was designed so that a zero byte never appears inside a multi-byte character, so strnlen will not stop early or crash, but it will not tell you what you expect: it counts bytes, not characters.

There is, as a result, a difference between the number of characters in a string and the number of bytes in a string. A UTF-8 string with 10 characters, where 1 of them is a two-byte non-English character, will need 11 bytes. It will be 10 characters long, and require 10 characters of space to print, but it must have 11 bytes for storage. Another string of 10 characters, in that non-English language, may require 13 bytes for storage. It may have 2 (or more) multi-byte characters.
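
The 10 character / 11 byte case can be checked with a few lines. This is a rough sketch: count_code_points and sample_text are made-up names, the input is assumed to be valid UTF-8, and a code point is not always what a user sees as one character (combining accents, for example).

```cpp
#include <cstddef>
#include <string>

// Counts UTF-8 code points by skipping continuation bytes (bit pattern 10xxxxxx).
// Assumes the input is valid UTF-8; no validation is done.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++n;  // this byte starts a new code point
        }
    }
    return n;
}

// "café latte" spelled with explicit bytes; é is 0xC3 0xA9 in UTF-8.
inline std::string sample_text() {
    return "caf\xC3\xA9 latte";
}
```

For sample_text(), size() reports 11 (bytes) while count_code_points reports 10.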

UNIX and Linux (which heavily shaped the content of the C library) were never designed to deal with this. Some ideas were tacked on, but that isn't a genuine solution. In case it is not clear, C, and therefore the C library and C style strings, was designed to be the language for writing the UNIX operating system. C was built specifically for that first (and, subsequently, once UNIX was built, for applications that run on UNIX). Linux inherited that.

...and that was the 70's. It became nearly carved in stone. What has been tacked on to help cope with the rest of the world is ugly and prone to errors in application work.

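Use string classes. They manage storage for you and always know their own length. Be aware that std::string::length() still counts bytes (char units), not characters; to count characters or sort text correctly across the world's languages you also need locale support or a Unicode library such as ICU.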

Do not fall into the trap that perhaps 90% of programmers do - assuming this doesn't matter. The problem is that, for English speaking programmers, it doesn't matter. Until, that is, one puts that code into the world.

As to @zongul's other questions:

Are you referring to the modern for loop?

int a[10]{};

for( auto & n : a )
{
 // each n is an element of a
}

It should be clear that this isn't quite right for C style strings when a is a char array: the loop visits every element of the array, including the terminator and anything after it, so n must be checked for zero along the way. (And with UTF-8, each n is a byte, not necessarily a whole character.) However, this type of loop is safer, as the loop "knows" the size of the container and cannot run past it. That is a clear improvement over older style for loops.
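
A quick sketch of that difference for a char array, using two hypothetical helpers. Both take the array by reference so the size is known, as it is in the loop above.

```cpp
#include <cstddef>

// Counts the elements a range-based for visits in a fixed-size char buffer.
template <std::size_t N>
std::size_t visited_by_range_for(const char (&buf)[N]) {
    std::size_t n = 0;
    for (char c : buf) {
        (void)c;  // the value is irrelevant; only the visit is counted
        ++n;
    }
    return n;
}

// Same loop, but stops at the first '\0', which is what text handling wants.
template <std::size_t N>
std::size_t visited_until_terminator(const char (&buf)[N]) {
    std::size_t n = 0;
    for (char c : buf) {
        if (c == '\0') break;
        ++n;
    }
    return n;
}
```

With char a[10] = "Hi", the first helper counts all 10 elements and the second counts only the 2 characters of text.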

structs are classes. They differ only in the defaults: members and base classes are public by default in a struct and private by default in a class. The keyword is a relic from C. Some coding standards for C++ reserve struct for plain data or C legacy code, and use class with explicit private/public sections for everything else.
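
A minimal illustration of the default-access difference. Point and Counter are made-up types, not anything from the thread.

```cpp
// Members of a struct are public unless stated otherwise.
struct Point {
    int x = 0;
    int y = 0;
};

// Members of a class are private unless stated otherwise.
class Counter {
    int count_ = 0;  // private by default; not reachable from outside
public:
    void increment() { ++count_; }
    int value() const { return count_; }
};
```

Code outside can write p.x directly for a Point, but has to go through increment() and value() for a Counter; writing c.count_ from outside would not compile.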

Templates have been extremely important since the moment they were introduced in the 90's (once the compilers could actually compile them). They are for code which adapts to type, which is to say code that is generically applicable to a wide range of uses, like the STL containers. This implies the programmer is capable of writing code worthy of being reused in future applications.
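
As a small taste of "code which adapts to type", here is a sketch of a function template. The name largest is made up, and it assumes the element type supports operator<.

```cpp
#include <cstddef>

// Returns a reference to the largest element of a built-in array.
// Works for any element type T that supports operator<; the compiler
// deduces both T and the array size N at the call site.
template <typename T, std::size_t N>
const T& largest(const T (&items)[N]) {
    const T* best = &items[0];
    for (const T& item : items) {
        if (*best < item) {
            best = &item;
        }
    }
    return *best;
}
```

The same source serves an array of int, of double, or of any class with a less-than operator, which is the kind of reuse the STL containers and algorithms are built on.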

There are web sites with blogs and articles which tutor on these points, but nothing dives into the subject like a book. Any recent books by authors like Josuttis, Stroustrup, Sutter, and others of their reputation are particularly good sources. Josuttis's recent work covering everything of importance in C++17 may be the essential text for you, but it will serve you well to have one from Stroustrup too (his latest, to my knowledge, is on C++14). You should visit the isocpp.org website to acquaint yourself with the luminaries on the subject.
