Function to determine if a string is made by all distinct characters

Hello all! Here is my idea: sort a string s, introduce an iterator goes through the whole string, and check if for any given character, the character is equal to the next. If so, the string has some duplicate characters; if not, the string is made by all unique characters. I have called my function "is_unique". However, when I run the program, it always gives the value False, which means that for some reason the algorithm reads all the sequences as if they have duplicate characters, even when they don't. So there should be something wrong, but I cannot see what exactly. Here is the code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
  #include <iostream>
#include <algorithm>
#include <string>

using namespace std;

    bool is_unique(string& str)
    {
        sort(str.begin(), str.end());

        typedef string::const_iterator iter;
        iter i = str.begin();

        while (i != str.end())
        {
            if (*i == *(i++))
            return false;
        }
            return true;
    }

    int main()
    {
        cout << "please enter a string " << endl;
        cout << "followed by end-of-file"  << endl;

             string s;
            (cin >> s);
             cout << "Is the string made by unique characters? " << is_unique(s) << endl;

        return 0;
    }
Line 16 if (*i == *(i++)) is problematic for at least two reasons:

1. You don't know if *i or i++ will be evaluated first so it is possible that you are comparing the same character to itself.

2. After you have incremented the iterator you need to check if you have reached the end of the string. It is not valid to dereference the end iterator.
I have made the following changes:
1) pre-increment instead of post-increment
2) change the loop bound to str.end()-1

And all seems to work fine, now.
Claudius7 wrote:
1) pre-increment instead of post-increment

You still have the same problem that you don't know in which order *i and ++i will be evaluated. The compiler is allowed to evaluate *(++i) before evaluating *i. To be safe I recommend moving the increment outside of the condition.

Claudius7 wrote:
2) change the loop bound to str.end()-1

Note that this will not work if the string is empty.
Last edited on
Hi, thanks for your comment. But how could I move the increment outside the condition? I'm afraid that it will screw the entire code, so perhaps can you explain it ? Thank you again
Why don't you use std::unique_copy to copy the original string to a second one. If they are equal than all characters are unique.
http://www.cplusplus.com/reference/algorithm/unique_copy/

1
2
3
4
5
6
7
8
bool is_unique(string& str)
{
  sort(str.begin(), str.end());
  string tmp;
  unique_copy(str.begin(), str.end(), back_inserter(tmp));

  return str == tmp;
}
Excellent, thank you very much. Much simpler solution.
Claudius7 wrote:
Hi, thanks for your comment. But how could I move the increment outside the condition?

You could have incremented the the iterator before the if statement and stored the characters before and after in variables.

1
2
3
4
char a = *i;
++i;
char b = *i;
if (a == b)


When working with iterators it often makes sense to use for loops. You could have used +1 on the iterator to get an iterator to the next element.

1
2
3
4
5
for (iter it = str.begin(); it != str.end() - 1; ++it)
{
	if (*it == *(it + 1))
		return false;
}



Claudius7 in response to the suggestion to use unique_copy wrote:
Excellent, thank you very much. Much simpler solution.

If you use std::unique instead of std::unique_copy you don't need to create a temporary object.

1
2
3
4
5
bool is_unique(string& str)
{
	sort(str.begin(), str.end());
	return unique(str.begin(), str.end()) == str.end();
}

Using std::adjacent_find is a bit more efficent because it doesn't have to go through the whole array after it finds two adjacent elements that are equal.

1
2
3
4
5
bool is_unique(string& str)
{
	sort(str.begin(), str.end());
	return adjacent_find(str.begin(), str.end()) == str.end();
}


Note that the function modifies the string that is passed to the function. That's not what one would expect from the function name. Maybe it would have been better to pass the string by value.

1
2
3
4
5
bool is_unique(string str)
{
	sort(str.begin(), str.end());
	return adjacent_find(str.begin(), str.end()) == str.end();
}

You can still prevent the string from being copied, in case you don't care if it's being modified, by using std::move when calling the function.

 
cout << "Is the string made by unique characters? " << is_unique(std::move(s)) << endl;
Last edited on
OP's version of what it means for a string to be considered 'unique':

check if for any given character, the character is equal to the next. If so, the string has some duplicate characters; if not, the string is made by all unique characters.


So "lovely" would be considered 'unique' as per this definition as the two 'l's are non-adjacent, and hence this definition should preclude any application of std::sort

OP: on another note:
1
2
 string s;
            (cin >> s);

would stumble on any whitespace, use getline() instead
gunnerfunner: in my definition, before the statement you quoted, I clearly added that the string has already been sorted. Which is in fact the first thing that my function does.
yeah, you're right, i missed that
1
2
3
4
5
#include <string>
#include <unordered_set>

bool is_unique( const std::string& str ) // O(N) time, O(N) space
{ return std::unordered_set<char>( str.begin(), str.end() ).size() == str.size() ; }

The original algorithm works but it's very inefficient because it has to sort the entire string before it starts checking.

This method doesn't sort the string and returns when it sees the first non-unique character.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
#include<iostream>
#include <bitset>
#include <limits>
using std::string;
using std::cin;
using std::cout;
using std::bitset;

bool uniqueChars(const string &str)
{
    bitset<std::numeric_limits<unsigned char>::max()> seen;
    for (unsigned char ch : str) {
	if (seen.test(ch)) {
	    return false;
	}
	seen.set(ch);
    }
    return true;
}


int
main()
{
    string str;
    while (getline(cin, str)) {
	cout << (uniqueChars(str) ? "unique\n" : "not unique\n");
    }
}
dhayden: your method is v interesting in its application of the bitset variable. I tried reading up the test() and set() methods but the full working of the program is still not entirely clear to me. Would you mind writing up a couple of lines by way of explanation how the program does what it does? Thanks
A bitset is a conceptually like vector<bool>: it's a collection of boolean values. But it's implemented to conserve space: each boolean value is represented by a single bit.

The test() method tests whether the n'th bit is set. The set() method sets that bit. The default constructor sets all bits to false, which is important for this code.

The code iterates through each character in the string and says "if I've already seen this character then return false. Otherwise indicate that I've seen this character." It keeps track of whether a character has been seen by setting the bit in the bitset that corresponds to the ascii value of the character.
bitset<std::numeric_limits<unsigned char>::max()> seen;
This is the portable and somewhat pedantic way of saying
bitset<256> seen;
There are 256 possible characters on just about every computer, but using the numeric_limits thing makes it explicit and portable.

One final note, it's important to use unsigned char rather than char in the numeric_limits and in the declaration of ch. If chars are signed on the computer then they can have negative values which won't work in the bitset functions.

If performance is the primary concern, do not use bitset.
Note: the number of bits should be std::numeric_limits<unsigned char>::max() + 1

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
#include <bitset>
#include <limits>
#include <string>

bool unique_chars_bitset( const std::string& str ) {
       
    // incorrect and even more expensive (need to check for and throw out_of_range)
    constexpr std::size_t N = std::numeric_limits<unsigned char>::max() ;
    std::bitset<N> seen ;

    for( unsigned char ch : str ) {
            
        if( seen.test(ch) ) return false;
        seen.set(ch);
    }
    
    return true;
}

Assembly: https://godbolt.org/g/gYJjW1

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
#include <bitset>
#include <limits>
#include <string>

bool unique_chars_bitset( const std::string& str ) {
       
    // +1: correct and not as expensive with a clever optimise 
    // (the optimiser can recognise that no exceptions need to be thrown)
    constexpr std::size_t N = std::numeric_limits<unsigned char>::max() + 1 ;
    std::bitset<N> seen ;

    for( unsigned char ch : str ) {
            
        if( seen.test(ch) ) return false;
        seen.set(ch);
    }
    
    return true;
}

Assembly: https://godbolt.org/g/XTew94

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
#include <limits>
#include <string>

bool unique_chars_array( const std::string& str ) noexcept
{
    // +1 : correct and efficient (without the +1, may engender undefined behaviour)
    constexpr std::size_t N = std::numeric_limits<unsigned char>::max() + 1 ;
    int seen[N] {} ;

    for( unsigned char ch : str ) {
            
        if( seen[ch] ) return false;
        seen[ch] = true ;
    }
    
    return true;
}

Assembly: https://godbolt.org/g/KrLpJh
Thank you all, dhyayden and JLBorges, very interesting and informative discussion! :)
Last edited on
The difference in performance doesn't seem to be huge but the memory usage is.
The bitset will only need to use 32 bytes while the int array needs 1024 bytes (assuming sizeof(int) == 4).
> The difference in performance doesn't seem to be huge

Assuming noexcept throughout, a bitset is typically about twice as slow as an array or vector of int.

> The bitset will only need to use 32 bytes while the int array needs 1024 bytes (assuming sizeof(int) == 4).

The difference in performance is important only if the function is called a large number of times.
If it is called N times, the difference in performance is O(N)

The temporary memory used by the function is O(1) (memory for locals is immediately reclaimed when the function returns)

If the stack is too small to accommodate an object of 1K size, on intel architectures an array of bool flags (256 bytes) would give performance comparable to array of int flags.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
#include <bitset>
#include <limits>
#include <string>
#include <vector>
#include <fstream>
#include <algorithm>
#include <random>
#include <ctime>
#include <iostream>
#include <cassert>

bool unique_chars_bitset( const std::string& str ) {

    // +1: correct and not as expensive with a clever optimise
    // (the optimiser can recognise that no exceptions need to be thrown)
    constexpr std::size_t N = std::numeric_limits<unsigned char>::max() + 1 ;
    std::bitset<N> seen ;

    for( unsigned char ch : str ) {

        if( seen.test(ch) ) return false;
        seen.set(ch);
    }

    return true;
}

bool unique_chars_array_i( const std::string& str ) noexcept
{
    // +1 : correct and efficient (without the +1, may engender undefined behaviour)
    constexpr std::size_t N = std::numeric_limits<unsigned char>::max() + 1 ;
    int seen[N] {} ;

    for( unsigned char ch : str ) {

        if( seen[ch] ) return false;
        seen[ch] = true ;
    }

    return true;
}

bool unique_chars_array_b( const std::string& str ) noexcept
{
    // +1 : correct and efficient (without the +1, may engender undefined behaviour)
    constexpr std::size_t N = std::numeric_limits<unsigned char>::max() + 1 ;
    bool seen[N] {} ;

    for( unsigned char ch : str ) {

        if( seen[ch] ) return false;
        seen[ch] = true ;
    }

    return true;
}

std::vector<std::string> get_strings( std::size_t N )
{
    std::string str ;
    constexpr std::size_t SZ = std::numeric_limits<unsigned char>::max() ;
    for( unsigned char c = 0 ; c < SZ ; ++c ) str += c ;

    return { N, str } ;
}

int main()
{
    std::size_t n = 0 ;

    const auto seq = get_strings( 1'000'000 ) ;
    clock_t bitset_ticks = 0 ;
    clock_t array_ticks = 0 ;


    {
        const auto start = std::clock() ;
        for( const std::string& str : seq ) n += unique_chars_bitset(str) ;
        const auto end = std::clock() ;
        std::cout << "     bitset: " << ( bitset_ticks = end-start ) << " clock ticks\n" ;
    }

    {
        const auto start = std::clock() ;
        for( const std::string& str : seq ) n += unique_chars_array_i(str) ;
        const auto end = std::clock() ;
        std::cout << " array(int): " << ( array_ticks = end-start ) << " clock ticks\n" ;
    }

    {
        const auto start = std::clock() ;
        for( const std::string& str : seq ) n += unique_chars_array_b(str) ;
        const auto end = std::clock() ;
        std::cout << "array(bool): " << end-start << " clock ticks\n" ;
    }

    std::cout << "bitset is slower than array(int) by: " << (bitset_ticks-array_ticks)*100.0 / array_ticks << "%\n" ;
    assert( n == seq.size() * 3 ) ;
    return 0 ;
}

On X64
clang++: bitset is slower than array(int) by: 74%
g++ -O3: bitset is slower than array(int) by: 106%
Microsoft: bitset is slower than array(int) by: 129%
http://coliru.stacked-crooked.com/a/d1244a679a104753
http://rextester.com/SFWAN37914
OK, testing for the worst case scenario (long strings with no duplicates) it clearly pays off using an array.
For shorter strings (10-20 characters long) the bitset seems to be faster, probably because the array version still has to initialize the whole array but the function doesn't run long enough for it to pay off.
Topic archived. No new replies allowed.