Best test for duplicate values?

I am writing a program that requires a function to test for duplication of values in a vector.
I have a vector of nine integer values. What I want is to test the entire vector to see if any of the values are duplicated in the vector. If there is duplication, one of the values should be replaced with a random integer, and I can recursively check the vector until all values are unique.

Is there a function in the STL that does something like this? I'm too new at this to know offhand. My workaround is to write a function that accepts the vector as a parameter, sorts it with <algorithm>'s sort, then resizes it after <algorithm>'s unique to see whether the size changes. And then I still have to keep track of the original order of the vector... it seems to me there should be a more efficient solution.
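The sort-and-compare idea described above can be sketched as a small yes/no test on a copy of the vector, so the original order is never disturbed (a hedged sketch; `has_duplicates` is a made-up name, not a standard function):

```cpp
#include <algorithm>
#include <vector>
#include <cassert>

// Sketch of the workaround described above: sort a copy, then see
// whether std::unique finds any adjacent equal values.  Taking the
// vector by value means the caller's ordering is left untouched.
bool has_duplicates(std::vector<int> v)
{
    std::sort(v.begin(), v.end());
    return std::unique(v.begin(), v.end()) != v.end();
}
```

This only answers the detection question; the replacement step still has to be done separately.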

Example:
input vector has values {1,12,8,1,7,15,20,9,12}
output vector after function {17,12,8,1,7,15,20,9,2}

where the first 1 and the last 12 have been randomly modified.
Any suggestions for a novice?
Something like

void uniqueify(std::vector<int>& v)
{
	typedef std::vector<int>::size_type v_sz_t ;
	typedef std::set<int>::size_type s_sz_t ;
	std::set<int> processed ;

	processed.insert(v[0]) ;
	for ( v_sz_t i=1;  i < v.size() ; ++i )
	{
		s_sz_t n = processed.size() ;

		int value = v[i] ;
		processed.insert(value); 

		while (n == processed.size())
		{
			value = get_rand() ;
			processed.insert(value) ;
		}

		v[i] = value ;
	}
}


Obviously "get_rand()" should be replaced by your method of generating a random number. Small caveat: it's possible that duplicates early in the vector are replaced by random numbers that already occur later in the vector. In that case, the later number is also changed, so it doesn't necessarily preserve the values of your non-duplicates. That would require more work.

I would create an array of the same size and copy the vector into it one number at a time. Then, as each number is copied, check the array's previous positions and replace it with a random number if a copy turns up.

Rough code.
int dup[9];
for(int copy = 0; copy < 9; copy++)
{
  dup[copy] = vec[copy];
  for(int check = 0; check < copy; check++)
  {
    if(dup[copy] == dup[check])
      vec[copy] = rand();
  }
}
Thank you for the replies.
@cire: I have looked at sets. I think this is the way to go. Your suggested code will add all the random numbers at the end of the set, however, and I want to preserve the order. I will play around with this and get back. Creating new duplicates isn't a problem if I run the function recursively until there is no duplicate.

@GRex: I'd rather use a container than an array. The problem I see with the rough code is that the random number will also be at the end of the array/container rather than randomly. I want to replace either the first or the second duplicate value.

Are there any other standard ways to test for duplication?
Is it possible for you to avoid the duplicate values in the sequence when the sequence is built?

For example, to get a sequence of n non-duplicated random numbers in the range 0 to m-1

std::vector<int> seq(m) ;
for( int i=0 ; i<m ; ++i ) seq[i] = i ;
std::random_shuffle( seq.begin(), seq.end() ) ;
seq.resize(n) ;


Or if m is very large:

std::vector<int> seq(n) ;
for( std::size_t i=0 ; i<n ; ++i ) seq[i] = i ;
static std::random_device rdev ;
static std::mt19937 twister( rdev() ) ;
std::uniform_int_distribution<std::size_t> zero_to_m( 0, m-1 ) ;
std::uniform_int_distribution<std::size_t> zero_to_n( 0, n-1 ) ;
for( std::size_t i=n ; i<m ; ++i )
    if( zero_to_m(twister) < n ) seq[ zero_to_n(twister) ] = i ;
std::random_shuffle( seq.begin(), seq.end() ) ;

The set is used so I don't have to search the vector. You'll notice the set isn't returned from the function, but the vector is modified.

It isn't guaranteed not to change non-duplicates, but it does preserve the order.
You could use a set of pointers (or iterators) to the vector elements, built with a custom comparison function that compares what the pointers are pointing to:

struct IndirectCompare {
    bool operator()(int* p1, int* p2) const { return *p1 < *p2; }
};
std::vector<int> uniquefy(std::vector<int> v)
{
    std::set<int*, IndirectCompare> s;
    for(int& n : v) // or whatever loop your compiler supports
        while(!s.insert(&n).second)
            n = rand();
    return v;
}


demo: http://ideone.com/owQOZ
A std::reference_wrapper<> would do quite nicely:

std::vector<int> uniquefy( std::vector<int> v )
{
    std::set< std::reference_wrapper<int> >  s ;
    for( int& n : v ) while( !s.insert(n).second ) n = rand() ;
    return v ;
}


It too won't meet the original requirement:
The problem I see with the rough code is that the random number will also be at the end of the array/container rather than randomly. I want to replace either the first or the second duplicate value.
These are great suggestions. I have a lot to learn.
@cire: why do you use std::vector<int>::size_type to iterate through the for loop in your example? I would have used an int. What advantage does size_type offer over int? Or unsigned int, (since the online documentation states member type size_type is an unsigned integral type). I do see now how the function preserves order.

@JLBorges: The code that generates the vectors will create duplicates on occasion, and this can't be avoided. Hopefully only one, but it could potentially duplicate four of the values. I do like the random sequence generation code, and I think I can use that. Does this need to be seeded to be random?
> Does this need to be seeded to be random?

All pseudo random number generators have to be seeded with a reasonably random value. For each generator, this needs to be done once, and only once. For example:

#include <random>
#include <ctime>

// return random int in [min_value,max_value]
int random_int( int min_value, int max_value ) 
{
    static std::mt19937 twister( std::time(nullptr) ) ;
    std::uniform_int_distribution<> distribution( min_value, max_value ) ;
    return distribution(twister) ;
}




> but it could potentially duplicate four of the values.
> I want to replace either the first or the second duplicate value.

Something like this would give a rough solution.

(Probabilities would not be 100% accurate if triplicates are encountered. And like in all the earlier solutions posted so far, a value which was not a duplicate in the original sequence could get replaced. For a version which would not have these issues, I can't - at least at this moment - think of anything better than sort - replace duplicates - restore original order)

#include <vector>
#include <map>
#include <iostream>

std::vector<int> uniquefy( std::vector<int> seq )
{
    std::map< int, std::size_t > map ;

    for( std::size_t i = 0 ; i < seq.size() ; ++i )
    {
        auto pair = map.insert( std::make_pair( seq[i], i ) ) ;
        if( !pair.second ) // duplicate
        {
            int v ;
            do v = random_int( 100 , 999 ) ; while( map.find(v) != map.end() ) ;
            seq[i] = v ;
            map[v] = i ;
            if( random_int(0,1) ) // with probability 0.5
            {
                auto iter = pair.first ;
                std::swap( seq[ iter->second ], seq[i] ) ;
                map[ iter->first ] = i ;
                map[v] = iter->second ;
            }
        }
    }

    return seq ;
}

int main()
{
    std::vector<int> v = { 111, 112, 113, 112, 111, 113, 113, 112, 111 } ;
    for( int i : v ) std::cout << i << ' ' ; std::cout << '\n' ;
    v = uniquefy(v) ;
    for( int i : v ) std::cout << i << ' ' ; std::cout << '\n' ;
}

smilingfrog wrote:
why do you use std::vector<int>::size_type to iterate through the for loop in your example? I would have used an int. What advantage does size_type offer over int? Or unsigned int, (since the online documentation states member type size_type is an unsigned integral type). I do see now how the function preserves order.


Because that's the type std::vector returns from size() and expects for operator[]. Using int with sufficiently large vectors will have unexpected results: the number of elements a vector can hold may exceed the range of values an int can represent.
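As a small illustration of the point above (a hedged sketch; `sum_all` is a made-up name), the idiomatic index loop looks like this. With an int index, the `i < v.size()` comparison mixes signed and unsigned, and an int cannot address elements beyond INT_MAX:

```cpp
#include <vector>
#include <cassert>

// size_type is the unsigned type std::vector itself uses for sizes
// and indices, so this loop is warning-free at any vector size.
long long sum_all(const std::vector<int>& v)
{
    long long total = 0;
    for (std::vector<int>::size_type i = 0; i < v.size(); ++i)
        total += v[i];
    return total;
}
```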

A solution that preserves the values of non-duplicates by storing the position of duplicates and deferring replacement of them until after we know all values in the vector.

void uniqueify(std::vector<int>& v)
{
	typedef std::vector<int>::size_type vsize_t ;

	if (v.size() < 2 )
		return ;

	std::set<int> processed ;
	std::vector<vsize_t> dup_loc ; 

	processed.insert(v[0]) ;
	for ( vsize_t i=1;  i < v.size() ; ++i )
	{
		auto n = processed.size() ;

		processed.insert(v[i]); 

		if ( n == processed.size() )
			dup_loc.push_back(i) ;
	}

	for ( vsize_t i=0; i < dup_loc.size(); ++i )
	{
		auto n = processed.size() ;
		int value ;

		while ( n==processed.size())
		{
			value = get_rand();
			processed.insert(value) ;
		}

		v[dup_loc[i]] = value ;
	}
}
This will take care of every requirement: non-duplicates would not get replaced; triplicates, quadruplicates etc. would be handled correctly with the element to be replaced being picked with equal probability. The code is perhaps easier to write, but it is not any faster than the 'sort - replace duplicates - restore original order' that smilingfrog had originally suggested.
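For comparison, the 'sort - replace duplicates - restore original order' route can be sketched by sorting indices instead of values, so no restore pass is needed at all (a hedged sketch; `uniquefy_sorted` is a made-up name and rand() stands in for the random generator of your choice):

```cpp
#include <algorithm>
#include <cstdlib>
#include <set>
#include <vector>
#include <cassert>

// Sort indices by the values they refer to; equal values become
// adjacent in idx, so duplicates are easy to spot, and writing back
// through idx leaves the original element order intact.
void uniquefy_sorted(std::vector<int>& v)
{
    if (v.size() < 2) return;

    std::vector<std::size_t> idx(v.size());
    for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = i;
    std::sort(idx.begin(), idx.end(),
              [&v](std::size_t a, std::size_t b) { return v[a] < v[b]; });

    std::set<int> seen(v.begin(), v.end());   // every value already present
    int prev = v[idx[0]];
    for (std::size_t k = 1; k < idx.size(); ++k)
    {
        int cur = v[idx[k]];
        if (cur == prev)   // duplicate of the previous sorted value
        {
            int r;
            do r = std::rand(); while (!seen.insert(r).second);
            v[idx[k]] = r;
        }
        else prev = cur;
    }
}
```

Replacement values drawn this way can be anywhere in rand()'s range; bound them with a distribution such as the random_int() shown earlier if a particular range matters.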

#include <random>
#include <ctime>

// return random int in [min_value,max_value]
int random_int( int min_value, int max_value )
{
    static std::mt19937 twister( std::time(nullptr) ) ;
    std::uniform_int_distribution<> distribution( min_value, max_value ) ;
    return distribution(twister) ;
}


#include <map>
#include <vector>

std::vector<int> uniquefy( std::vector<int> seq, int min, int max )
{
    std::map< int, std::vector<std::size_t> > map ;
    for( std::size_t i = 0 ; i < seq.size() ; ++i ) map[ seq[i] ].push_back(i) ;

    for( auto& pair : map ) while( pair.second.size() > 1 )
    {
        int n ;
        do n = random_int( min, max ) ; while( map.find(n) != map.end() ) ;

        int randpos = random_int( 0, pair.second.size()-1 ) ;
        map[n].push_back( pair.second[randpos] ) ;
        seq[ pair.second[randpos] ] = n ;
        pair.second.erase( pair.second.begin() + randpos ) ;
    }

    return seq ;
}


#include <iostream>

int main()
{
    std::vector<int> v = { 111, 112, 113, 112, 111, 113, 113, 112, 111 } ;
    for( int i : v ) std::cout << i << ' ' ; std::cout << '\n' ;
    v = uniquefy( v, 100, 999 ) ;
    for( int i : v ) std::cout << i << ' ' ; std::cout << '\n' ;
}
That's... kind of horrible. Creating a vector for every value in seq? I'd say it's actually likely to be much slower than the sort/replace/restore route.
> I'd say it's actually likely to be much slower than the sort/replace/restore route.

I'd expect both to have execution times that are roughly comparable; both need to keep track of where the duplicates were originally (to be able to select one at random); both need to ensure that the replaced values are themselves not duplicates.


> That's... kind of horrible. Creating a vector for every value in seq?

Looks frightening when you see something like this for the first time, doesn't it? A sequence of about a million elements (with a statistical expectation of about 230K of them being non-unique), takes about 4 seconds on my machine. Yes, with a million vectors being constructed and destroyed, a million+ calls to push_back, and hundred thousand+ calls to erase. Using an unordered_map<> instead of a map<> would make it faster, though not by very much.
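The unordered_map<> swap mentioned above is almost mechanical, with one caveat: unlike std::map, inserting into an unordered_map can trigger a rehash that invalidates the iterator driving the range-for, so the sketch below reserves enough buckets up front (a hedged sketch; `uniquefy_hashed` is a made-up name and rand() stands in for the generator):

```cpp
#include <cstdlib>
#include <unordered_map>
#include <vector>
#include <cassert>

std::vector<int> uniquefy_hashed( std::vector<int> seq, int min, int max )
{
    std::unordered_map< int, std::vector<std::size_t> > map ;
    // the key count never exceeds seq.size(): every replacement adds one
    // fresh key, and replacements + original distinct keys <= seq.size().
    // Reserving up front prevents a rehash mid-iteration.
    map.reserve( seq.size() ) ;
    for( std::size_t i = 0 ; i < seq.size() ; ++i ) map[ seq[i] ].push_back(i) ;

    for( auto& pair : map ) while( pair.second.size() > 1 )
    {
        int n ;
        do n = min + std::rand() % ( max - min + 1 ) ;
        while( map.find(n) != map.end() ) ;

        std::size_t randpos = std::rand() % pair.second.size() ;
        map[n].push_back( pair.second[randpos] ) ;
        seq[ pair.second[randpos] ] = n ;
        pair.second.erase( pair.second.begin() + randpos ) ;
    }

    return seq ;
}
```

The std::map version above has no such caveat, since map insertions never invalidate existing iterators.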

Try it for yourself, with this modified main():
int main()
{
    enum { N = 1024*1024, MAXV = N * 4 } ;
    std::vector<int> v(N) ;
    for( int& i : v ) i = random_int( 1, MAXV ) ;
    {
        std::map< int, int > dup_cnt ;
        for( int i : v ) ++dup_cnt[i] ;
        int dups = 0 ;
        for( const auto& p : dup_cnt ) if( p.second > 1 ) dups += p.second ;
        std::cout << "#elements: " << N << '\n'
                  << "#total non-uniques: " << dups << '\n' ;
    }

    std::clock_t start = std::clock() ;
    v = uniquefy( v, 1, MAXV ) ;
    std::clock_t end = std::clock() ;

    std::cout << (end-start) / double(CLOCKS_PER_SEC) << " seconds\n" ;
}


This is what I've come up with so far.

It satisfies my requirements to take a vector of elements, test it for duplicates, and replace either the first or the second duplicate with a random number.

Longer than the neat bits of code above, but thank you for the instruction. I hope I have incorporated some of it. Suggestions on improvement welcomed!

#include <iostream>
#include <algorithm>
#include <vector>
#include <map>
#include <cstdlib>
#include <ctime>

using namespace std;

void display(int i) {cout << ' ' << i;}
int random_number(int n) {
    float r =((float)rand()/ RAND_MAX);
    return (int)(r*n);}

int main()
{
    srand( time(NULL));

    typedef vector<int>::size_type  v_size_type;
    typedef map<int, v_size_type>   map_type;

    typedef pair<v_size_type, v_size_type>      dup_pair_type;
    typedef vector<dup_pair_type>::size_type    d_p_size_type;

    int input_array[] = {1,12,8,1,7,15,20,9,12};
    vector<int> input_vector(input_array, input_array+(sizeof(input_array)/sizeof(int)) );
    vector<int> output;

    vector<dup_pair_type> location_of_duplicates;

    cout << "The starting vector is : " ;
    for_each(input_vector.begin(), input_vector.end(), display);
    cout << endl;

    map_type                        processed_map;
    map_type::iterator              map_it;
    pair<map_type::iterator, bool>  map_test;

    for (v_size_type i=0; i < input_vector.size(); i++) {
        map_test = processed_map.insert(pair<int, v_size_type>(input_vector[i], i));

        while (!map_test.second) {
            cout << input_vector[i] << " at position " << i << " is a duplicate value.";
            map_test = processed_map.insert(pair<int, v_size_type>(random_number(100),i));
            for (v_size_type j = 0; j < i; j ++){
                if (input_vector[j] == input_vector[i]) {
                    cout << " The duplicate is at position " << j << '\n';
                    pair<v_size_type, v_size_type>  duplicate_pair(j,i);
                    location_of_duplicates.push_back(duplicate_pair);
                    break;
                }
            }
        }
    }

    cout << endl;
    cout << " Map contains " << processed_map.size() << " values:\n";
    cout << "There are " << location_of_duplicates.size() << " duplicates\n";

    output = input_vector;
    for(map_it = processed_map.begin(); map_it != processed_map.end(); map_it++){
        int value = (*map_it).first;
        v_size_type vector_position = (*map_it).second;

        output[vector_position] = value;
    }

    for (d_p_size_type i = 0; i < location_of_duplicates.size(); i ++) {
        if(random_number(2)){   //((random swap));
            swap(output[location_of_duplicates[i].first], output[location_of_duplicates[i].second]);
        }
        cout << location_of_duplicates[i].first << ',' << location_of_duplicates[i].second << '\n';
    }

    for_each(output.begin(), output.end(), display);
    return 0;
}
Looks frightening when you see something like this for the first time, doesn't it? A sequence of about a million elements (with a statistical expectation of about 230K of them being non-unique), takes about 4 seconds on my machine. Yes, with a million vectors being constructed and destroyed, a million+ calls to push_back, and hundred thousand+ calls to erase. Using an unordered_map<> instead of a map<> would make it faster, though not by very much.


I will admit I'm surprised it didn't perform worse than it did when I tested it, but it was the extra time overhead + resource usage combination that I really boggled at. It simply wasn't necessary.

The last code I posted, modified slightly to use your random_int(), did average about 2/3 of the execution time on my system with both g++ and VC++10 (optimizations enabled of course.) It does satisfy all of the constraints (order preserved, non-duplicates preserved and all duplicates replaced.)

typical output:
#elements: 1048576
#total non-uniques: 232754
JLBorges: 1.517 seconds
Cire: 0.977 seconds

> It does satisfy all of the constraints (order preserved, non-duplicates preserved and all duplicates replaced.)

No, it certainly does not.

... The problem I see with the rough code is that the random number will also be at the end of the array/container rather than randomly. I want to replace either the first or the second duplicate value.




> The last code I posted, modified slightly to use your random_int(), did average about 2/3 of the execution time...

Yes.

And this code would have averaged exactly 0.00% of the execution time.
inline void uniqueify(std::vector<int>& v) {}

Unfortunately, like your code, it too does not meet the requirements.
Did you bother running the code I referred to? Hell, did you bother spending 90 seconds tracing through the execution logic?

Did you notice that the portion of post you quoted didn't, in fact, refer to any of the posts that I made and occurred well before the last post with code that I made?

#include <iostream>
#include <vector>
#include <iomanip>
#include <random>
#include <ctime>
#include <set>


int random_int( int min_value, int max_value )
{
    static std::mt19937 twister( std::time(nullptr) ) ;
    std::uniform_int_distribution<> distribution( min_value, max_value ) ;
    return distribution(twister) ;
}

void ems_uniquefy(std::vector<int>& v, int rmin, int rmax)
{
	typedef std::vector<int>::size_type vsize_t ;

	if (v.size() < 2 )
		return ;

	std::set<int> processed ;
	std::vector<vsize_t> dup_loc ; 

	processed.insert(v[0]) ;
	for ( vsize_t i=1;  i < v.size() ; ++i )
	{
		auto n = processed.size() ;

		processed.insert(v[i]); 

		if ( n == processed.size() )
			dup_loc.push_back(i) ;
	}

	for ( vsize_t i=0; i < dup_loc.size(); ++i )
	{
		auto n = processed.size() ;
		int value ;

		while ( n==processed.size())
		{
			value = random_int(rmin, rmax);
			processed.insert(value) ;
		}

		v[dup_loc[i]] = value ;
	}
}

std::ostream& print_on(std::ostream& os, std::vector<int> v)
{
	for ( auto i = v.begin(); i< v.end() ; ++i )
	{
		os << std::setw(3) << *i ;
	}
	return os << '\n' ;
}

int main()
{
	int arr1[20] = {  3,  7, 10, 20,  5,  3,  9,  8, 17, 19,  5, 10, 20, 11,  9,  2,  7,  2, 13 };
	int arr2[20] = { 13,  8, 19,  2,  7, 10,  8, 20, 19,  3,  5,  2, 19, 10,  4,  13, 9, 10,  2 };

	std::vector<int> v1(arr1, arr1+20) ;
	std::vector<int> v2(arr2, arr2+20) ;

	std::vector<int> v1_copy(v1) ;
	std::vector<int> v2_copy(v2) ;

	ems_uniquefy(v1, 1, 20) ;
	ems_uniquefy(v2, 1, 20) ;

	print_on(std::cout, v1_copy) ;
	print_on(std::cout, v1) ;

	std::cout << "\n\n" ;

	print_on(std::cout, v2_copy) ;
	print_on(std::cout, v2) ;
}


Run that as often as you care to. You will see no constraint violations. This is the version I used to test with (replete with the modifications to use your random_int() - in the ems_uniquefy signature and in the replacement loop.)


> Hell, did you bother spending 90 seconds tracing through the execution logic?

Yes, and about 90 seconds was all that was required to convince me that it does not meet the requirement.


> Did you bother running the code I referred to?

No, I didn't.

But just to humour you, just this once, I've now run it a million times. And it doesn't tell me anything more than what I already knew - it does not meet the basic requirements. Nothing surprising in that, since your code makes no attempt to meet it.

I've used this modified main; I was not going to waste even more time poring over the output of multiple runs.

#include <cassert>

int main()
{
	int arr1[20] = {  3,  7, 10, 20,  5,  3,  9,  8, 17, 19,  5, 10, 20, 11,  9,  2,  7,  2, 13 };
	int arr2[20] = { 13,  8, 19,  2,  7, 10,  8, 20, 19,  3,  5,  2, 19, 10,  4,  13, 9, 10,  2 };

    for( int i=0 ; i<1000000 ; ++i )
    {
        std::vector<int> v1(arr1, arr1+20) ;
        std::vector<int> v2(arr2, arr2+20) ;

        ems_uniquefy(v1, 1, 20) ;
        assert( v1[0] == 3 ) ; // this assertion will never fail
        // because the duplicate value 3 at position 0 will NEVER be replaced

        ems_uniquefy(v2, 1, 20) ;
        assert( v2[1] == 8 ) ; // this assertion will never fail
        // because the duplicate value 8 at position 1 will NEVER be replaced
    }
}
