Removing Duplicates

Pages: 12
Many thanks Kemort and JlBorges. I've learnt a huge amount from this thread, despite not being the originator. I guess the timing evidence from JLBorges speaks for itself and I will endeavour to get comfortable with the STL.

Best.
closed account (48T7M4Gy)
:)
> I guess the timing evidence from JLBorges speaks for itself
> and I will endeavour to get comfortable with the STL.

One good reason to use the facilities available in the standard library is performance: we can reasonably expect that the implementers of the library would have been optimising their code bit by bit over a long period of time.

Though it is possible for us to write home-grown code which would give performance comparable to (and in a few cases, better than) the equivalent code in the standard library, the library code is more robust, more mature; it is code that has been used in a large number of production code bases.

Another good reason to use the library facilities is that idiomatic constructs make the code clearer and more transparent. Elaborated in this post: http://www.cplusplus.com/forum/general/203108/#msg965605

Here are two hand coded versions, which incorporate the performance hints mentioned earlier in http://www.cplusplus.com/forum/beginner/203059/#msg965764
We can see that their performance is comparable to the one using the standard algorithm.
(Moving instead of copying is omitted - it doesn't make a difference when values are of a scalar type like int).

Note that neither the standard library algorithm nor the hand coded versions require that we ever need to go back to an earlier position by using a decrement operation on the position or the pointer/iterator. The standard specifies only that the iterator used with std::remove must meet the requirements of ForwardIterator - it need not even have a decrement operator.

Also note that in today's implementations, using an iterator (pointer) for iteration is typically faster than iterating using the subscript operator.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
void use_std_algo_remove_dup( std::vector<int>& num )
{
    auto end_unique = std::end(num) ;
    for( auto iter = std::begin(num) ; iter != end_unique ; ++iter )
    {
        // http://en.cppreference.com/w/cpp/algorithm/remove
        end_unique = std::remove( iter+1, end_unique, *iter ) ;
    }
    num.erase( end_unique, std::end(num) ) ;
}

std::size_t by_pos_find( const std::vector<int>& vec, std::size_t from, std::size_t till, int value )
{
    for( ; from < till ; ++from ) if( vec[from] == value ) break ;
    return from ;
}

std::size_t by_pos_remove( std::vector<int>& vec, std::size_t from, std::size_t till, int value )
{
    from = by_pos_find( vec, from, till, value ) ;
    if( from != till )
    {
        for( auto i = from+1 ; i < till ; ++i )
            if( vec[i] != value ) vec[from++] = vec[i] ;
    }
    return from ;
}

void hand_coded_by_pos_remove_dup( std::vector<int>& num )
{
    auto end_unique = num.size() ;
    for( std::size_t i = 0 ; i != end_unique ; ++i )
        end_unique = by_pos_remove( num, i+1, end_unique, num[i] ) ;
    num.resize(end_unique) ;
}

int* by_ptr_find( int* from, int* till, int value )
{
    for( ; from < till ; ++from ) if( *from == value ) break ;
    return from ;
}

int* by_ptr_remove( int* from, int* till, int value )
{
    from = by_ptr_find( from, till, value ) ;
    if( from != till )
    {
        for( auto ptr = from+1 ; ptr < till ; ++ptr )
            if( *ptr != value ) *from++ = *ptr ;
    }
    return from ;
}

void hand_coded_by_ptr_remove_dup( std::vector<int>& num )
{
    if( num.size() < 2 ) return ;

    auto end_unique = &num.front() + num.size() ;
    for( auto ptr = &num.front() ; ptr != end_unique ; ++ptr )
        end_unique = by_ptr_remove( ptr+1, end_unique, *ptr ) ;

    num.resize( end_unique - &num.front()  ) ;
}


LLVM and GNU:
     use_std_algo_remove_dup:   260 millisecs.
hand_coded_by_pos_remove_dup:  1200 millisecs.
hand_coded_by_ptr_remove_dup:   300 millisecs.
ok



     use_std_algo_remove_dup:   150 millisecs.
hand_coded_by_pos_remove_dup:   160 millisecs.
hand_coded_by_ptr_remove_dup:   140 millisecs.
ok

http://coliru.stacked-crooked.com/a/08110b76e91635d8

Microsoft:
     use_std_algo_remove_dup:   128 millisecs.
hand_coded_by_pos_remove_dup:   175 millisecs.
hand_coded_by_ptr_remove_dup:   122 millisecs.
ok

http://rextester.com/DTXYA52427
Last edited on
Short, simple and very fast (and preserves the original order):
amortised O(N) time (where N is the size of the sequence),
however, needs O(M) space (where M is the number of unique values).

1
2
3
4
5
6
7
void use_std_hash_remove_dup( std::vector<int>& num )
{
    std::unordered_set<int> set ;
    std::size_t pos = 0 ;
    for( int v : num ) if( set.insert(v).second ) num[pos++] = v ;
    num.resize(pos) ;
}


C++98, we can use std::set instead: slower, O( N log N )

LLVM and GNU:
       use_std_algo_remove_dup: 10070 millisecs.
C++11: use_std_hash_remove_dup:   110 millisecs.
C++98:  use_std_set_remove_dup:   370 millisecs.
ok



       use_std_algo_remove_dup:  6360 millisecs.
C++11: use_std_hash_remove_dup:    30 millisecs.
C++98:  use_std_set_remove_dup:   450 millisecs.
ok

http://coliru.stacked-crooked.com/a/8521d63329ec8b18

Microsoft:
       use_std_algo_remove_dup:  4753 millisecs.
C++11: use_std_hash_remove_dup:    86 millisecs.
C++98:  use_std_set_remove_dup:   476 millisecs.
ok

http://rextester.com/IEJ70062
Topic archived. No new replies allowed.
Pages: 12