Finding Duplicates in sets of data

I briefly saw someone post about this the other day and it encouraged me to write some code to do it. Any suggestions to further/alter it, or just general feedback would be great!

#include <iostream>
#include <vector>
#include <algorithm>
#include <string>

template <typename T>
void findDuplicates(std::vector<T> Values) {
	std::vector<T> element(Values.size());
	std::vector<int> elementAmount(Values.size());
	typename std::vector<T>::iterator it;
	int i = 0;

	for(typename std::vector<T>::iterator onceCycle = Values.begin(); onceCycle != Values.end(); ++onceCycle) {
		// search only the filled part of 'element', so default-constructed
		// entries can't be mistaken for values we've already seen
		it = std::find(element.begin(), element.begin() + i, *onceCycle);
		if(it == element.begin() + i) {
			element[i] = *onceCycle;
			elementAmount[i++] = (int)std::count(Values.begin(), Values.end(), *onceCycle);
		}
	}

	for(int a = 0; a < i; ++a) {
		std::cout << "The element '" << element[a] << "' is ";
		elementAmount[a] == 1 ? std::cout << "not duplicated\n"
		                      : std::cout << "duplicated " << elementAmount[a] << " times\n";
	}
}
         
int main() {
	// * * * * I * N * T * E * G * E * R *S * * *  *
	std::vector<int> myVectorInt(7);

	myVectorInt[0] = 65;
	myVectorInt[1] = 66;
	myVectorInt[2] = 67;
	myVectorInt[3] = 67;
	myVectorInt[4] = 66;
	myVectorInt[5] = 65;
	myVectorInt[6] = 68;

	findDuplicates(myVectorInt);
	std::cout << '\n';

	// * * * * * * * C * H * A * R * S * * * * * * *
	std::vector<char> myVectorChar(7);

	for(int i=0;i<7;++i)
		myVectorChar[i] = (char)myVectorInt[i];

	findDuplicates(myVectorChar);
	std::cout << '\n';

	// * * * * * S * T * R * I * N * G * S * * * * * 
	std::vector<std::string> myVectorString(7);

	myVectorString[0] = "Hello";
	myVectorString[1] = "there";
	myVectorString[2] = "you";
	myVectorString[3] = "over";
	myVectorString[4] = "there";
	myVectorString[5] = "yes";
	myVectorString[6] = "you";

	findDuplicates(myVectorString);

	std::cin.get();
	return 0;
}
I'm sensing performance war here :)
Here is mine:
#include <map>

template<class T>
void findDuplicates_r0mai(const std::vector<T>& v) {

    std::map<T, unsigned> counter;

    for(unsigned i = 0; i < v.size(); ++i) {
        ++counter[v[i]];
    }

    for(typename std::map<T, unsigned>::const_iterator it = counter.begin(); it != counter.end(); ++it) {
        std::cout << "Element '" << it->first << "' found " << it->second << " times\n";
    }
}
Haha, I'm going to test this later with 1,000,000 elements and see whose is faster!
Okay, well I'll chip in mine. It's destructive in that it sorts the input vector:
#include <vector>
#include <iostream>
#include <algorithm>

template <typename T>
void findDuplicates(std::vector<T>& values)
{
	typedef typename std::vector<T>::iterator iter;

	std::sort(values.begin(), values.end());
	iter i = values.begin();
	iter end = values.end();
	while(i != end)
	{
		std::pair<iter, iter> p = std::equal_range(i, end, *i);
		std::cout << *p.first << " has " << p.second - p.first << '\n';
		i = p.second;
	}
}

int main()
{
	std::vector<int> values;
	values.push_back(65);
	values.push_back(66);
	values.push_back(67);
	values.push_back(67);
	values.push_back(66);
	values.push_back(65);
	values.push_back(68);

	findDuplicates(values);
}
65 has 2
66 has 2
67 has 2
68 has 1
mcleano wrote:
Haha, I'm going to test this later with 1,000,000 elements and see whose is faster!

My guess is that R0mai's will be faster :P std::find and std::count perform a linear search, while std::map::operator[] performs a binary search.

I also believe Galik could do it faster :P equal_range performs a couple of binary searches to find the lower and upper bounds. Since the vector is already sorted and all you want is the number of times each element appears, you could just traverse the vector once, counting elements as you go: when the element value changes, output the old value and its count, then zero the count and move on.

EDIT: Mmmm... No, I'm not really sure that would be faster...

EDIT 2: Ok, let me take back what I said about Galik's approach haha. Each equal_range call is O(log n), while scanning a group element by element is O(n) in the worst case, clearly slower for big n.
r0shi you can't pick sides lol! Write your own and join in; I'll test them once everyone's is in.

std::find performs linear search while std::map::operator[] performs binary search.

I actually looked at using the binary_search() function found in <algorithm>, but it said the set had to be sorted, which gave me two problems:

1) After using std::sort it was still giving me a sorting error lol?
2) As Galik said, it adds more "destructive" computation.

Hmm, I guess mine might lose performance points for keeping a vector record of the duplicates. On the other hand, if you wanted to do further computation on the data, I gain points :)
It would be interesting to see timings for both destructive and non-destructive versions. You could simply remove the reference from the parameter forcing a copy. You would have to normalise other things as well like the precise output format.
Ok, here's a slightly faster (though of same complexity) variation of Galik's approach:

#include <vector>
#include <iostream>
#include <algorithm>

template <typename T>
void findDuplicates(std::vector<T>& values)
{
	typedef typename std::vector<T>::iterator iter;

	std::sort(values.begin(), values.end());
	iter i = values.begin();
	iter end = values.end();
	while(i != end)
	{
		iter it = std::upper_bound(i, end, *i);
		std::cout << *i << " has " << it - i << '\n';
		i = it;
	}
}

int main()
{
	std::vector<int> values;
	values.push_back(65);
	values.push_back(66);
	values.push_back(67);
	values.push_back(67);
	values.push_back(66);
	values.push_back(65);
	values.push_back(68);

	findDuplicates(values);

	std::cout << "hit enter to quit...";
	std::cin.get();
	return 0;
}

I also have something else in mind... I won't tell you now what it is, but if it works it's gonna be even faster but it'll take up some memory. I'll post it after I manage to implement it and prove to myself it is good :P
r0shi post it up when you get a chance and I'll test it now, but these are the results so far using the following test:

	// (fragment from inside main(); needs <cstdlib> and <ctime>)
	srand((unsigned)time(NULL));
	std::vector<int> randNums(100000);

	for(int i=0; i<100000; ++i)
		randNums[i] = rand() % 99 + 1;
	
	clock_t start = clock();

	std::cout << "Starting clock... \n\n";

	findDuplicates_withMcleano(randNums);

	std::cout << "\nTime elapsed: " << ((double)clock() - start) / CLOCKS_PER_SEC;


Each test is done 3 times, then an average is formed.

mcleano
Test 1: 14.061
Test 2: 14.503
Test 3: 12.534

avg: 13.699

R0mai
Test 1: 0.648 :O
Test 2: 0.632
Test 3: 0.684

avg: 0.655

Galik
Test 1: 2.674
Test 2: 2.496
Test 3: 2.553

avg: 2.574

m4ster r0shi
Test 1: 2.666
Test 2: 2.653
Test 3: 2.198

avg: 2.506

Wow keeping vector records really is expensive!
I am thinking that R0mai's method is only as slow as a sort, which is basically all it does by moving the unsorted vector into the sorted key table of the std::map, keeping tally on the way.
@Galik:

Indeed.

@mcleano:

Don't count that method as mine. It's practically the same as Galik's.
Galik's solution hasn't got any additional memory requirements (though only if the caller doesn't mind the passed vector being sorted). Mine uses, in the worst case, N*sizeof(T) + N*sizeof(unsigned) memory, which might make my solution worse than Galik's in some cases.
It's interesting to see the different solutions people come up with and what you can learn from other people's code. I personally have never used std::pair or std::map, so I would never have used them for this task, but I will definitely research them, so thank you!

@r0shi ... sure thing, are you still working on your newer version?

*sigh* it sucks to come last, on both performance and memory consumption :(
mcleano wrote:
are you still working on your newer version?

No, I stopped after seeing R0mai's times...
Ok, this is a very primitive version of what I have in mind. It only works for bytes (unsigned chars). Its complexity is O(n), so it's even faster than R0mai's O(n log n) :P I have some ideas on how to expand it so that it works for 16- and 32-bit unsigned integers without using (much) more memory, but maybe you have better ones. Let's build this together.

#include <iostream>
#include <vector>
#include <cstdlib>
#include <ctime>
using namespace std;

typedef unsigned char UCHAR;
const bool print_results=false;
const int SIZE=1000000;

void count(vector<UCHAR> & byte_vec)
{
    int counter[256]={0};

    vector<UCHAR>::iterator it;

    clock_t start=clock();

    for (it=byte_vec.begin(); it!=byte_vec.end(); it++)
        counter[*it]++;

    clock_t stop=clock();

    // clock() returns ticks, not milliseconds; convert via CLOCKS_PER_SEC
    cout << "time taken: " << 1000.0*(stop-start)/CLOCKS_PER_SEC << " ms" << endl;

    if (print_results)
    {
        cout << "hit enter to view the results..." << endl;
        cin.get();

        for (int i=0; i<256; i++)
            cout << i << ' ' << counter[i] << endl;
    }
}

int main()
{
    vector<UCHAR> byte_vec;

    byte_vec.resize(SIZE);
    for (int i=0; i<SIZE; i++)
        byte_vec[i]=rand()%256;

    count(byte_vec);

    cout << "hit enter to quit..." << endl;
    cin.get();

    return 0;
}

closed account (EzwRko23)
I know this is NOT C++, but you guys always say that C++ is so fast.
So, here is the Scala version:

import collection.mutable.Map
def countDuplicates[T](col: Iterable[T]) = {
  val m = Map[T, Int]()
  for (elem <- col)
    m.put(elem, m.getOrElse(elem, 0) + 1)
  m
}


I changed R0mai's code not to print the results (because I don't want to benchmark stdio) and also repeated each iteration 50 times. My code was also run 50 times and the average was taken. The results are for a vector of 1,000,000 integers from the range 1..9999.

I compiled R0mai's code with gcc 4.2.x and optimized with -O2 and additional processor-specific flags (for a Core 2 Duo).

R0mai's result:


Starting clock...


Time elapsed: 0.1479
Process returned 0 (0x0) execution time : 7.426 s
Press any key to continue.


Scala's result:

"C:\Program Files\Java\jdk1.6.0_21\bin\java" -server -Xmx512M -Didea.launcher.port=7537... com.intellij.rt.execution.application.AppMain pl.edu.pw.ii.cmsbb.DuplicatesTest
Average from 50 iterations: 0.11664 s

Process finished with exit code 0



I could write the code in just one line, in a functional style, but then it is slightly slower (though still linear and still significantly faster than most of the C++ implementations posted here):
 
array groupBy identity map (x => (x._1, x._2.size))


I haven't timed m4ster r0shi's last implementation because it is not universal - it requires elements to come from a narrow numeric range rather than any arbitrary set, so that is cheating and should not count ;)
How much RAM did it take to run that, and does that figure include loading the execution environment and the program itself?
closed account (EzwRko23)
1. I haven't tuned it for RAM, but comparing RAM usage for such small microbenchmarks is useless - it would be benchmarking just constant runtime footprints instead of algorithms.

2. Time to load environment is similar to the time of C++, below 0.2 s. So it really doesn't count when 50 iterations are made.
it would be benchmarking just constant runtime footprints instead of algorithms.
That's kinda the idea.

below 0.2 s
I don't know. You seem to be invoking a preloaded server, there.
closed account (EzwRko23)
1. Ok, I measured the minimal JVM heap required to run my example faster than R0mai's code: 8200 kB. The C++ app requires 4200 kB of heap. So, less than a 2x memory overhead when I want my code to run fast. But I was also able to run it with much less memory (5 MB), making it a little slower. When I increased the problem size 10x, using an array 40 MB large, the overhead dropped: I could match the speed of the C++ version with just 58 MB, so a 45% memory overhead for GC. But I'm by no means an expert at tuning JVMs. It is probably possible to get even lower.


That's kinda the idea


So, are you ready to compare the memory footprint of your entire OS with the memory footprint taken by the JVM and my OS? Your OS is the host of your C++ app, and your C++ app uses the code that is there. You won't be able to run your C++ without the OS, just as I won't be able to run mine without an OS (and possibly a JVM, but that is not a hard requirement, since static Java compilers exist). So either we compare the total footprint of the runtimes as a whole, or just the heaps. Comparing the heap size of the C++ app with the runtime part of the Java app is pointless.

2. No, I'm not. Warm JVM startup times are very fast. What is slow is loading a 200 MB app from disk, like IntelliJ IDEA or Eclipse. But that is slow in any language.