Efficient way to keep tallies?

Hey all,

I previously wrote a class that uses std::map to create a counter object, which works something like this: a blank counter is just {}; if I call increment_count("red", 1), the counter becomes {"red": 1}; then after increment_count("blue", 2), it becomes {"red": 1, "blue": 2}. However, when creating thousands of these and accessing their data, I get terrible performance. I don't know if this is because of the inherent weaknesses of maps or because my code sucks.

If the former is true, what's a good structure to use to efficiently perform these kinds of operations (including stuff in boost libs)?

Here's what I wrote. Please alert me to ANY questionable/bad practices I may be engaging in, and let me know if there's any way to improve any of the functions I wrote.

Thanks so much!

#ifndef COUNTER_H_INCLUDED
#define COUNTER_H_INCLUDED

//======================
// INCLUDED DEPENDENCIES
#include <iostream>
#include <map>
//======================


//======================
// INTERFACE
template <class Key, class Val>
class Counter
{
protected:
	std::map<Key, Val> counter;
	Val sum;
public:
	Counter();
	Val& operator[] (Key);
	void increment_count(Key,Val);
	void increment_all(Val incrementVal = 1);
	void normalize();
	void print();
	Key argmax();
};
//======================

template<class Key, class Val>
Counter<Key,Val>::Counter()
{
	sum = Val();
}

template <class Key, class Val>
Val& Counter<Key,Val>::operator[] (Key x)
{
	if (counter.find(x) != counter.end())
		return counter[x];
	else
	{
		counter.insert(std::pair<Key,Val>(x, Val()));
		return counter[x];
	}
}

template <class Key, class Val>
void Counter<Key,Val>::increment_count(Key key, Val incrementVal)
{
	if (counter.find(key) != counter.end())
	{
		counter[key]+=incrementVal;
		sum+=incrementVal;
	}
	else
	{
		counter.insert(std::pair<Key,Val>(key,incrementVal));
		sum+=incrementVal;
	}
}

template <class Key, class Val>
void Counter<Key,Val>::increment_all(Val incrementVal)
{
	for (typename std::map<Key, Val>::iterator it = counter.begin(); it != counter.end(); ++it)
	{
		it->second += incrementVal;
		sum += incrementVal;
	}
}

template <class Key, class Val>
void Counter<Key,Val>::normalize()
{
	for (typename std::map<Key, Val>::iterator it = counter.begin(); it != counter.end(); ++it)
		it->second /= sum;
}

template<class Key, class Val>
void Counter<Key,Val>::print()
{
	std::cout << "{";
	typename std::map<Key, Val>::iterator it = counter.begin();
	if (it != counter.end())
	{
		std::cout << it->first << ": " << it->second;
		++it;
	}
	for (; it != counter.end(); ++it)
		std::cout << ", " << it->first << ": " << it->second;
	std::cout << "}\n";
}

template<class Key, class Val>
Key Counter<Key,Val>::argmax()
{
	// Note: assumes the counter is non-empty.
	typename std::map<Key, Val>::iterator it = counter.begin();
	Val max = it->second;
	Key argmax = it->first;
	for (; it != counter.end(); ++it)
		if (it->second > max)
		{
			max = it->second;
			argmax = it->first;
		}
	return argmax;
}

#endif 
if (counter.find(x) != counter.end())
	return counter[x];
else
{
	counter.insert(std::pair<Key,Val>(x, Val()));
	return counter[x];
}

Map's operator [] does exactly that. http://www.cplusplus.com/reference/stl/map/operator[]/
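
For reference, the whole function can collapse to a one-liner, since map's operator[] default-constructs and inserts a value on a miss and then returns a reference either way:

template <class Key, class Val>
Val& Counter<Key,Val>::operator[] (Key x)
{
	// std::map::operator[] inserts Val() for a missing key
	// and returns a reference to the mapped value.
	return counter[x];
}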
Oh wow, cool. Ok after shaving that off the time delay is still around 3 seconds. Is there anything else I can do to make the class efficient?
Without knowing how you are actually calling it, it is very hard to optimize. Things worth considering:

What is your key type? If it is a string as hinted in your first post, then operator< (used by map) is going to be slower than it would be for an integer type.

If increment_all is something you do fairly regularly, then store that amount in a separate running offset and change your operator[] to return the stored value plus the offset.
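
A self-contained sketch of that idea (OffsetCounter, get, and offset are made-up names, and it drops the sum bookkeeping for brevity):

#include <map>

// Sketch: increment_all becomes O(1) by storing one shared offset
// instead of touching every entry. Reads must go through get(),
// which adds the offset back on; note that keys never explicitly
// incremented also read as offset rather than zero.
template <class Key, class Val>
class OffsetCounter
{
	std::map<Key, Val> counter;
	Val offset;
public:
	OffsetCounter() : offset(Val()) {}
	void increment_all(Val v) { offset += v; }              // O(1), no loop
	void increment_count(Key k, Val v) { counter[k] += v; }
	Val get(Key k) { return counter[k] + offset; }          // offset applied on read
};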

If argmax is something you do a lot, then cache it and update the cache in calls to increment_count.
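
For example (a sketch; cachedMax and cachedArgmax are hypothetical new members, and the cache only stays valid while increments are non-negative):

template <class Key, class Val>
void Counter<Key,Val>::increment_count(Key key, Val incrementVal)
{
	counter[key] += incrementVal;
	sum += incrementVal;
	// Maintain the running maximum on the way in, so argmax() is O(1).
	if (counter[key] > cachedMax)
	{
		cachedMax = counter[key];
		cachedArgmax = key;
	}
}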

Consider a hashing container instead of a sorting one.
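
With a C++11 compiler that means std::unordered_map (older toolchains have std::tr1::unordered_map or boost::unordered_map), and the interface is close enough to std::map's that it's nearly a drop-in swap. A minimal sketch:

#include <iostream>
#include <unordered_map> // C++11; pre-C++11, use <tr1/unordered_map> or Boost

int main()
{
	// Hashing gives average O(1) lookup instead of map's O(log n);
	// the trade-off is that iteration order is unspecified.
	std::unordered_map<int, double> counter;
	counter[0] += 1.5;
	counter[1] += 2.5;
	std::cout << counter[0] << " " << counter[1] << "\n";
	return 0;
}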

Edit:

I had a problem where my key was actually a vector<int> with up to 200 elements, each between 0 and 1000, but many keys had common substrings. I have a specially written tree to store these, but it only supports increment and print, so it's possibly no good for you. I was putting around 5,000,000 different keys into it, though, and it ran fine. It took about 10 minutes to destruct the tree, mind you.
When I use "counter" I actually use ints as keys. Also, the only functions I use extensively are increment_count and normalize; argmax is just used once.

Do you think I'll see any performance increase if I implement the counter class using, say, vectors somehow instead of a mapping?

If you want to see the code where I use the counter I can post it up here.

Thanks!
A vector will be a lot better if there aren't big holes between the keys.
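
A sketch of that idea for small non-negative integer keys (VecCounter and get are made-up names; the key doubles as the index, so a missing entry costs nothing but a zero slot):

#include <cstddef>
#include <vector>

// Sketch: dense tally for small non-negative int keys.
class VecCounter
{
	std::vector<double> counts;
public:
	void increment_count(std::size_t key, double v)
	{
		if (key >= counts.size())
			counts.resize(key + 1, 0.0); // grow to cover the new key
		counts[key] += v;
	}
	double get(std::size_t key) const
	{
		return key < counts.size() ? counts[key] : 0.0;
	}
};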
Also the only functions I use extensively are increment_count and normalize
¿why is normalize called so often?
I suppose that you don't increment the same key over and over again (but use the delta)

Later, when you change the type of the key, consider passing it to the functions by constant reference instead of by value. I wonder if there is an easy way to establish that (basic types -> copy, structures -> constant reference).
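
(Boost does have a utility for exactly that dispatch: boost::call_traits<T>::param_type is plain T for cheap builtin types and const T& for class types. A sketch of how it could slot into the class:)

#include <boost/call_traits.hpp>
#include <map>

template <class Key, class Val>
class Counter
{
	std::map<Key, Val> counter;
public:
	// param_type resolves to Key for builtins (e.g. int) and to
	// const Key& for class types (e.g. std::string).
	typedef typename boost::call_traits<Key>::param_type KeyParam;
	void increment_count(KeyParam key, Val incrementVal);
	// ...
};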

¿What are the limits of the data?
I'm still not sure how to go about it with vectors because I'd really like a simple key/value pair association. I guess parallel vectors that emulate the key / value pairing?

Normalize is called often because I'm writing a Bayesian classifier: from a training image set, where each image has a classification (say "i" or "o"), I go through each pixel (each of which is either on or off) and increase the tally in the counter for that specific pixel and classification. Once all the images are done, I need to normalize each counter because I want a probability.

I suppose that you don't increment the same key over and over again (but use the delta)


What do you mean by 'use the delta'?

I debated constant ref vs. value passing, but I figured I wanted to be on the safe side and just copy (it'll take up more memory but it shouldn't slow me down significantly, should it?).

I'm not quite sure what you mean by the limits of the data.

Thanks!
The most sensible thing to do would be to use a profiler to find out where the bottleneck is.


Once all the images are done I need to normalize each counter because I want a probability.


That's not often, not compared to the number of calls to increment_count

What does the key generation function look like? If you have a significant bottleneck, that would be my guess. But it's best to run a profiler.
zeromeus wrote:
I'd really like a simple key/value pair association.

You can do that with an array too.

#include <iostream>

enum { RED = 0, GREEN = 1, BLUE = 2 };

int main()
{
    double counter[3] = { 0 }; // zero-initialize before accumulating

    counter[RED]   += 1;
    counter[GREEN] += 2;
    counter[BLUE]  += 3;

    std::cout << counter[RED]   << std::endl;
    std::cout << counter[GREEN] << std::endl;
    std::cout << counter[BLUE]  << std::endl;

    return 0;
}

EDIT:

What are the keys you use for this? Is their number fixed and known during compilation?

The keys are classification types, like "i" or "o", but I assign each a unique int like you did with the enum. The problem is, I don't want to hardcode the keys, because I'll be introducing new ones in the future. So no, the number is not fixed.
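
For example, I could assign the ids on the fly as new keys show up, something like this sketch (KeyInterner and id_for are just illustrative names):

#include <map>
#include <string>

// Sketch: hand out dense int ids at runtime instead of a
// hardcoded enum; each unseen key gets the next free index.
class KeyInterner
{
	std::map<std::string, int> ids;
public:
	int id_for(const std::string& key)
	{
		std::map<std::string, int>::iterator it = ids.find(key);
		if (it != ids.end())
			return it->second;
		int id = int(ids.size()); // next unused slot
		ids[key] = id;
		return id;
	}
};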

Also, for some reason when I run the profiler in visual studio the application just hangs and no data is collected.
Problem is, I don't want to hardcode the keys because I'll be introducing new ones in the future.


I don't understand the problem - in the future change the enum.

I don't think that's the problem though, as the number of comparisons a map does is logarithmic in the number of elements, and integer comparisons are fast. I still believe your bottleneck is somewhere else.

Also, for some reason when I run the profiler in visual studio the application just hangs and no data is collected.


Can't help here I'm afraid, I use gprof under linux.
I don't understand the problem - in the future change the enum.


It could be very inconvenient/impossible for future applications where both the number of keys and the type of key are unknown beforehand. If at all possible I'd like to keep an enum out of it.

Yea, I'm confused too. I don't think the map is the problem, but I don't know how else to make this leaner and meaner.
If you stick with the map, you should also fix this:

template <class Key, class Val>
void Counter<Key,Val>::increment_count(Key key, Val incrementVal)
{
	if (counter.find(key) != counter.end())
	{
		counter[key]+=incrementVal;
		sum+=incrementVal;
	}
	else
	{
		counter.insert(std::pair<Key,Val>(key,incrementVal));
		sum+=incrementVal;
	}
}

The way you do it now, if the key already exists, you search for it twice. You can just do this instead:

template <class Key, class Val>
void Counter<Key,Val>::increment_count(Key key, Val incrementVal)
{
    counter[key]+=incrementVal;
    sum+=incrementVal;
}
Oh, yeah, sorry, I realized that and fixed it before. OK, I ran the profiler, and apparently the hotspots are anywhere I use counter[key]. This leads me to believe that it's std::map that is responsible for the slowdown.

I'm going to try an unordered map and see if that helps anything. I don't really know what else to do, because the map is the ideal structure for this; I'm not sure how I could use a vector/deque/list as cleanly as, if not more cleanly than, the map.
With an integer key? I'm surprised.

How many elements are in the map?
How many times are you calling operator[]?
What kind of time delay are you getting?
I am too. There are only 2 elements in each map as of now, but I call operator[] around 1600 times (which still shouldn't be anything close to major). I'm getting around 1-2 seconds of delay.
I think you have some kind of mistake somewhere. With optimization disabled, the following code gives the output and execution times shown below.


#include <map>
#include <string>
#include <cstdlib>
#include <iostream>

int main() {
        std::map<std::string, int> counts;

        for(int i=0;i!=100000;++i) {
                double x = double(rand())/double(RAND_MAX);
                double y = double(rand())/double(RAND_MAX);

                if(y < x*x) {
                        ++counts["hit"];
                } else {
                        ++counts["miss"];
                }
        }

        std::cout << counts["hit"] << ":" << counts["miss"] << std::endl;
        return 0;
}


kev82@debian:~$ g++ -O0 test.cc 
kev82@debian:~$ time ./a.out 
33382:66618

real	0m0.067s
user	0m0.052s
sys	0m0.008s
OK, now I'm really confused. My counter keeps up with the code you provided when doing the same task, and the number of calculations involved is far greater than what I ultimately need to perform.

Well, I've got some poking around to do.
If the code is simple to compile/run on linux, upload it somewhere and I will profile it for you, and tell you where the bottleneck is.
Alright, here's the link:

http://www.2shared.com/file/gWOlcoFG/bayes.html

Thanks so much!