Listless pigeonholing.

Hello,

I'm trying to analyze a 2D (cartesian) space. On the plane are a number (N) of points identified by their coordinates (x,y) and a colour, modeled by a value [0,C[. I'm dividing up the space into blocks (B) of a certain size to get a grid. Then, I wish to know which colours are present in any given block. C can be several thousands, N several millions; B could vary heavily, but will be significantly smaller than N.

The obvious algorithm for divide-into-blocks is Pigeonholing, which adds each point to a linked list per block. However, I don't care about the points, just the (unique) colours.

This would require me to check for each node whether its colour is already present in the block. With a linked list, this could cost me up to C comparisons per node (C*N comparisons), though the actual amount of colours per block will probably not be too big.

A second option would be to allocate C bools (or bits) for each colour in each Block and set it to 1 if a colour is present in that block. This would cost me C space for all B blocks (C*B). That's quite a bit of space, especially when the amount of colours per block is far lower than C.


So, does anyone have an idea how to efficiently do this? In the average case, the linked list option will be rather computationally efficient, but I'm not a fan of the memory overhead they bring, especially since the elements are so small (int/short).

The gridding will be done frequently, but you can assume that (nearly) all the data gathered during the gridding will be discarded afterwards due to changes in the space and gridding settings. [I might be able to minimize this, if it helps!]

Any insights are welcome!
If this were my project, I'd probably start by creating a std::set<> where comparison would rely solely on the color value. In the end, you'll have a std::set<> for each block with a single point instance (and its associated color value) for each unique color.

The bool option is not very good if you are using 24-bit or 32-bit color depths. While you could use a std::vector<bool> for this to minimze the space (it is specialized to use one bit instead of one byte per entry), the actual enumeration of colors inside a block would require a loop of O(ColorDepth), checking if the bool flag is set or not.
Last edited on
The set option seems "dangerous"; the C++ ref page mentions BSTs as a common implementation. I fear the overhead of insertions (and balancing?) could easily outweigh the benefit of logarithmic search complexity.

The first sentence of your second paragraph makes me regret using the word "colour". Basically, it's just a number, and a small one at that (few thousands, so probably a 'short'). I'm not sure how vector<bool> (or any kind of bitfield) is implemented, though. Wouldn't binary operators be sufficient to check/set bitflags? Or is it still expensive to access single bits? (I guess I could technically do without checking the flag before setting, and just count at the 1's at the end.)
Well, if linear O(n) enumeration is not a problem for the "color" value, then how about a std::bitset<>? http://www.cplusplus.com/reference/stl/bitset/

Regarding BST's, they are the overall beasts in performance for both read and write access, especially for large sets. If you consider that your set won't be large enough, see if you can maintain an ordered std::vector<>, or write your own set class.
I'll give them both a try. Maybe I can implement both and pick a method based on density of colours per block. Thanks for the tips!
How big are your Bs? What is wrong with a std::set <>?
http://www.cplusplus.com/reference/stl/set/
Basically, it does what you want: a unique sort; and it is pretty efficient about it (usually, depending on whoever wrote your STL implementation).

If you are doing image processing of some sort, where the user will frequently change your grid, etc., then cheat.

Keep yourself a secondary matrix that has a preset, smallish grid that keeps this information. Then, when the user chooses his grid, which should typically be larger than the grid for your preset, you only need some collation. Depending on how accurate the user wants you to be with a specific block, you need only do some minor additional sampling around the edges.

Some thoughts.
B's are going to be in the 100~1000 range, I'm guessing.

The cheat is genius; I'll definitely be able to use it. Generally, only a small amount of blocks will be changed at each step, and it'll be sufficient to update those. The granularity of Block sizes isn't much a problem.

Now I'm torn, though. With so much reuse possible, I'm considering saving more information per block. Back to the drawing board!
> especially when the amount of colours per block is far lower than C.

Use a std::vector<short> per block to represent the colours in that block?

If several blocks may have identical colour values, use a flyweight of these vectors?
http://en.wikipedia.org/wiki/Flyweight_pattern

If several blocks may have identical partial colour values, use a flyweight of these commonly repeated partial vectors?
Last edited on
Seems a bit overly complicated and most likely there won't be much 'common' colour combinations in blocks. However, I'll keep it in mind in case empiric evidence suggest otherwise!
Topic archived. No new replies allowed.