Library with high performance low-length integers (e.g. int4_t, int3_t)

Are there any libraries that support low-range integers and have high performance?

I am particularly interested in being able to easily create packed low-length int arrays like this (to achieve high cache efficiency):

int4_t some_counters[99];
int3_t* some_states = new int3_t_array(some_variable_size);


I could find libraries for high-precision and arbitrary-precision integers, but none for low precision.

P.S. I was not sure if this should have gone in the "Beginners" forum instead.

Note: the original title of this thread was: "Library with high performance low precision integers (e.g. int4_t, int3_t)"
Last edited on
I don't think "low precision" is the right term. The integers are precise, the range is just limited.
@Peter87 you are right, I updated the title of the thread. Thanks!
Note: Boost does use the term precision for their integers
Last edited on
I'm not aware of any such libraries but they wouldn't be hard to write. You might start with a BCD library.

I am particularly interested in being able to easily create packed low precision arrays like this (to achieve high cache efficiency):

Beware of premature optimization. I'd code it with chars first.
Last edited on
> to achieve high cache efficiency
I think you're just going to trade one inefficiency for another.

Let's take your int3_t as an example, and suppose you want to be super memory efficient by only using bytes as the storage array internally. Here, 'a' to 'p' represent 16 int3_t's packed into 6 bytes (or 48 bits).
+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+
|a|a|a|b|b|b|c|c| |c|d|d|d|e|e|e|f| 
+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+
|f|f|g|g|g|h|h|h| |i|i|i|j|j|j|k|k| 
+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+
|k|l|l|l|m|m|m|n| |n|n|o|o|o|p|p|p| 
+-+-+-+-+-+-+-+-+ +-+-+-+-+-+-+-+-+

Now reading a or b is relatively simple: it's just read, shift, and mask.
But reading c is a mess. You have to read, shift, and mask from two bytes, and then combine the two parts.

As for writing, you first have to read and mask out what's already there, then shift and mask the value you want to write, before combining the two and writing back out to memory.
And you have to do all that TWICE for the c,f,k,n values in the diagram above.

Or perhaps you go for something slightly less memory efficient, say packing 10 int3_t's into a 32-bit word (or 21 of them into a 64-bit word if you have a 64-bit architecture). Sure you waste a bit or two, but at least you only ever have one Read-Modify-Write cycle to update any element.

For data that is aligned with how memory and processors are organised, each read and each write is achieved in a single bus transaction.

Salem c
Yeah that is what I will be doing, I was just hoping there were ready libraries, instead of me having to write (and test!) new ones


Or perhaps you go for something slightly less memory efficient, say packing 10 int3_t's into a 32-bit word (or 21 of them into a 64-bit word if you have a 64-bit architecture). Sure you waste a bit or two, but at least you only ever have one Read-Modify-Write cycle to update any element.


If you want to be cache-access efficient you have to match the cache line length of your microarchitecture (64bit in my case of Intel Skylake).
Maybe something like this would be good enough.
I avoided trying to pack values across element boundaries.

#include <cstddef>   // size_t
#include <cstdint>   // uint8_t, uint64_t
#include <iostream>
#include <vector>

// ElementType should be an unsigned type.
template<typename ElementType, unsigned BitsPerValue>
class smalluint_array
{
    // Assuming 8 bits per char.
    static constexpr unsigned ValuesPerElement {sizeof(ElementType) * 8 / BitsPerValue};
    static constexpr ElementType Mask {(ElementType(1) << BitsPerValue) - 1};

    std::vector<ElementType> v;

public:
    smalluint_array(size_t size)
        : v((size + (ValuesPerElement-1)) / ValuesPerElement)
    { }

    unsigned get(size_t i)
    {
        return (v[i/ValuesPerElement] >> (i%ValuesPerElement * BitsPerValue)) & Mask;
    }
    
    void set(size_t i, unsigned n)
    {
        unsigned x = i % ValuesPerElement * BitsPerValue;
        v[i/ValuesPerElement] &= ElementType(-1) ^ (Mask << x);
        v[i/ValuesPerElement] |= (n & Mask) << x;
    }
};

int main()
{
    using std::cout;
    using uint4_array = smalluint_array<uint8_t,  4>;
    using uint3_array = smalluint_array<uint64_t, 3>;
/*
For uint3_array: using uint8_t              is 4.00 bits per value;
                 using uint16_t or uint32_t is 3.20 bits per value;
                 using uint64_t             is 3.05 bits per value.
*/

    uint4_array a(100);
    for (unsigned n: {5, 1, 4, 6, 11, 15, 3, 8})
        a.set(n, n);
    for (unsigned n: {5, 1, 4, 6, 11, 15, 3, 8})
        cout << a.get(n) << ' ';
    cout << '\n';

    size_t i = 0;

    uint3_array b(100);
    for (unsigned n: {5, 1, 4, 6, 3, 2, 7})
        b.set(i++, n);
    for (i = 0; i < 7; ++i)
        cout << b.get(i) << ' ';
    cout << '\n';


    for (unsigned n: {0, 1, 2, 3, 4, 5, 6})
        b.set(n, n);
    for (i = 0; i < 7; ++i)
        cout << b.get(i) << ' ';
    cout << '\n';
}

Last edited on
@dutch
when you set values, don't you need to clear the already existing element first?
If you just OR it and the existing value is '111', for example, the set does nothing.
Good point. I guess I should've tested it a little more. :-)
I've updated the post above.
If you want to be cache access efficient you have to match the cache line length of your microarchitecture (64bit in my case of intel skylake)
64 bytes.

Doing bit-packing means picking the "space" end of the time-space trade-off scale. I see little point in attempting to cache-optimize this, especially for particular microarchitectures.
Last edited on
You can still use a bit-field from C, or std::bitset from C++.
I do not know that this is any better than what was proposed, though it may be a little easier to read.
https://www.geeksforgeeks.org/bit-fields-c/
Last edited on
The important thing is to actually benchmark the performance of your code. A/B test with different implementations of how you want your data stored/accessed. It could very well be that doing all the fine-grained bitwise operations is actually harder for the processor to reason about than just giving it normal ints. You could still store the data in this "compressed" format, but it might turn out that it's better for it to be uncompressed when the data is actually being used.

I'm not making a determination/suggestion of which way is better, but I am saying that if you actually cared about performance, then you would measure and experiment with different methods, and never assume that method A is better than method B without evidence.
Last edited on
jonnin wrote:
you can still use a bitfield from C, or bitset from c++.

How do you store bitfields in arrays or extract multiple bits as a number from std::bitset?
How do you store bitfields in arrays
For example,
struct Foo{
    int bits0: 3;
    int bits1: 3;
    int bits2: 3;
    int bits3: 3;
    int bits4: 3;
    int bits5: 3;
    int bits6: 3;
    int bits7: 3;
    int bits8: 3;
    int bits9: 3;
}; //sizeof(Foo) == 4 
You may have to give up on maximum bit-packing efficiency, though. The compiler may not lay out a bit field member that crosses a machine word boundary. For example, for me a struct with 21 members has a size of 12, even though 21*3 == 63.
Last edited on
Topic archived. No new replies allowed.