The use of unsigned integers has many problems but if we use signed integers for our indices and want to check that the index is in bounds we need to add another check to see if it is not negative. I don't mind the extra code but I would have hoped the compiler to be able to optimize this but apperantly that is not the case.
1 2 3 4 5 6 7 8 9 10 11 12 13 14
bool in_bounds(int a, int b)
{
return a >= 0 && a < b;
}
bool in_bounds(unsigned a, unsigned b)
{
return a < b;
}
in_bounds(int, int):
mov eax, edi
not eax
shr eax, 31
cmp edi, esi
mov edx, eax
setl al
and eax, edx
ret
in_bounds(unsigned int, unsigned int):
cmp edi, esi
setb al
ret
The problem seems to be that b could be negative. If b was known to be positive it should be safe to treat a and b as unsigned and do a single comparision.
1 2 3 4 5 6 7
bool in_bounds(int a, int b)
{
if (b >= 0)
{
return a >= 0 && a < b;
}
}
The above assembly output is for GCC which seems to understand this. Clang doesn't.
Unfortunatly it seems to be very reluctant to do this optimization. If I make the b < 0 branch not UB it doesn't do this optimization even if b >= 0 is guaranteed to be true in the branch that does the comparison.
1 2 3 4 5 6 7 8 9 10 11 12 13 14
bool in_bounds(int a, int b)
{
if (b >= 0)
{
return a >= 0 && a < b;
}
returnfalse;
}
in_bounds(int, int):
xor eax, eax
test esi, esi
jns .L5
ret
.L5:
mov eax, edi
not eax
shr eax, 31
cmp esi, edi
mov edx, eax
setg al
and eax, edx
ret
Assuming I have a good reason for wanting the code to be as efficient as possible, what is the best way of writing this? I don't really like the way that I wrote it above because it might give a compiler warning and it doesn't work on all compilers. Using assert and casting to unsigned is probably better because it makes the intention clear and errors easier to find.
1 2 3 4 5
bool in_bounds(int a, int b)
{
assert(b >= 0);
returnstatic_cast<unsigned>(a) < static_cast<unsigned>(b);
}
I think this is one of the things that the contracts proposal aims to fix, but then compilers need to be more brave and allow themselves to optimize based on the preconditions, otherwise you'll still have to do hacks like this.
Does this checking need to be done in normal operation or only in development?
The nice features about assert for checks are that (a) (as you say), it makes intention obvious; and (b) it can simply be turned off when required by putting #define NDEBUG
in a header.
All mainstream fortran compilers have a (non-default) operation to check array bounds, but my experience there is that it significantly slows down code execution. I can't find comparable options for the c/c++ compilers that I use (though that may be my lack of knowledge).
I'm curious why b could be negative when checking bounds... if b is representative of the upper bound, wouldn't it usually be sourced from a const that you have full control over, or from a something.size() ? You might even be able to play these meta-games further and conclude b is never 0.
Does this checking need to be done in normal operation or only in development?
That's a good point. Most cases were I would write such bounds checking would be in assert macros where performance isn't important.
The nice features about assert for checks are that (a) (as you say), it makes intention obvious; and (b) it can simply be turned off when required by putting
#define NDEBUG
in a header.
Yes, but unfortunately it doesn't allow the compiler to assume the assertion always holds.
All mainstream fortran compilers have a (non-default) operation to check array bounds, but my experience there is that it significantly slows down code execution.
It's not like I want every access to be checked. But when I do, it would be nice if the code for signed indices was as efficient as for unsigned indices. An argument for using signed integers that I have heard quite a few times is that it can make the code faster because the compiler can assume it doesn't overflow/wrap. I don't think it's the most important argument but it still ...
Anyhow, I would love if the standard library transitioned into using signed indices, perhaps in std2. It would make C++ a much simpler language to learn and teach.
icy1 wrote:
I'm curious why b could be negative when checking bounds...
That's the point. I know that b can't be negative, but the compiler doesn't, and therefore it generates code that is not as fast as it could have been.
That's the point. I know that b can't be negative, but the compiler doesn't, and therefore it generates code that is not as fast as it could have been.
What I mean here is a sort of hybrid -- if you're always passing in size_t for b, for example, and never doing something like double indices from each side of the array:
1 2 3 4
bool in_bounds3(int a, unsigned b)
{
returnstatic_cast<unsigned>(a) < b;
}
return (b-a) < 0; //if a > b, this is false, if a = b, this is false, is this cheaper?
Yes, it generates less assembly instructions. With Clang it's actually identical to my unsigned version.
icy1 wrote:
What I mean here is a sort of hybrid -- if you're always passing in size_t for b, for example, and never doing something like double indices from each side of the array
I realize that I have probably been worrying too much about this "issue". Bounds checking like this is not something you would normally want to do very often outside debug checks anyway. If you need it in time critical code you should probably rewrite it in such a way that it isn't needed. Most of the time it wouldn't be worth thinking about, but for something as widely used as std::vector, if they started to use signed indices, I would expect the at() function to use casts or something internally to make it no slower than for unsigned indices.
signed index is really useful now and then. consider... lets say you had a small integer, say +- 127, but you had 10 to the 1000 of the things to sort.
hmm.
long long sorted [300]; //bigger than needed, feeling lazy today
long long * magic = &sorted[150];
for(all the data)
magic[datavalue]++; //signed bucket sort. magic[-100] is ok!
This used to work. I haven't had to do it in a while, but I suspect it sets off all kinds of warnings today, but hopefully it does still work? IMHO all array types (including vector) should support this kind of thing -- its not handy often, but when its what you need, it should be there for you.
Not really a fan of automatic bounds checking though. Even if its 1 clock cycle, its a waste of time for most code. You need it in just a few places usually, user input hack defenses / buffer overflow protection as one good use of it.