Hi everyone.
It is very weird that my SIMD code runs slower than the scalar code.
What I want to do is compare differ (a uint16_t) against every element of a uint16_t array.
std::vector<uint32_t> col_count: a vector of positions; it stores each offset at which the array value is equal to differ.
vector8_int32_MatchTable: a lookup array with constant-time access.
(Maybe it is hard to understand all the code, but I think the slowdown comes from the structure.)
Did I make a dumb mistake when I tried to use SIMD to optimize the code instead of the scalar loop?
1. There is too much fluff in the SIMD section (or so it seems).
2. You omit a whole bunch of setup code.
3. You don't seem to be comparing the same thing in both sections anyway.
If you want us to TEST your ideas, we need something like this.
int main ( ) {
    int numMatches = 0;
    // whatever else is needed in terms of input data
#ifdef SIMD
    // whatever, but it only increments numMatches with details.
    // no messing about with pushing to vectors or resizing.
#else // SCALAR
    for (size_t in = 0 ; in < dataLength; in++) {
        if (*(basePointer + in) == differ) {
            numMatches++;
        }
    }
#endif
    cout << "The number of matches=" << numMatches << endl;
}
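For what it's worth, here is a rough sketch of how the two branches of that skeleton could be filled in so that both do exactly the same job (count only, no vectors). The function names count_matches_scalar and count_matches_simd are made up for illustration, and the SIMD path assumes an x86 target with SSE2; on anything else it simply falls back to the scalar loop. It is not the original poster's code.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics (x86 only)
#endif

// Scalar reference: one compare and one conditional increment per element.
std::size_t count_matches_scalar(const std::uint16_t* data, std::size_t length,
                                 std::uint16_t differ) {
    assert(data != nullptr || length == 0);
    std::size_t count = 0;
    for (std::size_t i = 0; i < length; ++i) {
        if (data[i] == differ) ++count;
    }
    return count;
}

// Hypothetical SIMD counterpart: compares 8 uint16_t values per iteration.
std::size_t count_matches_simd(const std::uint16_t* data, std::size_t length,
                               std::uint16_t differ) {
    assert(data != nullptr || length == 0);
    std::size_t count = 0;
    std::size_t i = 0;
#if defined(__SSE2__)
    const __m128i needle = _mm_set1_epi16(static_cast<short>(differ));
    for (; i + 8 <= length; i += 8) {
        // Unaligned load, so the input needs no special alignment.
        const __m128i chunk =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        const __m128i equal = _mm_cmpeq_epi16(chunk, needle);
        // movemask yields one bit per byte, so each matching 16-bit lane sets two bits.
        const unsigned mask = static_cast<unsigned>(_mm_movemask_epi8(equal));
        count += std::bitset<16>(mask).count() / 2;
    }
#endif
    // Scalar tail (or the whole array when SSE2 is not available).
    for (; i < length; ++i) {
        if (data[i] == differ) ++count;
    }
    return count;
}
```

Timing these two against each other on the same input would be a like-for-like comparison, since neither touches a vector or allocates anything.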
Or, the common-sense approach: hand-count the operations.
resize (new, delete, memcpy (loop O(n))) 3 slow operations combined.
assign a pointer
loop
{multiply, call 3 functions of unknown complexity, … etc.}
and so on. It quickly becomes obvious that it is doing several orders of magnitude more work (and hitting memory all over creation, page faults, jumps and pipeline resets, etc.)
vs
loop
pointer access, addition, comparison, increment: a loop over 4 very simple operations (compared to resize, S2F, function calls, etc.) with linear memory access, where the only jump is the loop branch, which is probably avoided or at least has minimal impact.
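To make that count concrete, here is a minimal sketch of a scalar loop that still records the matching positions, as col_count is described as doing, but pays for the allocation only once. The function name collect_positions is hypothetical, and reserving room for the worst case (every element matches) is an assumption about what is acceptable memory-wise.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the offsets at which data[i] == differ.
std::vector<std::uint32_t> collect_positions(const std::uint16_t* data,
                                             std::size_t length,
                                             std::uint16_t differ) {
    assert(data != nullptr || length == 0);
    std::vector<std::uint32_t> positions;
    // One allocation up front, so push_back never has to grow the
    // buffer (no new/copy/delete cycle inside the loop).
    positions.reserve(length);
    for (std::size_t i = 0; i < length; ++i) {
        if (data[i] == differ) {
            positions.push_back(static_cast<std::uint32_t>(i));
        }
    }
    return positions;
}
```

Per element this is still just a load, a compare, and occasionally a store, which is the kind of loop the hand count above favors.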
The compiler is better at optimizing than you are. Just make sure you've set whatever options are needed to enable SIMD optimization.
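As a rough illustration of leaving it to the compiler: the whole count can be written with std::count, and GCC can be asked to report which loops it vectorized. The file name count.cpp and the function name count_matches are placeholders, and the build line in the comment is only one possible choice of flags, not something taken from the original project.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Possible GCC build line (count.cpp is a placeholder file name):
//   g++ -O3 -march=native -fopt-info-vec-optimized -c count.cpp
// -fopt-info-vec-optimized makes GCC print the loops it managed to vectorize.
std::size_t count_matches(const std::uint16_t* data, std::size_t length,
                          std::uint16_t differ) {
    assert(data != nullptr || length == 0);
    // A plain linear count; whether it is actually vectorized depends on
    // the compiler version and the flags in use.
    return static_cast<std::size_t>(std::count(data, data + length, differ));
}
```

If the report does not mention the loop, that is a hint to look at the optimization level and target flags before reaching for intrinsics.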
Stick to the easy-to-code, easy-to-understand 5 line version at line 43-47. The 33 lines of mess that claim to do the same thing are hard to understand, hard to code, and prone to errors.
Don't do this sort of hand-optimization until the code is working and you're certain that you need to do it.
Thanks for all the discussion.
I use CMake with Ninja and found the following, which contains the flags that are ultimately passed to the gcc compiler.
Would the scalar code be auto-vectorized with these flags?