Why is SIMD slower than a trivial loop?

Hi everyone.
It is very weird that my SIMD code runs slower than the scalar code.

What I want to do is compare a uint16_t value, differ, against a uint16_t array and collect the matching positions.
std::vector<uint32_t> col_count: the output vector of offsets at which the array value equals differ.
vector8_int32_MatchTable: an array used as a constant-time lookup table (it maps an 8-bit match mask to the offsets of its set bits).

(Maybe it is hard to understand all the code, but I think the slowdown comes from this structure.)
Did I make a silly mistake when trying to use SIMD to optimize the code instead of the scalar loop?


size_t TUPLES = 1 << 16;
col_count.reserve(TUPLES);
    #ifdef SIMD
      col_count.resize(TUPLES);
      size_t match_counter = 0;
      const uint32_t simdWidth = 16;
      const uint32_t numSimdIterations = dataLength / simdWidth;
      const __m256i comparisonValueVec = _mm256_set1_epi16(differ);
      uint32_t* writer = &col_count[0];
      for (uint32_t i = 0; i != numSimdIterations; i++) {
        uint32_t scanPos = i * simdWidth;
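        // note: _mm256_load_si256 requires (basePointer + scanPos * 2) to be
        // 32-byte aligned; _mm256_loadu_si256 would be the safe choice otherwise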
        __m256i attributeVec = _mm256_load_si256(reinterpret_cast<__m256i*>((basePointer + scanPos * 2)));
        __m256i selMask = _mm256_cmpeq_epi16(attributeVec, comparisonValueVec);
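        // _mm256_movemask_epi16 is not a standard AVX2 intrinsic; it is a helper
        // yielding one mask bit per 16-bit lane (see the sketch at the end of this post)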
        uint16_t bitMask = _mm256_movemask_epi16(selMask);

        uint16_t bitMask_copy = bitMask;
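        // count the set bits (matches) one at a time; a single
        // __builtin_popcount(bitMask) would do the same in one step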
        while (bitMask_copy) {
          match_counter += bitMask_copy & 1;
          bitMask_copy >>= 1;
        }

        // Look up the match positions for the low byte of the mask and append them:
        // each table entry apparently stores the lane offsets pre-shifted left by 8
        // in its 32-bit cells, with the match count in the low byte of cell[0]
        auto& matchEntry0 = vector8_int32_MatchTable[bitMask & 0xFF];
        __m256i scanPosVec0 = _mm256_set1_epi32(scanPos);
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(writer),
                            _mm256_add_epi32(scanPosVec0, _mm256_srai_epi32(matchEntry0.reg256, 8)));
        writer += static_cast<uint8_t>(matchEntry0.cell[0]);

        auto& matchEntry1 = vector8_int32_MatchTable[(bitMask >> 8) & 0xFF];
        __m256i scanPosVec1 = _mm256_set1_epi32(scanPos + 8);
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(writer),
                            _mm256_add_epi32(scanPosVec1, _mm256_srai_epi32(matchEntry1.reg256, 8)));
        writer += static_cast<uint8_t>(matchEntry1.cell[0]);
      }
      col_count.resize(match_counter);
      col_count.shrink_to_fit();
    #endif



    #ifdef SCALAR
    for (size_t in = 0; in < dataLength; in++) {
      if (*(basePointer + in) == differ) {
        col_count.push_back(in);
      }
    }
    #endif
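
Note: _mm256_movemask_epi16 is not a standard AVX2 intrinsic, so it has to be a helper defined elsewhere in the project. A minimal sketch of such a helper built from standard intrinsics (an illustration of the idea, not necessarily the original definition):

#include <immintrin.h>
#include <cstdint>

// One mask bit per 16-bit lane of m (m comes from _mm256_cmpeq_epi16,
// so every lane is either 0x0000 or 0xFFFF).
static inline uint16_t movemask_epi16(__m256i m) {
  // A saturating pack turns each 0x0000/0xFFFF word into a 0x00/0xFF byte.
  __m256i packed = _mm256_packs_epi16(m, _mm256_setzero_si256());
  // packs operates within each 128-bit half; gather both meaningful qwords into the low half.
  packed = _mm256_permute4x64_epi64(packed, _MM_SHUFFLE(3, 1, 2, 0));
  // movemask_epi8 collects the top bit of every byte; the low 16 bits are the mask.
  return static_cast<uint16_t>(_mm256_movemask_epi8(packed));
}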
col_count.shrink_to_fit();

that is BRUTAL in standard C++. Try taking it out and redoing the timing test (and remove any other shrink_to_fit() calls; I didn't see another, but I didn't dig).
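
To see why: shrink_to_fit() is only a non-binding request, and implementations typically honor it by allocating a smaller buffer and copying every surviving element over, i.e. an extra allocation plus an O(n) copy. A quick demonstration (sizes made up):

#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    std::vector<uint32_t> v(1 << 16);  // room for 65536 positions up front
    v.resize(100);                     // logical size shrinks, capacity stays at 65536
    std::printf("before: size=%zu capacity=%zu\n", v.size(), v.capacity());
    v.shrink_to_fit();                 // typically reallocates and copies the 100 elements
    std::printf("after:  size=%zu capacity=%zu\n", v.size(), v.capacity());
}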
Yes.
The reason is that when I start I don't know how many elements will match, so I resize to the maximum, 1 << 16.
Assuming the two loops produce the same result, shouldn't you at least call shrink_to_fit() in both SIMD and SCALAR to make the comparison fairer?

Are you sure the compiler isn't auto-vectorizing your SCALAR loop?
Hi Peter87. How can I find out whether the scalar loop was auto-vectorized?
I don't know, other than looking at the generated assembly.
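
Edit: with GCC specifically, you can also ask the auto-vectorizer (enabled by default at -O3) to report its decisions, for example:

g++ -O3 -mavx2 -fopt-info-vec-optimized -c datablockTest.cc   # loops it vectorized
g++ -O3 -mavx2 -fopt-info-vec-missed -c datablockTest.cc      # loops it could not vectorize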
1. There is too much fluff in the SIMD section (or so it seems).
2. You omit a whole bunch of setup code.
3. You don't seem to be comparing the same thing in both sections anyway.

If you want us to TEST your ideas, we need something like this.
#include <iostream>

int main() {
    int numMatches = 0;
    // whatever else is needed in terms of input data
#ifdef SIMD
    // whatever, but it only increments numMatches with details.
    // no messing about with pushing to vectors or resizing.
#else // SCALAR
    for (size_t in = 0 ; in < dataLength; in++) {
      if (*(basePointer + in) == differ) {
        numMatches++;
      }
    }
#endif
    cout << "The number of matches=" << numMatches << endl;
}
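
Then build the same file twice, once per variant (bench.cc is just a placeholder name):

g++ -O3 -mavx2 -DSIMD bench.cc -o bench_simd
g++ -O3 -mavx2 bench.cc -o bench_scalar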
Or, the common-sense approach: hand-count the operations.

The SIMD version does:
resize (new, delete, memcpy; an O(n) loop): three slow operations combined
assign a pointer
loop { multiply, call 3 functions of unknown complexity, ... etc. }

and so on; it quickly becomes obvious that it is doing vastly more work (and hitting memory all over creation: page faults, jumps, pipeline resets, etc.)

vs the scalar version:
loop { pointer access, addition, comparison, increment }: four very simple operations (compared to resize, shrink_to_fit, function calls, etc.), with linear memory access, and the only jump is the loop branch, which is probably well predicted or at least minimal in impact. (A quick way to check this empirically is sketched below.)
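
If hand-counting isn't convincing, a bare-bones timing harness along these lines (data size and values made up for the example) lets you compare the two loops directly; swap the inner loop for the SIMD body to time that variant:

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    const size_t dataLength = 1 << 16;
    std::vector<uint16_t> data(dataLength, 0);
    data[123] = 42;                        // plant one match
    const uint16_t differ = 42;

    auto start = std::chrono::steady_clock::now();
    size_t numMatches = 0;
    for (int rep = 0; rep < 10000; rep++)  // repeat so loop time dominates timer overhead
        for (size_t in = 0; in < dataLength; in++)
            numMatches += (data[in] == differ);
    auto stop = std::chrono::steady_clock::now();

    auto us = std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
    std::printf("matches=%zu time=%lld us\n", numMatches, (long long)us);
}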


The compiler is better at optimizing than you are. Just make sure you've set whatever options are needed to enable SIMD optimization.

Stick to the easy-to-code, easy-to-understand five-line scalar version. The thirty-odd lines of SIMD code that claim to do the same thing are hard to understand, hard to code, and prone to errors.

Don't do this sort of hand-optimization until the code is working and you're certain that you need to do it.
Thanks for all the discussion.
I use CMake with Ninja, and I found the following build rule containing the flags that are ultimately passed to GCC.
Would the scalar code be auto-vectorized with these flags?

build CMakeFiles/tester.dir/test/datablockTest.cc.o: CXX_COMPILER__tester ../test/datablockTest.cc || cmake_object_order_depends_target_tester
  DEP_FILE = CMakeFiles/tester.dir/test/datablockTest.cc.o.d
  FLAGS = -mavx2 -O3 -DNDEBUG   -pthread -std=gnu++1z
  INCLUDES = -I../include -isystem vendor/gtm/gtest/include -isystem vendor/gflags/include -isystem vendor/gtm/gmock/include
  OBJECT_DIR = CMakeFiles/tester.dir
  OBJECT_FILE_DIR = CMakeFiles/tester.dir/test


But when I try the code on Compiler Explorer:
https://gcc.godbolt.org/z/lW4QL0
the assembly generated for the scalar loop looks like plain scalar code (no vector instructions)...
Topic archived. No new replies allowed.