How can I speed up this very simple loop?

I've never really paid much attention to low-level optimization before, but now I have a program that spends 90% of its time in this loop, running it many billions of times. The Visual Studio profiler reports that within this loop, most of the time is spent on the < and == operations.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
for(k = 0; k < M; ++k)
{
   if(dm[i][k] < dm[j][k])
   {
      ++less;
   }
   else
   {
      if(dm[i][k] == dm[j][k])
      {
         ++equal;
      }
      else
      {
         ++more;
      }
   }
}


dm is a double array with dimensions on the order of [10000][10] and the three variables that are incremented are just ints. It doesn't get much simpler than that, but does anyone know if or how I could speed this up? Inline assembly? Some sort of fancy bit-level manipulation? Thanks for any help~
The first thing you can do is drop the test for equality (at least as you've defined it) since it isn't really possible to compare two computed doubles for equality. That reduces the code to:

1
2
3
4
5
6
7
for( int k = 0; k < M; ++k )
{
    if( dm[i][k] < dm[j][k] )
        ++less;
    else
        ++more;
}


which eliminates one of the comparisons.

Secondly, are you using the FPU or are you emulating?

Thirdly, the reason why you're spending all your time there is because you are essentially constantly hitting RAM. Access of less/equal/more can be in the processor cache.
Great, thanks for the quick reply. I have some questions though:

1) I think I know what you mean, but I do have lots of cases like if(1.234e9 == 1.234e9) because the exact same calculations can certainly be repeated. Is it true even in that case that two doubles will not be equal?

2) I'm not using a virtual machine or anything like that. Is that what you mean? Is there a chance that it wouldn't be using the FPU if it's just compiled and run directly in Windows Server 2003?

3) Again I think I know what you mean, but that's pushing the limits of my computer architecture knowledge. I know about register variables but I wasn't aware of other ways to force data to live closer to the CPU. Are you saying that there's some way to force the array values involved in the comparisons to live someplace other than RAM?
1. What I'm saying is this:

1
2
3
4
5
6
int main() {
    double d1 = 3.14;
    double d2 = 3.14;
    assert( d1 == d2 );   // This is OK because neither d1 nor d2 were computed
    double d3 = d2 - 1.0;
    assert( d3 + 1.0 == d1 );  // This is not necessarily OK because there is a computation involved 


2. I think some compilers can provide software emulation of floating point instead of using the FPU. It would probably be a special option. This may be dated nowadays since all CPUs have FPUs. It didn't used to be that way.

3. There are L1 and L2 caches, each of which sit between the CPU and RAM. L2 cache can be accessed by the processor much faster than RAM, and L1 cache can be accessed even faster still. L1 cache is very small; L2 cache is generally a couple MB. less, more, and equal can all be stored in one of the caches. And they are "hit" frequently, meaning they will stay in the cache. But your arrays are 100000 doubles each, which is 800K each. They don't fit so well in the cache. And each time through the loop, you are hitting different elements in those arrays, meaning that it is possible that there are a lot of cache misses, meaning the values have to be retrieved from RAM. There's a tool on linux called cachegrind that will tell you how frequently you are hitting and missing the cache. I'm not a windows expert so I don't know offhand if such a tool exists there.

You might consider generating the assembler output for the loop to see how well the compiler is optimizing it. (You are using optimizations, right?)

Also, M is a constant, right? Ie, not a function.
OK, thanks again for your reply. You've raised some issues that I definitely need to think about more. As far as point 3, I experimented with setting some of the array values to constants:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
for(k = 0; k < M; ++k)
{
   if(0.5 < dm[j][k])
   {
      ++less;
   }
   else
   {
      if(0.5 == dm[j][k])
      {
         ++equal;
      }
      else
      {
         ++more;
      }
   }
}


and there seems to be no change in the speed. Given this and the fact that I have all the optimization options turned on, do you think it's safe to conclude that MSVC is probably optimizing everything more or less as much as it can possibly be optimized?
Maybe unrolling the loop could help a bit.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
#include <iostream>
#include <ctime>

void loop(int * arr1, int * arr2)
{
    for (int i=0; i<5000; i++)
    {
        arr1[i]=arr2[i];
    }
}

void unrolled_loop(int * arr1, int * arr2)
{
    for (int i=0; i<5000; i+=5)
    {
        arr1[i]=arr2[i];
        arr1[i+1]=arr2[i+1];
        arr1[i+2]=arr2[i+2];
        arr1[i+3]=arr2[i+3];
        arr1[i+4]=arr2[i+4];
    }
}

int main()
{
    int arr1[5000];
    int arr2[5000];

    int start;
    int stop;

    start=clock();
    for (int i=0; i<50000; i++) loop(arr1,arr2);
    stop=clock();

    std::cout << stop-start << std::endl;

    start=clock();
    for (int i=0; i<50000; i++) unrolled_loop(arr1,arr2);
    stop=clock();

    std::cout << stop-start << std::endl;

    return 0;
}

I compile with g++ using -O3 and get
156 (sometimes 171) ms for the first function and
125 (sometimes 110) ms for the second one.

I thought about calling the loop with pointers to the [i] and [j] indexes,
to see if it would help. Sometimes it is faster, however, I've seen times be equal or a little slower.

Your test for 0.5 may be better if you took the absolute of i - j giving yourself a +/- 0.5 range to be considered equal.
Since using floating point you will need to use fabs (or is using abs ok in C++?).
I also tried a dabs which works out to be about the same speed as fabs.

edit: Just realized your equal weight test must be done first, then check for lesser or greater, otherwise you may be missing some lesser values that fall within the equal weight range.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
#include <iostream>
#include <ctime>
#include <cstdlib>
#include <cmath>
using namespace std;

double dabs(const double v)
{
    return ( v < 0 ? v*-1.0 : v );
}


void loop1(const int M, const int i,const int j,double const dm[][1000] ) 
{
    int less = 0, more  = 0, equal = 0;
    for (int k=0; k<M; k++)
    {
        //cout << "dm: " << j << " "<< k << " " << dm[j][k] << endl;
        if ( fabs( dm[i][k] - dm[j][k] ) <= 0.5 ) 
            equal++;
        else if ( dm[i][k] < dm[j][k] )
            less++;
        else
            more++;
    }
    cout << "Less: " << less 
         << " More: " << more
         << " Equal: " << equal
         << endl;

}

void loop2(const int M, const double * arr1, const double * arr2)
{
    int less = 0, more  = 0, equal = 0;
    for (int k=0; k<M; k++)
    {
        //cout << "a1: " << arr1[k] << " a2:"<< arr2[k] << endl;
        if ( fabs( arr1[k] - arr2[k] ) <= 0.5 ) 
            equal++;
        else if ( arr1[k] < arr2[k] ) 
            less++;
        else
            more++;
    }
    cout << "Less: " << less 
         << " More: " << more
         << " Equal: " << equal
         << endl;

}



int main()
{
    double dm[1000][1000];
    int start;
    int stop;
    int Max = 500;

    srand((unsigned)time(0)); // set seed
    for (int i=0;i<1000;i++)
        for (int j=0;j<1000;j++)
            dm[i][j] = 1000 * rand() / (RAND_MAX + 1.0);

    int iseq[10000];
    int jseq[10000];
    for (int x=0;x<10000;x++)
    {
        iseq[x] = int(rand()%1000);
        jseq[x] = int(rand()%1000);
    }

    start = clock();
    for (int x=0;x<10000;x++)
    {
        loop1(Max,iseq[x],jseq[x],dm);
    }
    stop = clock();
    int total_time1 = stop - start;


    start = clock();
    for (int x=0;x<10000;x++)
    {
        loop2(Max,dm[iseq[x]],dm[jseq[x]]);
    }
    stop = clock();
    int total_time2 = stop - start;

    cout << "Time Loop1: " << total_time1 << endl;
    cout << "Time Loop2: " << total_time2 << endl;


    return 0;
}



Sample output:


Less: 270 More: 123 Equal: 107
Less: 232 More: 150 Equal: 118
Less: 244 More: 154 Equal: 102
Time Loop1: 60000
Time Loop2: 70000

Less: 239 More: 140 Equal: 121
Less: 261 More: 125 Equal: 114
Less: 232 More: 150 Equal: 118
Less: 254 More: 147 Equal: 99
Time Loop1: 70000
Time Loop2: 60000



Rich
Last edited on
Topic archived. No new replies allowed.