speed optimalization for bitcount

I have a sparse bitcount version and wanted to make an optimalization in inline assembly.

(sparse bitcount):
1
2
3
4
5
6
7
8
9
10
int bitcount1(unsigned long long a){
 register unsigned int c=0;
 unsigned long long x = a;
 while(x) 
  {
  c++;
  x &=  x - 1;
  }
 return(c);
}


My version:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
__declspec(naked) int my_bitcount(unsigned long long a ){
 __asm {
  push ebp
  mov ebp, esp
  sub esp, __LOCAL_SIZE
  push ecx
  xor eax, eax
  mov edx, dword ptr 8[ebp]
  test edx, edx
  je SHORT L2
L1:
  mov ecx, edx
  sub edx, 1
  add eax, 1
  and edx, ecx
  jne SHORT L1
L2:
  mov edx, dword ptr 8[ebp + 4]
  test edx, edx
  je SHORT L4
L3:
  mov ecx, edx
  sub edx, 1
  add eax, 1
  and edx, ecx
  jne SHORT L3
L4:  
  pop ecx
  mov esp, ebp
  pop ebp
  ret
 }
}


Micrsofts verssion (Visual C++ 2008, full speed optimalization) returns this code:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
__declspec(naked) int bitcount(unsigned long long a){
 __asm {
 push esi
 push edi

 xor eax, eax

 mov ecx, 15
 mov edx, 240
 npad 2
L1:
 mov esi, ecx
 add eax, 1
 add esi, -1
 mov edi, edx
 adc edi, -1
 and ecx, esi
 and edx, edi
 mov esi, ecx
 or esi, edx
 jne SHORT L1

 pop edi
 pop esi
 ret 0
 }
}


if I make a speed test now (I call the function 1billion times in a loop) my bitcount is about 30% faster, but if I implent the function in my program (a tree search), the program is about 10% more slowly.
The hole rest of the program is written in C++.
How could I improve the speed of my bitcount function?
Last edited on
Topic archived. No new replies allowed.