Fastest Pocount Implementation

Hi guys,
I was wondering which way was the fastest for counting the bits set in a 64-bit long long int
I have been developing a chess engine lately and for performance reasons I need an extremely fast popcount method. I know it has been discussed endlessly here and on StackExchange, but I was unable to find a conclusive answer. Right now I am using:
__builtin_popcountll(long)
I am on Windows and using MinGW with Dev-C++.

Thank you!
You can use a bitset, it probably wont be faster than the GCC builtin but at least it wont be compiler specific.

1
2
3
4
5
6
7
8
9
#include <iostream>
#include <bitset>
 
using namespace std;
 
int main() {
	bitset<64> b(0xff00ff00ff00ff00);
	cout << b.count() << endl;
}
Use std::bitset<>. Enable hardware support for the pop count instruction (-march=). Measure.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
#include <bitset>
#include <type_traits>
#include <limits>
#include <iostream>
#include <ctime>
#include <iomanip>

template < typename T > std::size_t popcnt( const T& v )
{
    static_assert( std::is_integral<T>::value && std::is_unsigned<T>::value, "an unsigned integral type expected" ) ;
    return std::bitset< std::numeric_limits<T>::digits >(v).count() ;
}

int main()
{
    const int NVALUES =  1'000'000'000 ;
    unsigned int r = 0 ;
    
    const auto start = std::clock() ;
    for( unsigned long long n = 0 ; n < NVALUES ; ++n ) r += popcnt(n) ;
    const auto end = std::clock() ;
    
    const unsigned long long popcnts_per_sec = NVALUES / double(end-start) * CLOCKS_PER_SEC ;
    
    std::cout << "about " << std::fixed << std::setprecision(0) << popcnts_per_sec << " popcnts per second, "
              << std::setprecision(3) << 1'000'000'000.0 / popcnts_per_sec << " nanoseconds per popcnt\n\n" ;
    return r ;
}

[    0.000000] Detected 2992.516 MHz processor.

LLVM
------
with hardware support for popcnt
about 684931506 popcnts per second, 1.460 nanoseconds per popcnt

without hardware support for popcnt
about 306748466 popcnts per second, 3.260 nanoseconds per popcnt


GNU
------
with hardware support for popcnt
about 657894736 popcnts per second, 1.520 nanoseconds per popcnt

without hardware support for popcnt
about 207900207 popcnts per second, 4.810 nanoseconds per popcnt

http://coliru.stacked-crooked.com/a/ccd72d3fcf4cea41
Why would a bitset be faster than a native implementation? I think it should be slower.
Your thoughts?
Why would a bitset be faster than a native implementation?

std::bitset<>::count() wouldn't be any faster; it would merely be as fast as the GNU intrinsic. (With SSE 4.2, both would generate the single instruction popcntq; https://en.wikipedia.org/wiki/SSE4.2#POPCNT_and_LZCNT).

However, code using std::bitset<> would be portable code.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
#include <bitset>
#include <type_traits>
#include <limits>

template < typename T > std::size_t popcnt( const T& v )
{
    static_assert( std::is_integral<T>::value && std::is_unsigned<T>::value, "an unsigned integral type expected" ) ;
    return std::bitset< std::numeric_limits<T>::digits >(v).count() ;
}

std::size_t foobar( unsigned long long v ) 
{ 
    return popcnt(v) ; 
    
    // -O3 -march=native
    /*
        popcntq	%rdi, %rax
    	retq
    */
}

http://coliru.stacked-crooked.com/a/4bb89bc999bcf6ba
I see. I'm on Windows and right now I have no intention to port the engine to other operating systems. If I was, I would have to use your suggestion.
Thank you!
Topic archived. No new replies allowed.