Interestingly, here is what I get for the basic sum implementation:
Java: ~1470
gcc 4.7 with -O3: ~960
clang 3.1 with -O3: ~1350
Optimized C++ version:
gcc 4.7: ~1350
clang: ~1470
Also, if you instead use the std::partial_sum function, with gcc 4.7, you get ~1100, but it gets slower with clang …
Faster execution speeds (because it's fully compiled)
std::partial_sum as the actual default C++ approach.
Daniel Lemire wrote:
Of course, from a sample of 3 compilers on a single problem, I only provide an anecdote.
straight sum (C-like): 381.679
basic sum (C++-like): 413.223
iterator-based sum (C++-like): 413.223 <- had to fix this one
std::partial_sum: 409.836 <- added this one
...the "smart" sums were all much slower than Java
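For readers without the benchmark source at hand, here is a minimal sketch of what the "basic sum" and std::partial_sum variants might look like. The function names (prefix_sum_loop, prefix_sum_std, check_variants_agree) are hypothetical and not taken from the original benchmark; the sketch assumes an in-place prefix sum over a vector of int whose values are small enough that the sums do not overflow.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical "basic sum": index-based, in-place prefix sum.
// Each element becomes the sum of itself and all earlier elements.
void prefix_sum_loop(std::vector<int>& data) {
    for (std::size_t i = 1; i < data.size(); ++i) {
        data[i] += data[i - 1];
    }
}

// The same computation through the standard library.
// std::partial_sum allows the output range to start at the input range,
// so this also works in place.
void prefix_sum_std(std::vector<int>& data) {
    std::partial_sum(data.begin(), data.end(), data.begin());
}

// Sanity check (assumption: both variants must produce identical output).
void check_variants_agree(const std::vector<int>& input) {
    std::vector<int> a = input;
    std::vector<int> b = input;
    prefix_sum_loop(a);
    prefix_sum_std(b);
    assert(a == b);
}
```

Calling prefix_sum_loop on {1, 2, 3, 4} leaves {1, 3, 6, 10} in the vector. Timing either function in a loop over a large array is one way to reproduce a comparison of this kind, though the numbers will depend on compiler, flags, and hardware.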
0x00007f26b505fb01: mov 0x10(%rbx,%rbp,4),%r8d
0x00007f26b505fb06: add 0xc(%rbx,%rbp,4),%r8d
0x00007f26b505fb0b: mov %r8d,0x10(%rbx,%rbp,4)
0x00007f26b505fb10: movslq %ebp,%r11
0x00007f26b505fb13: add 0x14(%rbx,%r11,4),%r8d
0x00007f26b505fb18: mov %r8d,0x14(%rbx,%r11,4)
0x00007f26b505fb1d: add 0x18(%rbx,%r11,4),%r8d
0x00007f26b505fb22: mov %r8d,0x18(%rbx,%r11,4)
0x00007f26b505fb27: add 0x1c(%rbx,%r11,4),%r8d
0x00007f26b505fb2c: mov %r8d,0x1c(%rbx,%r11,4)
0x00007f26b505fb31: add 0x20(%rbx,%r11,4),%r8d
0x00007f26b505fb36: mov %r8d,0x20(%rbx,%r11,4)
0x00007f26b505fb3b: add 0x24(%rbx,%r11,4),%r8d
0x00007f26b505fb40: mov %r8d,0x24(%rbx,%r11,4)
0x00007f26b505fb45: add 0x28(%rbx,%r11,4),%r8d
0x00007f26b505fb4a: mov %r8d,0x28(%rbx,%r11,4)
0x00007f26b505fb4f: add %r8d,0x2c(%rbx,%r11,4)
0x00007f26b505fb54: add $0x8,%ebp
0x00007f26b505fb57: cmp %r10d,%ebp
0x00007f26b505fb5a: jl 0x00007f26b505fb01
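The listing above appears to keep a running sum in one register (%r8d) and to handle eight elements per loop iteration (add $0x8,%ebp). As a rough illustration only, a hand-unrolled C++ loop with the same shape could look like the sketch below. The name prefix_sum_unrolled8 is hypothetical, and this is a reading of the assembly rather than code from the original benchmark; it assumes the sums fit in an int.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: in-place prefix sum, unrolled by a factor of 8.
// The running sum stays in a local variable, mirroring the register use
// that the assembly listing suggests.
void prefix_sum_unrolled8(std::vector<int>& data) {
    const std::size_t n = data.size();
    if (n < 2) {
        return;  // nothing to accumulate
    }
    std::size_t i = 1;
    // Main loop: eight elements per iteration.
    for (; i + 8 <= n; i += 8) {
        int running = data[i - 1];
        running += data[i];     data[i]     = running;
        running += data[i + 1]; data[i + 1] = running;
        running += data[i + 2]; data[i + 2] = running;
        running += data[i + 3]; data[i + 3] = running;
        running += data[i + 4]; data[i + 4] = running;
        running += data[i + 5]; data[i + 5] = running;
        running += data[i + 6]; data[i + 6] = running;
        running += data[i + 7]; data[i + 7] = running;
    }
    // Remainder loop: leftover elements, one at a time.
    for (; i < n; ++i) {
        data[i] += data[i - 1];
    }
}
```

Whether writing this by hand helps is not obvious: a compiler may already unroll the simple loop on its own, and each step still depends on the previous sum, so unrolling mainly reduces loop overhead.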
-O3
without -funroll-loops
- this test is indeed simple enough for JIT to be competitive (ignoring startup, etc)