Which one is faster?

-1 rollie (thus negating his -1, making it a +1)

Perhaps another statement from that chapter of C++ Coding Standards:

"It is far, far easier to make a correct program fast than it is to make a fast program correct."

In some systems/applications speed is paramount, but those are relatively few in number compared to those
in which speed is not important. And even in those applications where speed is required, not every algorithm need
be as fast as possible. It makes very little difference what sorting algorithm you use to sort an array of 10 elements
if you only do it once.

However, correctness, simplicity, and clarity are always a priority, because they are relevant to all programs.

Spending extra brain cells trying to optimize an algorithm that would run perfectly fine unoptimized is just a
waste of time--time that could have been spent solving other things that are actually problems.

Regarding the actual algorithms provided, +1 moorecm. Because std::fill_n is templatized, it can be specialized
for various data types in order to provide optimized solutions.
@rollie,
Are jumps really that slow? Anyway, this isn't really the same function, since it won't work when n % 4 != 0. Also, line 2 is probably not what you were thinking of.

I think the fastest possible way to do this would be
void zeroArray(int* arr, int n){
    __asm{
        mov ecx, dword ptr[n]     ; loop counter: number of ints
        mov ebx, dword ptr[arr]   ; pointer to the array
        zero_loop:
            mov dword ptr[ebx], 0 ; zero one int
            add ebx, 4            ; advance by sizeof(int)
        loop zero_loop            ; dec ecx, jump while nonzero
    }
}
IMO no hand-written optimization can be better than a compiler optimization.
I've heard that too, but I don't believe it.
Why? What is the i++ for? Or ++i? But what about i = i + n?
@Bazzy,
I disagree. I just think you should profile your code before micro-optimizing it, because you're more likely to slow it down if you don't check where your bottlenecks really are first.

@hamsterman,
I think inline asm is to be avoided when possible. If you need to use asm in a program (e.g. you're writing a bootloader or operating system) you can put it in a separate function, put a forward declaration in a header file, and then call it separately.

Example:
main.c
#include <stdio.h>

extern int multiply(int a, int b);

int main()
{
        printf("2 * 2 = %d", multiply(2, 2));
        return 0;
}


multiply.asm
multiply.asm
multiply:
    mov eax, [esp+4]    ; first argument (cdecl: arguments sit above the return address)
    imul eax, [esp+8]   ; eax = a * b; the result is returned in eax
    ret
@hamsterman
The loop instruction is very slow, so you'll find this method is not very fast in comparison.
One should use the most obvious method - this is either fill_n or #1. The simpler the code, the higher the chance the compiler can transform it to something more adequate. For SSE targets, something "more adequate" is an SSE-optimized loop, which can be assumed to be equivalent to the memset implementation on such a target.

Like Wolf said, the second one doesn't offset the array every time so it is technically faster...but the difference here is soooooooooooo minimal anyway it'd be stupid to say it's 'omg micro-optimized code'.

There is no difference - at least on x86 targets, since they support addressing such as [eax+ebx*4+...].
closed account (z05DSL3A)
So who is going to be bothered with outputting the asm to compare the two?
So who is going to be bothered with outputting the asm to compare the two?

Sure. Compiled with gcc 4.4.1 -O3 and I extracted the core loops that do the actual work:
moorecm (fill_n):
.L5:
	movl	$0, (%edx)
	addl	$4, %edx
	subl	$1, %eax
	jne	.L5
Result: 2.00 clocks/int
	
xorebxebx #1:
.L10:
	movl	$0, (%ecx,%eax,4)
	addl	$1, %eax
	cmpl	%edx, %eax
	jne	.L10
Result: 2.00 clocks/int
	
xorebxebx #2:
.L17:
	movl	$0, (%eax)
	addl	$4, %eax
	cmpl	%eax, %edx
	ja	.L17
Result: 2.00 clocks/int
	
R0mai:
.L24:
	subl	$1, %edx
	movl	$0, (%eax)
	subl	$4, %eax
	testl	%edx, %edx
	jne	.L24
Result: 2.99 clocks/int (2.03-2.04 clocks in some cases)

bartoli (memset):
    call    memset
Result: 1.27 clocks/int


Note: -funroll-loops improves performance to 1.37 clocks/int for the first three methods and 1.40 clocks/int for the fourth method.

Now compiled with -O3 -march=native (on a Phenom II X4 945) (same order as above):

1.27 clocks/int
1.27 clocks/int
1.27 clocks/int
2.99 clocks/int
1.27 clocks/int

A representative extract from the first three methods:
	pxor	%xmm0, %xmm0
        [...]
.L22:
	movl	%edx, %eax
	incl	%edx
	sall	$4, %eax
	cmpl	%ecx, %edx
	movaps	%xmm0, (%edi,%eax)
	jb	.L22


Each test was performed 10000 times in a row on an int array with 2^18 elements.
closed account (EzwRko23)
Thank you very much Athar. Good job. :)

The result that R0mai's code is the slowest is quite surprising.

I think GCC performs this optimization by recognizing common patterns in the code - the first two loops are very typical ones. The last time I checked, compilers also had problems optimizing the pointer-based loop - gcc 4.2, at least, could not unroll it and use SSE. Things have improved a lot since, but the reverse while loop still needs more care.

The loop instruction is very slow
Are you saying it's slower than
sub ecx, 1
jne some_label
? I very much doubt that.

SSE-optimized loop
What do you mean? I don't think SSE has anything specifically for loops (I don't know much about it, though). Or do you mean unrolling?

One should use the most obvious method
I absolutely agree. My point was that if someone is feeling paranoid about optimisations (and I'm not saying one should be), it would be logical to use assembly (for such trivial tasks at least).
If you want to see interesting assembler, compile the "memset()" code using g++ with "-O3 -march=k8", using -S to get the assembler output.
closed account (EzwRko23)

What do you mean? I don't think SSE has something for loops (I don't know much about it though). Or do you mean unrolling?


 
movaps	%xmm0, (%edi,%eax)


Loop unrolling + SSE enables processing more than one item per CPU cycle. This is called loop vectorization. To give the compiler enough room, you should always write the most obvious code possible.

It is a bad idea to write worse code in the hope that the compiler's optimizer will fix it.

My answer: the second code block is not slower than the first in most cases.

When I use optimization, it may be size optimization, and then the compiler may not try to optimize these code blocks for speed.

Look at the *array++ = 0; construction: don't forget about different processor architectures - e.g. AVR has a "Store Indirect and Post-Increment" instruction, and the compiler will use it in this case. array[i] = 0; will add the value of i on every iteration.

Among loops, a do-while loop is faster because there is no need to test the condition before the loop body.

As for hand-written optimization: it depends on the skill of the programmer and the particular case.

This code is simpler:
void zeroArray(int* array, int n) {
  while (n--)
    *array++ = 0;
}

@xorebxebx,
Thanks. It would probably be good for me to learn some SSE stuff..
@boolivar,
Something to learn from this thread is that compilers are smarter than you think.
Using SIMD instruction sets can provide a massive optimization; of course, there's no remedy for code that's bad anyway. As an example, I know a 2D game that uses SSE3 and a bunch of different optimization flags on the GCC command line, and it runs at about 50 FPS on my computer. But as soon as you start drawing particles, the game slows down massively (from ~50 FPS to 27 FPS) because the code for the particle simulator is horrible: nested loops everywhere; it's illegible.

I was going to fix it, but I can't be bothered to fix all the code. The original author put it all in one file, too, so as soon as you start moving functions into more appropriate places you break all kinds of dependencies. I wanted to make it somewhat modular and structured. The way I usually write programs is that there is one file - main.c - containing the main() function, and then a bunch of separate, isolated modules that don't even know about each other; the main() function just uses those as it needs to. The MINIX kernel is an excellent example of a very well structured program. But the game code is full of horrible, circular dependencies. I'd rewrite it from scratch, but I don't know that I'd be able to do the particle physics.
hamsterman wrote:
Are you saying its slower than

sub ecx, 1
jne some_label

? I very much doubt that.

See for yourself. When using loop instead of sub-jne, each loop iteration takes 4 instead of 2 CPU cycles.
Although just the fact that modern compilers don't use the loop instruction at all (not even when optimizing for size) should tip you off that loop is bad news.

hamsterman wrote:
My point was that if someone is feeling paranoid about optimisations (and I'm not saying one should be), it would be logical to use assembly (for such trivial tasks at least).

That wouldn't be a good idea, because hand-written assembler code is usually slower than what the compiler generates.
The compiler has detailed knowledge about the CPU that it is targeting - it knows how many cycles it takes for a particular instruction to complete and it can reorder instructions so that there are as few pipeline stalls as possible. An average programmer doesn't have the necessary knowledge to produce assembler code of similar quality.
hamsterman wrote:
Something to learn from this thread is that compilers are smarter than you think.


Really?

Compiler: avr-gcc (WinAVR 20090313) 4.3.2.

Code1:

void Zero1(unsigned char *buf, unsigned char size)
{
	for (unsigned char i = 0; i < 100; ++i)
		buf[i] = 0;
}


  b8:	20 e0       	ldi	r18, 0x00	; 0
  ba:	30 e0       	ldi	r19, 0x00	; 0
  bc:	fc 01       	movw	r30, r24
  be:	e2 0f       	add	r30, r18
  c0:	f3 1f       	adc	r31, r19
  c2:	10 82       	st	Z, r1
  c4:	2f 5f       	subi	r18, 0xFF	; 255
  c6:	3f 4f       	sbci	r19, 0xFF	; 255
  c8:	24 36       	cpi	r18, 0x64	; 100
  ca:	31 05       	cpc	r19, r1
  cc:	b9 f7       	brne	.-18     	; 0xbc <Zero1+0x4>
  ce:	08 95       	ret


Code2:

void Zero2(unsigned char *buf, unsigned char size)
{
	unsigned char *end = buf + size;
	while (buf < end)
		*buf++ = 0;
}


  d0:	fc 01       	movw	r30, r24
  d2:	86 0f       	add	r24, r22
  d4:	91 1d       	adc	r25, r1
  d6:	01 c0       	rjmp	.+2      	; 0xda <Zero2+0xa>
  d8:	11 92       	st	Z+, r1
  da:	e8 17       	cp	r30, r24
  dc:	f9 07       	cpc	r31, r25
  de:	e0 f3       	brcs	.-8      	; 0xd8 <Zero2+0x8>
  e0:	08 95       	ret


Conclusion: beautiful code gets beautiful optimizations =)
closed account (EzwRko23)
This only shows that gcc is a poor compiler for AVR, or that you compiled without proper optimization options. Besides, they don't spend as much time writing optimizations for AVR as they do for Intel (and despite that, I've heard many times that they still lose to the Intel compiler team).