difference in CPU time for two similar lines, thanks

Hi all. There is a while loop in my program, where IterZNext, IterZ are pointers to nodes in a list. The nodes in the list are of type struct with a field called "Index".
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
double xx = 20.0;
double yy = 10000.0;
double zz;		
while (IterZNext!=NULL && NextIndex<=NewIndex)
{
	IterZ=IterZNext;
	IterZNext = IterZ->Next;
	if (IterZNext!=NULL)
	{
		zz = xx + yy;
                NextIndex1 = IterZNext->Index; // line (*)
		NextIndex = IterZNext->Index;  // line (**)
		IterZNext->Index;
	}
}


When I profiled my program, I found the line (*)

NextIndex1 = IterZNext->Index;

consumes most of CPU time (2.193s), while the line (**)

NextIndex = IterZNext->Index;

which is all most the same with the line (*) only uses 0.093s. I used the Intel VTune Amplifier to see the assembly of these two lines, which is as follows:
1
2
3
4
5
6
7
8
9
10
11
12
	
Address	Line	Assembly	               CPU Time Instructions Retired
Line (*):
0x1666	561	mov eax, dword ptr [ebp-0x44]	0.015s	50,000,000
0x1669	561	mov ecx, dword ptr [eax+0x8]		
0x166c	561	mov dword ptr [ebp-0x68], ecx	2.178s	1,614,000,000

Line (**):
0x166f	562	mov byte ptr [ebp-0x155], 0x1	0.039s	80,000,000
0x1676	562	mov eax, dword ptr [ebp-0x44]	0.027s	44,000,000
0x1679	562	mov ecx, dword ptr [eax+0x8]		
0x167c	562	mov dword ptr [ebp-0x5c], ecx	0.026s	94,000,000


If I change the order of the line (*) and the line (**), then the program changes to

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
double xx = 20.0;
double yy = 10000.0;
double zz;		
while (IterZNext!=NULL && NextIndex<=NewIndex)
{
	IterZ=IterZNext;
	IterZNext = IterZ->Next;
	if (IterZNext!=NULL)
	{
		zz = xx + yy;
                NextIndex = IterZNext->Index;  // line (**)
                NextIndex1 = IterZNext->Index; // line (*)
		IterZNext->Index;
	}
}


and the result for assembly changes to

1
2
3
4
5
6
7
8
9
10
11
Address	Line	Assembly	CPU Time	Instructions Retired
Line (**):
0x1666	560	mov byte ptr [ebp-0x155], 0x1	0.044s	84,000,000
0x166d	560	mov eax, dword ptr [ebp-0x44]	0.006s	2,000,000
0x1670	560	mov ecx, dword ptr [eax+0x8]	0.001s	4,000,000
0x1673	560	mov dword ptr [ebp-0x5c], ecx	1.193s	1,536,000,000

Line (*):
0x1676	561	mov eax, dword ptr [ebp-0x44]	0.052s	128,000,000
0x1679	561	mov ecx, dword ptr [eax+0x8]		
0x167c	561	mov dword ptr [ebp-0x68], ecx	0.034s	112,000,000


In this case, line (**) uses most of CPU time (1.245s) while line (*) only uses 0.086s.

Could someone tell me:
(1) Why does it take so long to make the first assignment? Notice that the line zz=xx+yy only uses 0.058s. Is this related to the cache misses? since all nodes in the list are dynamically genereated.
(2) Why is there huge difference in CPU time between this two lines?

Thanks!

Last edited on
Interesting. This is a topic I know NOTHING about, hehe. But even with my ignorance, I'll guess: Your compiler detects that the second assignment uses the same value as the first one, and then the code for the second assignment doesn't need to move the right-hand side into the register again and it just uses the value used before.
thanks for your reply webJose! but it is unusual that it takes so long to make the first assignment. (Notice that the line zz=xx+yy only uses 0.058s.) Maybe related to the cache misses? since all nodes in the list are dynamically genereated.
Well, why it takes so long to run the first assignment is an entirely different question than "why there is such difference in CPU between assignment 1 and 2?". I guessed an answer for your original question; for this new question of yours, I have no answer or guess. :-(
Thanks, webJose! but from the profile of assembly, for the 2nd assignment, it seems to me that the RHS is moved into the register:

Line (*):
0x1666 561 mov eax, dword ptr [ebp-0x44] 0.015s 50,000,000
0x1669 561 mov ecx, dword ptr [eax+0x8]
0x166c 561 mov dword ptr [ebp-0x68], ecx 2.178s 1,614,000,000


Line (**):
0x166f 562 mov byte ptr [ebp-0x155], 0x1 0.039s 80,000,000
0x1676 562 mov eax, dword ptr [ebp-0x44] 0.027s 44,000,000
0x1679 562 mov ecx, dword ptr [eax+0x8]
0x167c 562 mov dword ptr [ebp-0x5c], ecx 0.026s 94,000,000
waiting for reply....
Bear in mind this is all intelligent guesswork:

1) The stack frame will always be in the cache because you're accessing it all the time. So any reads or writes around ebp will be fast.
2) The instruction at 0x1669 is the dereference of IterZNext (eax is the value of IterZNext and Index is 8 bytes into the struct)
3) This dereference is likely to cause a cache miss as we have had no reason to go anywhere near this bit of memory before.
4) The instruction at 0x1679 does the dereference again, but this time the memory is in the cache. As we haven't accessed anything to cause the cache to drop it (remember the stack frame is always in cache)

I think the rather more interesting thing is the instruction at 0x166f. Which local variable is at ebp-0x155 and why are we setting it to 0x1?
Ah, but wait!!


Nah! I still got nothing. :-)
Thanks, kev82. I think you are probably right. Just one more question: if the instruction at 0x1669 is the dereference of IterZNext which causes the cache miss, why does the instruction at 0x166c rather than the instruction at 0x1669 consume most of CPU time? Is this a bug of Intel VTune Amplifier?
I don't know anything about the VTune Amplifier, but as there are no times next to 0x1669, I assume the times next to 0x166c is for both?
Topic archived. No new replies allowed.