x86 Assembly - Indirect addressing

Mar 4, 2017 at 4:04am
In x86/x86-64, is there any point in doing, for example,

mov eax, [ebx+ecx*4]

instead of

imul ecx, 4
add ecx, ebx
mov eax, [ecx]

other than code size and having to modify a register?
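To put a rough number on the code-size part, here's a minimal C++ sketch of my own that just stores the standard 32-bit encodings of each form and prints their sizes. The byte values are the textbook IA-32 encodings; the program itself is only an illustration.

#include <cstdint>
#include <cstdio>
#include <vector>

int main()
{
    // mov eax, [ebx+ecx*4]  -> opcode + ModRM + SIB
    std::vector<std::uint8_t> indexed  = { 0x8B, 0x04, 0x8B };

    // imul ecx, ecx, 4      -> 6B C9 04
    // add  ecx, ebx         -> 01 D9
    // mov  eax, [ecx]       -> 8B 01
    std::vector<std::uint8_t> longhand = { 0x6B, 0xC9, 0x04,
                                           0x01, 0xD9,
                                           0x8B, 0x01 };

    std::printf("indexed form : %zu bytes\n", indexed.size());   // 3
    std::printf("longhand form: %zu bytes\n", longhand.size());  // 7
}

So roughly 3 bytes versus 7 for this particular pair of sequences.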
Mar 5, 2017 at 5:51am
I'm no expert on assembly, but one looks more convenient than the other. I'd also expect a difference in speed; I'd suggest benchmarking it.
Mar 6, 2017 at 3:34am
If you disassemble the result of the first, does it not end up exactly like your second? I'd bet the resulting machine instructions would be almost identical.

If your assembler lets you write the first, there's no reason to do it longhand other than readability, but I'd guess readability isn't your first concern if you're writing any significant portion of your code in assembly.
Mar 6, 2017 at 6:27am
If you disassemble the result of the first, does it not end up exactly like your second?
No, they're different opcodes; the indexed form assembles to a single instruction, not to that three-instruction sequence.

If you're doing dynamic code generation, writing three simple instructions is easier than writing a single instruction with indirect addressing.
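For example, the indexed form makes a small code generator compute both a ModRM byte and a SIB byte (and watch for the esp/ebp special cases), while the simple forms are basically opcode plus one ModRM byte. A rough C++ sketch of what that looks like, with helper names I made up (32-bit mode, no displacement, special cases ignored):

#include <cstdint>
#include <vector>

// Register numbers as encoded in ModRM/SIB: eax=0, ecx=1, edx=2, ebx=3, ...
enum Reg : std::uint8_t { EAX = 0, ECX = 1, EDX = 2, EBX = 3 };

// mov r32, [base + index*scale]  -- needs opcode + ModRM + SIB
void emit_mov_load_indexed(std::vector<std::uint8_t>& out,
                           Reg dst, Reg base, Reg index, int scale)
{
    std::uint8_t ss = scale == 8 ? 3 : scale == 4 ? 2 : scale == 2 ? 1 : 0;
    out.push_back(0x8B);                                      // opcode: mov r32, r/m32
    out.push_back(std::uint8_t(dst << 3 | 0x04));             // ModRM: mod=00, rm=100 -> SIB follows
    out.push_back(std::uint8_t(ss << 6 | index << 3 | base)); // SIB: scale/index/base
    // (still ignores the special cases: esp can't be an index,
    //  ebp as a base needs mod=01 and a displacement byte, ...)
}

// mov r32, [addr]  -- just opcode + ModRM
void emit_mov_load(std::vector<std::uint8_t>& out, Reg dst, Reg addr)
{
    out.push_back(0x8B);
    out.push_back(std::uint8_t(dst << 3 | addr));             // ModRM: mod=00
    // (esp/ebp as the address register would need special handling too)
}

int main()
{
    std::vector<std::uint8_t> code;
    emit_mov_load_indexed(code, EAX, EBX, ECX, 4);            // emits 8B 04 8B
    emit_mov_load(code, EAX, ECX);                            // emits 8B 01
}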
Mar 6, 2017 at 9:14pm
other than code size and having to modify a register?

Those are two very good reasons.
Mar 6, 2017 at 10:03pm
I'm not saying they're not; I'm asking if there are other reasons.
Mar 7, 2017 at 2:38pm
I'm asking if there are other reasons.

The only other reason I can think of is speed. Sometimes a sequence of smaller instructions is actually faster than a big complex instruction, although I'm not sure if that's true in this case.
Mar 13, 2017 at 12:30am
closed account (48T7M4Gy)
Count the number of clocks and check for any redundant transfers or storage operations between the two cases.

e.g. http://www.fermimn.gov.it/linux/quarta/x86/index.htm
Last edited on Mar 13, 2017 at 12:32am
Mar 13, 2017 at 7:22am
That reference has uncertain applicability to a modern processor. I think the Intel manuals don't even include clock specs in the instruction reference anymore.
Last edited on Mar 13, 2017 at 7:23am
Mar 13, 2017 at 1:53pm
closed account (48T7M4Gy)
And here's another one, keeping in mind that x86 is a fairly 'old' family of processors and it's not immediately clear how 64-bit fits into the tables. Intel and AMD don't appear to be all that forthcoming with the info.

How reliable these sites are is anybody's guess.

http://zsmith.co/intel_i.html#imul
Mar 13, 2017 at 3:38pm
other than code size and having to modify a register?

Two registers: ecx and EFLAGS (imul sets or clears CF/OF and leaves SF/ZF/AF/PF undefined; add overwrites all the arithmetic flags).

As for the relevant execution speeds, see http://www.agner.org/optimize/instruction_tables.pdf -
looking at the Intel Skylake table there, a memory mov has latency 2 for all addressing modes, while imul alone has latency 3 (and can only run on one execution port). Of course that only matters if your data are already in the L1 cache (for example because you're accessing [ebx+ecx*4] in a loop!)
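As a rough illustration of that loop case (my own sketch, not from this thread; it assumes GCC/clang extended inline asm on x86-64): the scaled-index form folds the address arithmetic into the load itself, so the loop body needs no extra register and no separate arithmetic instructions for the address.

#include <cstdint>

// Sum an array using a [base + index*4] load in the loop.
std::uint32_t sum_u32(const std::uint32_t* data, std::uint64_t n)
{
    if (n == 0) return 0;
    std::uint32_t total = 0, elem;
    std::uint64_t i = 0;
    asm("1:\n\t"
        "mov (%[base],%[i],4), %[elem]\n\t"  // load data[i]; address computed inside the mov
        "add %[elem], %[total]\n\t"
        "inc %[i]\n\t"
        "cmp %[n], %[i]\n\t"
        "jb 1b"
        : [total] "+r"(total), [i] "+r"(i), [elem] "=&r"(elem)
        : [base] "r"(data), [n] "r"(n)
        : "cc", "memory");
    return total;
}

Written out longhand, each iteration would need an extra register plus a shl and an add before the load, which is exactly the per-iteration overhead the addressing mode avoids.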
Mar 13, 2017 at 4:10pm
Very cool reference. Thanks for that!

So if SHL were used instead of IMUL, does that mean that, since each instruction needs the result of the previous one, doing SHL-ADD-MOV has a total latency of 3.5, while doing just MOV has a total latency of 2?
Mar 13, 2017 at 6:24pm
helios wrote:
doing SHL-ADD-MOV has a total latency of 3.5, while doing just MOV has a total latency of 2?

Possibly, depending on how the CPU optimizes that code.

In fact, I'm going to give it a spin, because I like pointless benchmarks.

Executing each piece of code 10'000'000'000 times, on a Xeon L5520, compiled with clang++. Switching the registers to rdi/rsi to match the C calling convention.

This could probably be better, but I have to get back to real work. Full program: http://coliru.stacked-crooked.com/a/9dc5bf5abcc79780
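For reference, this is roughly what the timed loop amounts to for the mov version, reconstructed from the disassembly below rather than copied from the coliru program, so treat it as a sketch (the array is a stand-in for whatever global the real program loads from):

#include <cstdint>

static std::uint32_t arr[16];   // stand-in for the global the real program reads

int main()
{
    std::uint64_t count = 10'000'000'000ULL;   // loop counter, like the rcx count below
    std::uint32_t result;
    asm volatile(
        "1:\n\t"
        "mov %[base], %%rdi\n\t"             // re-set the base pointer every iteration
        "mov $2, %%esi\n\t"                  // re-set the index every iteration
        "mov (%%rdi,%%rsi,4), %[result]\n\t" // the instruction under test
        "dec %[count]\n\t"
        "jne 1b"
        : [result] "=&r"(result), [count] "+r"(count)
        : [base] "r"(arr)
        : "rdi", "rsi", "cc", "memory");
    // for the shl/imul versions, the single load above is replaced by the
    // three-instruction shl+add+mov (or imul+add+mov) sequence
    return 0;
}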

mov version:
  400520:       bf 6c 09 60 00          mov    $0x60096c,%edi
  400525:       be 02 00 00 00          mov    $0x2,%esi
  40052a:       8b 04 b7                mov    (%rdi,%rsi,4),%eax
  40052d:       48 ff c9                dec    %rcx
  400530:       75 ee                   jne    400520 <main+0x10>


CPU Time: 8.094s
Instructions Retired: 50,022,174,000
CPI Rate: 0.408


shl version:
  400520:       bf 6c 09 60 00          mov    $0x60096c,%edi
  400525:       be 02 00 00 00          mov    $0x2,%esi
  40052a:       48 c1 e6 02             shl    $0x2,%rsi
  40052e:       48 01 fe                add    %rdi,%rsi
  400531:       48 8b 06                mov    (%rsi),%rax
  400534:       48 ff c9                dec    %rcx
  400537:       75 e7                   jne    400520 <main+0x10>

CPU Time: 12.192s
Instructions Retired: 70,037,218,000
CPI Rate: 0.439


imul version:
  400520:       bf 6c 09 60 00          mov    $0x60096c,%edi
  400525:       be 02 00 00 00          mov    $0x2,%esi
  40052a:       48 6b f6 04             imul   $0x4,%rsi,%rsi
  40052e:       48 01 fe                add    %rdi,%rsi
  400531:       48 8b 06                mov    (%rsi),%rax
  400534:       48 ff c9                dec    %rcx
  400537:       75 e7                   jne    400520 <main+0x10>

CPU Time: 12.124s
Instructions Retired: 70,038,126,000
CPI Rate: 0.437


Looks like shl and imul took the same time (the difference is noise; it skewed the other way on another run).
Mar 13, 2017 at 10:52pm
Curious how the SHL version slowed down by about 50%, roughly as the latency numbers predict, but the IMUL version didn't slow down any further despite imul's higher latency.
I guess that settles that.
Mar 14, 2017 at 3:55am
I'm guessing the CPU saw an imul by a power of 2 and executed an shl instead.
Mar 14, 2017 at 4:51am
I guess, but it's interesting that the CPU has time to perform those kinds of checks.
Topic archived. No new replies allowed.