Code objects alignment, memory and execution by CPU in C++

Hi all,

I've got some info, but I would like to know exactly how the objects in a C++ program are stored in memory with regard to alignment, and then fetched by the CPU for execution (producing the output) on modern systems.

int main() {
short sh {2};
int a [] {1,2,3};

for(int i=0; i<3; ++i)
  a[i] *= sh;
}

This is just a simple piece of code to start with. My intention is to understand the whole story.

kigar64551 (815)

In structs, padding is added automatically, so that each field is located at an offset (from the start of the struct) that is an integer multiple of the field's alignment, which for the built-in scalar types usually equals the field's size. For example, an int field would be located at an offset that is an integer multiple of 4 (assuming sizeof(int) == 4), a double field would be located at an offset that is an integer multiple of 8 (assuming sizeof(double) == 8), and so on...
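
A minimal sketch of the padding rule above. Sample and is_multiple are names made up for illustration. The static_asserts only check what holds on any conforming implementation; the concrete offsets in the comments are what x86-64 compilers typically produce and are not guaranteed.

```cpp
#include <cstddef>  // offsetof, std::size_t

// Hypothetical struct, used only to illustrate padding.
struct Sample {
    char   c;  // offset 0
    int    i;  // typically offset 4 (3 padding bytes after c)
    double d;  // typically offset 8
};

// True if `offset` is a whole multiple of `alignment`.
constexpr bool is_multiple(std::size_t offset, std::size_t alignment) {
    return offset % alignment == 0;
}

// Each member sits at a multiple of its own alignment...
static_assert(is_multiple(offsetof(Sample, i), alignof(int)), "i is misaligned");
static_assert(is_multiple(offsetof(Sample, d), alignof(double)), "d is misaligned");
// ...and the struct size is a multiple of the struct's alignment,
// so that every element of a Sample array stays aligned.
static_assert(is_multiple(sizeof(Sample), alignof(Sample)), "bad struct size");
```

Printing offsetof(Sample, i) and sizeof(Sample) on your own platform shows where the padding bytes actually went.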

Compilers usually have ways to disable the padding/alignment in structs. For example, GCC can use:
__attribute__((__packed__))

I think the alignment of "global" and "local" variables is platform-specific, but usually the default alignment is sufficient for all "scalar" types, so probably (at least) 8 on modern 64-Bit systems. It may not be sufficient for some "vector" types (SSE, AVX, etc.), though...

Also, this may depend on the specific compiler. You often read that older compilers had "alignment issues" 😏

Again, most compilers have ways to explicitly specify the (minimum) alignment of a variable. For example, GCC can use:
__attribute__((aligned(N)))

As far as dynamically allocated memory is concerned, I think malloc() and friends do not give you any guarantees that the start address of the allocated memory block is aligned to anything bigger than 1 – even though, in reality, malloc() on most platforms has a certain minimum alignment. There are functions like _aligned_malloc() (MSVC) or posix_memalign() (Unix) to request a specific alignment.

(Effectively, these functions allocate a somewhat bigger block and then shift the start address as needed)
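
The "bigger block, then shift" idea can be sketched with std::align, which bumps a pointer forward to the next suitably aligned address inside an existing buffer. carve_aligned and is_aligned are hypothetical helpers, and this is not how any particular allocator is implemented; a real one would also have to remember the original pointer so that the block can be freed later.

```cpp
#include <cassert>
#include <cstddef>  // std::size_t
#include <cstdint>  // std::uintptr_t
#include <memory>   // std::align

// True if address p is a multiple of n.
inline bool is_aligned(const void* p, std::size_t n) {
    assert(n != 0);  // an alignment of zero is meaningless
    return reinterpret_cast<std::uintptr_t>(p) % n == 0;
}

// Hypothetical helper: returns the first address inside [buf, buf + space)
// that is aligned to `alignment` and still has room for `size` bytes,
// or nullptr if the buffer is too small.
inline void* carve_aligned(void* buf, std::size_t space,
                           std::size_t alignment, std::size_t size) {
    // std::align requires a power-of-two alignment.
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    // std::align updates buf and space in place; these are local copies.
    return std::align(alignment, size, buf, space);
}
```

For example, to get 64 usable bytes on a 64-byte boundary, over-allocate 64 + 63 bytes and pass that buffer to carve_aligned; the worst-case shift is 63 bytes.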

On the hardware-level, I think that, on most CPUs, the load or store instructions require that the source/target memory address is a multiple of N, when loading/storing an element of size N (in bytes). Often "unaligned" loads or stores are supported by the CPU, but they are less efficient, because each "unaligned" load or store effectively has to be translated into multiple "aligned" loads or stores!

I think "vector" extensions, like SSE, often have stricter alignment requirements and will cause an exception when trying an "unaligned" load or store. Later versions of SSE added support for "unaligned" load/store, but still with certain performance penalty...

seeplus (6597)

When dealing with a struct, MSVC has the 'Struct Member Alignment' property (/Zp), which sets the byte boundary (1, 2, 4, 8, 16 or default) on which struct members are aligned. See:
https://learn.microsoft.com/en-us/cpp/cpp/alignment-cpp-declarations?view=msvc-170

There is also std::align in C++:
https://en.cppreference.com/w/cpp/memory/align

See also:
https://en.cppreference.com/w/cpp/language/alignas
https://en.cppreference.com/w/cpp/language/alignof
https://en.cppreference.com/w/cpp/types/alignment_of
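
A small sketch of the standard alignas/alignof pair from those links, as the portable counterpart of the compiler-specific attributes mentioned earlier. Padded, is_aligned and local_is_aligned are illustrative names, not anything from the linked pages.

```cpp
#include <cassert>
#include <cstddef>  // std::size_t
#include <cstdint>  // std::uintptr_t

// Hypothetical over-aligned type: every object starts on a 64-byte boundary.
struct alignas(64) Padded {
    int value;
};

static_assert(alignof(Padded) == 64, "alignas sets the alignment of the type");
static_assert(sizeof(Padded) % 64 == 0, "size is rounded up to a multiple of it");

// True if address p is a multiple of n.
inline bool is_aligned(const void* p, std::size_t n) {
    assert(n != 0);  // an alignment of zero is meaningless
    return reinterpret_cast<std::uintptr_t>(p) % n == 0;
}

// Standard C++ spelling of __attribute__((aligned(16))) on a local variable.
inline bool local_is_aligned() {
    alignas(16) short x = 42;
    return is_aligned(&x, 16);
}
```

Note that alignas on a variable does not change its type: alignof(short) stays whatever the platform uses, only this particular object is placed more strictly.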

frek (576)

If we take the example as the starting point, I assume sh is located at an address that is a multiple of 2, and a (likely somewhere else in memory) at an address that is a multiple of 4 (occupying three consecutive 4-byte slots):

&sh =   00000069600FFB54
&a[0] = 00000069600FFB78
&a[1] = 00000069600FFB7C
&a[2] = 00000069600FFB80

If we set structs/classes aside, I assume different objects and arrays are stored in memory according to their type's size/alignment and not necessarily back-to-back (contiguously), even though they have been declared that way in the code. Am I right up to this point?

kigar64551 (815)

Yes, if you do not explicitly specify an alignment, e.g. via __attribute__((aligned(N))), then the compiler will use the "default" alignment for the target platform. This probably means that "global" or "local" variables will be aligned to an integer multiple of their type size.

Consequently, the short variable would likely be aligned to a multiple of 2, and the int array (and therefore each element of the array) would likely be aligned to a multiple of 4, assuming that sizeof(short) == 2 and that sizeof(int) == 4.

I don't think that, in general, you can assume that separate variables are stored "back to back" in memory, even if they are defined immediately after each other in the source code. It could very well happen that they are stored "back to back" in memory, but don't rely on it. They certainly will not be stored "back to back", if additional padding is required to ensure the requested (or default) alignment.

Consider:

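#include <cstdio>

short x = 42;
short y = 43;
long  z = 44;

int main() {
        // %p expects a void*, so each address is cast explicitly
        std::printf("%zu, %p\n", sizeof(x), static_cast<void*>(&x));
        std::printf("%zu, %p\n", sizeof(y), static_cast<void*>(&y));
        std::printf("%zu, %p\n", sizeof(z), static_cast<void*>(&z));
}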

2, 0x564b6dbdf010
2, 0x564b6dbdf012
8, 0x564b6dbdf018

As we can see, the two short variables are aligned to a multiple of 2, whereas the long variable is aligned to a multiple of 8. Because of this, the short variables happen to be stored "back to back", but there needs to be some extra space/gap (4 bytes) before the long variable.
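
The 4-byte gap can be reproduced by rounding the first free address up to the next multiple of the alignment. A minimal sketch; align_up is a hypothetical helper, and the addresses are simply the ones printed above (they will differ from run to run).

```cpp
#include <cstdint>  // std::uint64_t

// Rounds addr up to the next multiple of alignment.
// alignment must be a power of two.
constexpr std::uint64_t align_up(std::uint64_t addr, std::uint64_t alignment) {
    return (addr + alignment - 1) & ~(alignment - 1);
}

// y starts at ...012 and is 2 bytes long, so the first free byte is ...014.
constexpr std::uint64_t first_free = 0x564b6dbdf012ULL + 2;

// The next multiple of 8 is ...018, which is where z was placed.
static_assert(align_up(first_free, 8) == 0x564b6dbdf018ULL, "address of z");

// The unused bytes in between are the gap: 4 bytes.
static_assert(align_up(first_free, 8) - first_free == 4, "size of the gap");
```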

For comparison:

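#include <cstdio>

__attribute__((aligned(16))) short x = 42;
__attribute__((aligned(16))) short y = 43;
__attribute__((aligned(16))) long  z = 44;

int main() {
        // %p expects a void*, so each address is cast explicitly
        std::printf("%zu, %p\n", sizeof(x), static_cast<void*>(&x));
        std::printf("%zu, %p\n", sizeof(y), static_cast<void*>(&y));
        std::printf("%zu, %p\n", sizeof(z), static_cast<void*>(&z));
}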

2, 0x562f1ba32010
2, 0x562f1ba32020
8, 0x562f1ba32030

Now everything is aligned to a multiple of 16, because that's what we requested – resulting in even more extra space/gaps.

frek (576)

short sh{ 2 };
short sh2{ 3 };
long long d{ 20};

std::cout << "&sh =  " << &sh << '\n';
std::cout << "&sh2 = " << &sh2 << '\n';
std::cout << "&d =   " << &d << '\n';

&sh =  00000093A7CFF994
&sh2 = 00000093A7CFF9B4
&d =   00000093A7CFF9D8

frek (576)

Apart from how close or far apart the objects declared in the code are stored in memory, I'd like to know about the computations. As far as I understand, the data (those objects' values) are fetched from memory into caches and then into registers, which the CPU uses to perform computations. Are computations necessarily performed using registers only?
I have also heard of memory operands, register operands, and immediate operands.

jonnin (11441)

the names are misleading. A memory operand isn't an activity done in memory or outside the cpu; it's something like (for intel) mov [ebx], eax, where [ebx] is a pointer (the address is held in ebx, but the data is in memory at that address) -- if I got that right (it's been many years since I wrote assembly directly; I read it now and then but no longer write it). If you are doing math or anything on it, it still has to go to a register: all you can do with them directly is shuffle to and from.

your model is correct for all systems that I have used (memory to cache to cpu register then perform actions, then reverse it to store).

there may be a small # of tricks you can play (not counting offloading work to a GPU or other coprocessor type setup) to get a piece of hardware to do a computation for you. I am not aware of any where this is useful because the time spent moving it to and from an external is more than it would take to do it in the CPU.

kigar64551 (815)

The CPU cache stores chunks of memory that are called "cache lines". Common cache line sizes are 32, 64 and 128 bytes.

You can think of the entire memory as being sub-divided into non-overlapping chunks.

Generally, when a certain memory address needs to be accessed, the CPU first checks if the specific chunk (cache line) that contains the desired memory address is already present in the cache. If so, then the address can be accessed from the cache right away – which is fast. Otherwise, the required chunk (cache line) needs to be transferred from the RAM into the CPU cache first – which is slow.
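
The "non-overlapping chunks" picture boils down to integer division on the address. A minimal sketch, assuming 64-byte cache lines (real CPUs may use another size); kLineSize, line_index and straddles_line are illustrative names.

```cpp
#include <cstdint>  // std::uint64_t

// Assumption: 64-byte cache lines. Check your CPU's documentation.
constexpr std::uint64_t kLineSize = 64;

// Index of the cache-line-sized chunk that contains addr.
constexpr std::uint64_t line_index(std::uint64_t addr) {
    return addr / kLineSize;
}

// True if an access of `size` bytes (size >= 1) starting at addr
// touches more than one cache line.
constexpr bool straddles_line(std::uint64_t addr, std::uint64_t size) {
    return line_index(addr) != line_index(addr + size - 1);
}

// An aligned 8-byte access stays inside one line...
static_assert(!straddles_line(56, 8), "bytes 56..63 are all in line 0");
// ...while an unaligned one can spill into the next line.
static_assert(straddles_line(60, 8), "bytes 60..67 span lines 0 and 1");
```

A straddling access needs two cache lines instead of one, which is one reason unaligned loads and stores can be slower.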

Some CPU architectures, especially so-called "RISC" architectures, only do computations with data in the CPU registers. There are special load and store instructions, which load data from the RAM/Cache into a register, or store data from a register back into the RAM/Cache. But, all other instructions only operate on the data that is already in the registers, and the result is then placed in a register again.

Conversely, other CPU architectures, especially so-called "CISC" architectures, provide instructions that support memory operands. These instructions can take their input directly from a memory address (i.e. RAM/Cache) and/or write their result directly to a memory address (i.e. RAM/Cache). Internally, CISC instructions are broken down into multiple "micro" instructions. So, even when the CISC instruction can work "directly" with memory addresses, what actually happens is a load, followed by the actual computation, followed by a store.
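
To connect this back to the loop in the first post, here is a minimal sketch; scale is a hypothetical helper, not code from the thread. On a load/store (RISC-style) machine the statement a[i] *= sh has to become a separate load, multiply and store. On x86 a compiler may instead fold the load into the multiply as a memory operand. What a particular compiler actually emits depends on the target and the optimisation level, so it is worth checking the generated assembly rather than assuming.

```cpp
#include <cassert>
#include <cstddef>  // std::size_t

// Multiplies every element of a[0..n) by sh, in place,
// like the loop in the first post.
inline void scale(int* a, std::size_t n, short sh) {
    assert(a != nullptr || n == 0);  // precondition: valid array or empty range
    for (std::size_t i = 0; i < n; ++i) {
        // Conceptually: read a[i] from memory, multiply, write the result back.
        // sh is converted to int before the multiplication.
        a[i] *= sh;
    }
}
```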

"Immediate" operands are fixed values (constants) that are hard-coded into the program code. Some CPU instructions can take such immediate operands (fixed values) as input – instead of taking their input values from registers, or from memory addresses.

mbozzi (3933)

As far as dynamically allocated memory is concerned, I think malloc() and friends do not give you any guarantees that the start address of the allocated memory block is aligned to anything bigger than 1 – even though, in reality, malloc() on most platforms has a certain minimum alignment. There are functions like _aligned_malloc() (MSVC) or posix_memalign() (Unix) to request a specific alignment.

malloc and friends are required to give you storage that is suitably aligned for "anything". Specifically, they return pointers aligned at least as strictly as alignof(std::max_align_t).

posix_memalign et al. can be useful to over-align objects, in case alignof(std::max_align_t) isn't strict enough. For example, a programmer might manually align objects to cache block boundaries to eliminate false sharing, or to meet the extended alignment requirements of vector instructions.

Additionally, since C++17, plain old new is capable of allocating over-aligned objects.
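
Both points can be checked with a short sketch. It assumes the code is compiled as C++17 or later; CacheLine, is_aligned and the two check functions are made-up names for illustration.

```cpp
#include <cassert>
#include <cstddef>  // std::max_align_t, std::size_t
#include <cstdint>  // std::uintptr_t
#include <cstdlib>  // std::malloc, std::free
#include <memory>   // std::make_unique

// Hypothetical over-aligned type (stricter than alignof(std::max_align_t)).
struct alignas(64) CacheLine {
    int counter = 0;
};

// True if address p is a multiple of n.
inline bool is_aligned(const void* p, std::size_t n) {
    assert(n != 0);  // an alignment of zero is meaningless
    return reinterpret_cast<std::uintptr_t>(p) % n == 0;
}

// malloc returns storage aligned for any type with fundamental alignment.
inline bool malloc_is_max_aligned() {
    void* p = std::malloc(sizeof(std::max_align_t));
    assert(p != nullptr);
    const bool ok = is_aligned(p, alignof(std::max_align_t));
    std::free(p);
    return ok;
}

// Since C++17, a plain new-expression (here via make_unique)
// honours the extended alignment of the type.
inline bool new_is_over_aligned() {
    auto p = std::make_unique<CacheLine>();
    return is_aligned(p.get(), alignof(CacheLine));
}
```

Before C++17 the second check is not guaranteed to pass, which is where posix_memalign and similar functions come in.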

frek (576)

@kigar64551, thank you.
Does the CISC architecture use registers too or does it only use memory/cache for computations?

"Immediate" operands are fixed values (constants) that are hard-coded into the program code.

Could you write a simple example to figure out those fixed values. (Do you by any chance mean constant objects, e.g., const int i {2}; ?)

kigar64551 (815)

CISC architectures (e.g. "x86") have registers too, of course. Often you want to keep "intermediate" results in registers and continue your computations with these values right away without storing/loading those "intermediate" values to/from RAM. Also keep in mind: Instructions that work with memory operands effectively load the data from RAM into some sort of "temporary" (implicit, unnamed) register, then do the actual computation, then store the "temporary" value back to RAM. It's "syntactic sugar" for assembly programmers.

Could you write a simple example to figure out those fixed values. (Do you by any chance mean constant objects, e.g., const int i {2}; ?)

Whenever the compiler can figure out the value of one of the operands as "fixed" (constant), at compile time, it can (and probably will) translate that value into an "immediate" operand. A simple code example would be i++ 😏

(Since the value to be added to i is the constant value 1, it can be hard-coded as "immediate" operand; otherwise we would have to first load the value "1" into a register and then add it to i from there, which clearly is an unnecessary overhead)
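
As a hedged illustration (increment and add are made-up names): in the first function the value 1 is known at compile time, so a compiler can typically encode it directly in the instruction as an immediate. In the second, n is only known at run time, so it has to come from a register or from memory. The exact instructions depend on the compiler, target and optimisation level.

```cpp
// The constant 1 is part of the code itself: a candidate for an immediate operand.
constexpr int increment(int i) {
    return i + 1;
}

// n is an ordinary argument: when called with run-time values,
// it arrives in a register (or on the stack), not as an immediate.
constexpr int add(int i, int n) {
    return i + n;
}

static_assert(increment(41) == 42, "adds the hard-coded constant 1");
static_assert(add(40, 2) == 42, "adds a value supplied by the caller");
```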

frek (576)

A few more questions
1) Some examples of CISC processors include Intel x86 CPUs, System/360, VAX, PDP-11, Motorola 68000 family, and AMD. Examples of RISC processors include Alpha, ARC, ARM, AVR, MIPS, PA-RISC, PIC, Power Architecture, and SPARC. My CPU is Intel(R) Core(TM) i3-4160 CPU @ 3.60GHz, 3600 MHz, 2 Core(s), 4 Logical Processor(s) with the System Type: x64-based PC
Since I'm not sure where the RISC and CISC categories sit within the wider category of CPU architectures, I'm not sure which type my processor belongs to.

2) If I'm not mistaken, it's generally said that x86 and x64 CPUs have 32-bit and 64-bit registers, respectively. Is that correct?
3)

Whenever the compiler can figure out the value of one of the operands as "fixed" (constant), at compile time, it can (and probably will) translate that value into an "immediate" operand.

How about this compile time statement: constexpr int i {3};?

kigar64551 (815)

(1) I don't think there is a very strict definition of "RISC" and "CISC". They are more like design philosophies (or marketing tools) for processor instruction sets. Usually the x86 architecture (as well as "x64", aka "AMD64") is considered CISC, whereas ARM (including "ARM64") as well as PowerPC and MIPS are considered RISC. But, keep in mind, that even CISC processors internally translate each "complex" instruction into a sequence of RISC-like "micro" instructions. So, the differences between CISC and RISC are kind of blurry 😏

(2) The original x86 (32-Bit) architecture has 32-Bit "general purpose" registers. But it also has 80-Bit "floating point" (FPU) registers! Also, some of the newer x86 (32-Bit) processors support vector extensions like MMX or SSE, which use special "vector" registers with a size of 64-Bit (MMX) or 128-Bit (SSE). Now, with the x64 (64-Bit) architectures, all "general purpose" registers have been widened to 64 bits! But the "floating point" (FPU) registers still are 80-Bit in size. The "vector" registers are 128-Bit (SSE), 256-Bit (AVX), or 512-Bit (AVX-512) in size.

(MMX has the special weirdness that its 64-Bit "MM" registers are mapped into the 80-Bit "floating point" registers)

(3) I think this would be a good candidate. But it all depends on how the variable is used! Even a mutable (non-const) variable can be translated into an "immediate" operand – in theory – provided that the compiler can figure out (e.g. by static code analysis) that its value will always be the same at the relevant point of the program. Marking the variable as const helps the compiler in optimizing, though.

seeplus (6597)

RE 1). Have a look at:
https://en.wikipedia.org/wiki/Reduced_instruction_set_computer

frek (576)

Thank you.

Now let's look at this and consider the way it uses the RAM:

short sh {1234};
int i {1234};

To me, using short here is better, at least in terms of RAM usage. Although registers might be 16, 32, 64 or more bits in size, the memory used for short is usually 2 bytes, while for int it's usually 4 bytes. Am I correct?

seeplus (6597)

Have a look at this on Godbolt:
https://godbolt.org/z/PMq18G4MW

You can choose different compilers (including RISC Arm etc) and see what is the resulting compiled assembler code.

frek (576)

This is the code

int main() {
    short sh {1234};
    int i {1234};

    sh = 432;
    i = 432;
}

And this is the Assembly:

 mov    WORD  PTR [rbp-0x2],0x4d2
 mov    DWORD PTR [rbp-0x8],0x4d2
 mov    WORD  PTR [rbp-0x2],0x1b0
 mov    DWORD PTR [rbp-0x8],0x1b0

For the first and second statements, I assume the values are put/moved into two registers (rbp-0x8, rbp-0x2), although I'm not sure about their difference (one ends in 8 and the other in 2).

But this is about execution; what I meant was the number of bytes the objects take up in main memory.

kigar64551 (815)

rbp is the frame pointer (also called the base pointer). It holds the base address of the "current" stack frame, i.e. the stack frame of the current function.

Note that "local" variables, such as sh and i in your code, are allocated on the stack. Hence, their addresses are expressed relative to the frame pointer (rbp).
https://en.wikipedia.org/wiki/Stack_(abstract_data_type)#/media/File:ProgramCallStack2_en.svg

So, the instruction mov WORD PTR [rbp-0x2],0x4d2 stores the WORD (2 bytes) value 0x4D2 (decimal: 1234) at the memory location rbp-0x2, i.e. at the address of the local variable sh. Similarly, the instruction mov DWORD PTR [rbp-0x8],0x4d2 stores the DWORD (4 bytes) value 0x4D2 (decimal: 1234) at the memory location rbp-0x8, i.e. at the address of the local variable i.

How do we know that rbp-0x2 and rbp-0x8 are the addresses of sh and i? Well, the pure assembly code does not contain this information! Assembly code does not know about "variables". But it becomes clear when comparing the assembly code with the original C code 😏

Note that, in this example, [rbp-0x2] is a memory operand, whereas 0x4d2 is an immediate value. The value itself never passes through a general-purpose register; rbp is only used to compute the address.

jonnin (11441)

yes, shorts take up less memory than ints, which in turn means more of them fit into each cache line and you get fewer cache misses, meaning that a number of 'wait on memory stuff' delays are bypassed. The more you have of them in a large array/vector type construct, the more this will benefit you, while having just a few of them has no benefit at all (if all your ints already fit in one cache line, squeezing them into half a line of shorts gains nothing).
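
A rough way to put numbers on that, assuming 64-byte cache lines and an array that starts on a line boundary. kLineSize and lines_needed are illustrative names, and fixed-width types are used so that the element sizes are not platform-dependent.

```cpp
#include <cstddef>  // std::size_t
#include <cstdint>  // std::int16_t, std::int32_t

// Assumption: 64-byte cache lines.
constexpr std::size_t kLineSize = 64;

// Number of cache lines covered by `count` contiguous elements of type T,
// assuming the first element starts on a cache-line boundary.
template <typename T>
constexpr std::size_t lines_needed(std::size_t count) {
    return (count * sizeof(T) + kLineSize - 1) / kLineSize;
}

// 1024 16-bit values cover half as many lines as 1024 32-bit values.
static_assert(lines_needed<std::int16_t>(1024) == 32, "2048 bytes / 64");
static_assert(lines_needed<std::int32_t>(1024) == 64, "4096 bytes / 64");

// With only a handful of elements there is nothing to gain: one line either way.
static_assert(lines_needed<std::int16_t>(8) == 1, "16 bytes fit in one line");
static_assert(lines_needed<std::int32_t>(8) == 1, "32 bytes fit in one line");
```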

there are other concerns about performance, though. Some designs may be hiding a promotion (a risc machine for example may ONLY work with 64 bit integers in the cpu and could be quietly promoting char/short etc to 64 bit and back again) so it can support fewer instructions (I don't know if anything major does this kind of thing, but some tiny embedded cpus may). On the old x87 FPU, all floating point is promoted to the biggest type and back regardless, as I understand it, to help with loss of precision as well as to reduce supported operations / instruction sets. The biggest is a bit bigger than double, by the way: 80 bits.

On top of that, other concerns come into play and can govern whether going to shorts vs ints is actually worth it. Your best bet is to either research the assembly created or how the cpu handles the difference, or maybe easier, just time it both ways in your compiler to see if you gain anything. Note that for floating point specifically, this violates the 'always use doubles' rule of thumb that is thrown about at beginners. Using floats can be faster due to memory profile/handling and there is a time and place to use the smaller size if you need the extra speed or space.
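
"Time it both ways" could look roughly like the sketch below. Timing and time_sum are hypothetical names, and this is not a rigorous benchmark: single runs are noisy, and an optimiser may transform the loop, so treat the numbers only as a first impression.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>  // std::size_t
#include <numeric>  // std::accumulate
#include <vector>

// Result of one measurement: the computed sum and how long it took.
struct Timing {
    long long sum;
    std::chrono::nanoseconds elapsed;
};

// Fills a vector with `count` ones of type T and times summing it.
template <typename T>
Timing time_sum(std::size_t count) {
    const std::vector<T> data(count, T{1});
    const auto start = std::chrono::steady_clock::now();
    const long long sum = std::accumulate(data.begin(), data.end(), 0LL);
    const auto stop = std::chrono::steady_clock::now();
    assert(sum == static_cast<long long>(count));  // sanity check: every element is 1
    return {sum, std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)};
}
```

Call time_sum<short>(n) and time_sum<int>(n) (or float and double) with a large n, repeat each several times with optimisation enabled, and compare the elapsed values.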
