Memory alignment

So I know that memory alignment to a 1 byte boundary may be used in special circumstances, let's say when you need to lay out data in memory to conform with network protocol standards or file formats.

It is said that using padding, although it uses more memory, allows the CPU to execute fewer instructions. Why is this? Why would a padded structure in memory be quicker to access and write to than, let's say, a struct that has no padding, or in other words is packed to a one byte boundary?

thanks
Why would a padded structure in memory be quicker to access and write to than, let's say, a struct that has no padding, or in other words is packed to a one byte boundary?

This isn't a hard-and-fast rule. Some systems cannot access misaligned data, i.e., trying to do so will cause a hardware exception. Other systems do not incur any speed penalty for misaligned accesses. Others still perform the memory access, except it's a different access than the one you asked for.

Misaligned accesses are typically slow because a processor's memory subsystem is designed in such a way that memory is subdivided into small "chunks" of 2, 4, 8, etc. bytes.

If each x represents one byte of data we're interested in (e.g., one byte of a 4-byte integer), then when the integer is aligned, it doesn't cross any 4-byte boundary:
[ x x x x ] [ - - - - ]
But when it is misaligned, it crosses one:
[ - x x x ] [ x - - - ]
In which case the processor must load both four-byte chunks and then perform the bitmath required to reconstruct the original data.

memory alignment to a 1 byte boundary may be used in special circumstances

Unless your solution is tied to a particular compiler & target, under-aligning data (i.e., declaring packed structs via compiler extension) is most likely not the right way to solve the problem.
A portable and strictly correct way is generally to std::memcpy to or from an array of char.
"Alignment" means that the memory address of an object is a multiple of some number, usually a power of 2. Therefore, alignment to 1 byte is having no alignment. The most common alignments are to 4, 8, 16, and 32 bytes.

It is said that using padding although using more memory allows the CPU to execute fewer instructions
This is not entirely correct.
* Some architectures (this is rare now) require aligned memory accesses, and generate an exception on unaligned accesses. Thus if the compiler can't tell whether an address is aligned at compile time, it needs to generate more code to ensure the access will be aligned.
* Some architectures can access memory unaligned, but aligned memory accesses are faster. This is for the most part the case on x86.
* Some architectures with unaligned accesses contain special purpose instructions that require aligned accesses. This is the case with SIMD instruction sets on x86. SIMD instructions typically require fairly wide alignments; up to 256 bits, IIRC.

why is this?
They're consequences of the electronic circuit design. Requiring alignment means that the circuit can be simpler and (more importantly) smaller and more energy-efficient.
In cases where unaligned accesses are slower, that's a consequence of the memory architecture. It's simpler to design a memory device (be it RAM, a hard disk, or whatever) that can only work in large blocks. In the case of x86-64, RAM is accessed in blocks of 64 bits. You can read 64 aligned bits in a single operation, but to read 64 unaligned bits you need to do one read for the low bits and another for the high bits.

EDIT:
Unless your solution is tied to a particular compiler & target, under-aligning data (i.e., declaring packed structs via compiler extension) is most likely not the right way to solve the problem.
It's so convenient, though! The language really needs a way to a) specify non-alignment in the type (e.g. unaligned<IPPacket>), and b) request automatic conversions between aligned and unaligned versions of the same type.
First off, thanks guys great answers :)

I'm not sure if I'm interpreting this correctly, but let's say we have a processor that executes more efficiently when memory is aligned, and we have the following: an int, a char, and a short.

#pragma pack(1)

struct {
    int a;   // 4 bytes
    char b;  // 1 byte
    short c; // 2 bytes
};


so the processor reads data from memory (on a 64 bit system) 64 bits at a time, so here this struct shouldn't be a problem because it's < 64 bits. This brings me onto a side question: CPU registers are 64 bits on a 64 bit system, so they can work with 64 bit ints more efficiently and can store 64 bit memory addresses. But most of the time, and especially in this case, the largest primitive type we are dealing with is 4 bytes, so we only need to store the address of the first 4 bytes in, let's say, general purpose register rax. Are the rest of the bytes stored in cache memory on the CPU for faster access, such as when reading char b after a?

Now let's say our CPU accesses 4 byte blocks

#pragma pack(1)

struct {
    int a;   // 4 bytes
    char b;  // 1 byte
    short c; // 2 bytes
    int d;   // 4 bytes
};


our struct here is 11 bytes, so the CPU first does a read and takes a from memory. The next fetch gets char b and short c, BUT it will also take one byte of d; the rest of d is fetched after that.

so that means that it will take 2 fetches to get d: one for its first byte and one for the final 3 bytes.

would that be the reason why it's less efficient to pack structures to one byte boundaries?



so the processor reads data from memory( 64 bit system ) 64 bits at a time, so here this struct shouldn't be a problem because it's < 64 bits
Writes may still perform worse, though, since writing to any member implies reading the 64-bit block. Also, if the struct is in an array or an object it may still lie across a block boundary.
so the processor reads data from memory( 64 bit system ) 64 bits at a time,

If we're talking about memory read/writes and processors, the cache lines on most computers these days are typically 32 or 64 bytes (e.g. Core i7, Ryzen Zen 2, etc.)

Meaning if the value isn't in your cache, you need to read 64 bytes out of main memory.
that makes sense,

I came across this code on memory alignment for comparing MACs. To be honest I can't even fathom what is happening here, a lot of bit manipulation (but I would like to know).

all I know is that it's a function to compare 2 MAC addresses

here is the snippet from the article including the code



Code that causes unaligned access

The following function taken from include/linux/etherdevice.h is an optimized routine to compare two ethernet MAC addresses (48 bits, 6 bytes) for equality:
bool ether_addr_equal(const u8 *addr1, const u8 *addr2)
{
#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
    u32 fold = ((*(const u32 *)addr1) ^ (*(const u32 *)addr2)) |
               ((*(const u16 *)(addr1 + 4)) ^ (*(const u16 *)(addr2 + 4)));

    return fold == 0;
#else
    const u16 *a = (const u16 *)addr1;
    const u16 *b = (const u16 *)addr2;
    return ((a[0] ^ b[0]) | (a[1] ^ b[1]) | (a[2] ^ b[2])) == 0;
#endif
}

When CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is not set, i.e., the hardware isn't able to access memory on arbitrary boundaries, the reference to a[0] causes 2 bytes (16 bits) to be read from memory starting at address addr1. This is understood to only work normally on 16-bit-aligned addresses.



what is the preprocessor definition CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS? is this definition built into the compiler itself?

and what is this function doing? more accurately, what does CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS do?


u32 fold = ((*(const u32 *)addr1) ^ (*(const u32 *)addr2)) |
           ((*(const u16 *)(addr1 + 4)) ^ (*(const u16 *)(addr2 + 4)));

return fold == 0;


I'm guessing that in the first part of the assignment the addresses are converted to 32 bit int pointers and their values are XORed, i.e., compared. Why the need for the second check?

(*(const u16 *)(addr1 + 4)) ^ (*(const u16 *)(addr2 + 4))

we are obviously also comparing values but what are we comparing? so addr1 is being cast to a 16 bit integer but addr1 + 4, wouldn't this be 16 + 16 + 16 + 16 = 64?

why is one comparing a 32 bit value and one comparing a 64 bit value? MAC addresses are 48 bits?

but more importantly, what is #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS

and if this is not true why would we move to

const u16 *a = (const u16 *)addr1;
const u16 *b = (const u16 *)addr2;
return ((a[0] ^ b[0]) | (a[1] ^ b[1]) | (a[2] ^ b[2])) == 0;



* I'm guessing the first condition is checking whether padding is enabled in a certain way, and the else section is assuming the memory is packed? because the second section is indeed 48 bits, the size of a MAC addr

* from the article I'm reading - https://www.kernel.org/doc/Documentation/unaligned-memory-access.txt

In the above function, when the hardware has efficient unaligned access capability, there is no issue with this code. But when the hardware isn't able to access memory on arbitrary boundaries, the reference to a[0] causes 2 bytes (16 bits) to be read from memory starting at address addr1.


* I'm not sure what is meant by what's in bold.
why the need for the second check
Because MAC addresses are 6 bytes long, not 4.

so addr1 is being cast to a 16 bit integer but addr1 + 4
No, addr1 is not being cast to const u16 *. addr1 + 4 is being cast to const u16 *.

wouldn't this be 16 + 16 + 16 + 16 = 64?
I have no idea what this means.

but more importantly, what is #ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
The build system (e.g. GNU autoconf, or cmake) defines CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS if the platform doesn't incur a significant cost when performing unaligned accesses.

I'm guessing the first condition is checking if padding is enabled in a certain way? and the else section is assuming the memory is packed? because the second section is indeed 48 bits the size of a MAC addr
Both versions assume the MAC address is stored as an array of 6 bytes. One version assumes the platform can perform unaligned access quickly and takes advantage of that fact, while the other opts for a more roundabout check that (the developer reasoned) should be more efficient if the platform can't perform unaligned accesses efficiently.

I'm not sure what is meant by what's in bold.
The highlighted statement seems fairly straightforward to me. What don't you understand about it?
what does he mean by "efficient unaligned access"? or what does it take for hardware to have efficient unaligned access?

but "The build system (e.g. GNU autoconf, or cmake) defines CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS if the platform doesn't incur a significant cost when performing unaligned accesses."


that probably answers my question ^^
This function is part of the Linux kernel, so the macro is defined elsewhere in a kernel header, perhaps as part of a generated header file depending on the target architecture. Its name should be explanatory, though:
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is defined if the build is configured to assume that the target system supports efficient access to unaligned data.

One thing to understand is that the Linux kernel is written in C, compiled by GCC in particular, and without strict aliasing.

In general, for integral types, we can write ^ in place of !=, with the understanding that where a != b evaluates to exactly 1, a ^ b evaluates to nonzero. Apparently this is sometimes faster.

The function takes two byte pointers, which are required - by contract - to be aligned on a 2-byte boundary.

If the processor can (quickly) load misaligned data, then 6-byte MAC addresses can be compared by breaking each into one chunk of 4-bytes (which is potentially misaligned) and one chunk of 2 bytes (which is always aligned), and comparing the corresponding chunks.

If the processor can't quickly load misaligned data, then each MAC address must be broken into three (aligned) chunks of 2 bytes each. That requires an extra comparison.

Edit, whoops, should have refreshed before posting.
u32 fold = ((*(const u32 *)addr1) ^ (*(const u32 *)addr2)) |
           ((*(const u16 *)(addr1 + 4)) ^ (*(const u16 *)(addr2 + 4)));



wouldn't this be 16 + 16 + 16 + 16 = 64?
I have no idea what this means.



I think it's the pointer math I'm having trouble with. addr1 + 4: is that the address plus 16-bit increments, so 16 * 4 = 64?

*(const u32 *)addr1

this is clearly the first 32 bits, but when we convert it to 16 bits wouldn't we be incrementing the addresses by 16 bits 4 times?
this is clearly the first 32 bits, but when we convert it to 16 bits wouldn't we be incrementing the addresses by 16 bits 4 times?


All the pointer math (in this statement) is done on byte pointers.
addr1 + 4 is offset 4 bytes from addr1. One byte increments.
All the pointer math (in this statement) is done on byte pointers.
addr1 + 4 is offset 4 bytes from addr1. One byte increments.

that makes sense, I thought we were incrementing by 2 bytes

If the processor can't quickly load misaligned data, then each MAC address must be broken into three (aligned) chunks of 2 bytes each. That requires an extra comparison.


what would the extra comparison be? I can only see 3 from the code, maybe I'm confusing || for |?
//1:
((*(const u32 *)addr1) ^ (*(const u32 *)addr2)) |
//2:
((*(const u16 *)(addr1 + 4)) ^ (*(const u16 *)(addr2 + 4)))

//1:
(a[0] ^ b[0]) |
//2:
(a[1] ^ b[1]) |
//3:
(a[2] ^ b[2])
True, there is an extra comparison compared to the first. mbozzi already explained it, but I still don't fully understand why we need 3 comparisons in the second test as opposed to 2 in the first test.

why are we breaking them down into 16 bit chunks in the second test and in the first just 32 bits and 16 bits?

The function takes two byte pointers, which are required - by contract - to be aligned on a 2-byte boundary.


and one chunk of 2 bytes (which is always aligned)


why would they(byte pointers) be guaranteed to be aligned to a 2 byte boundary?

why are we breaking them down into 16 bit chunks in the second test and in the first just 32 bits and 16 bits?
Because the writer reasoned that (a[0] ^ b[0]) | (a[1] ^ b[1]) was faster than ((*(const u32 *)addr1) ^ (*(const u32 *)addr2)) on some architectures.

why would they(byte pointers) be guaranteed to be aligned to a 2 byte boundary?
The quote you pasted explains exactly why. It's by contract. It's the responsibility of the caller to ensure that they're aligned. If they're not, the function is free to behave incorrectly and/or suboptimally.
Why would [the arguments] be guaranteed to be aligned to a 2 byte boundary?

It's a precondition for calling the function. Here's the manual page:
https://www.kernel.org/doc/htmldocs/networking/API-ether-addr-equal.html

This precondition implies that each of the three 2-byte chunks of the MAC address is aligned (on a 2-byte boundary). However, it does not imply that any 4-byte chunk of the MAC address is aligned (on a 4-byte boundary).
[[ - - ] [ x x ]] [[ x x ] [ x x ]]
           ^

e.g., in the diagram, the MAC address doesn't begin on a 4-byte boundary.

Why are we breaking them down into 16 bit chunks in the second test and in the first just 32 bits and 16 bits?
Because presumably it's faster to do the comparison in 2 parts instead of 3. The routine is optimized for speed.
This precondition implies that each of the three 2-byte chunks of the MAC address is aligned (on a 2-byte boundary). However, it does not imply that any 4-byte chunk of the MAC address is aligned (on a 4-byte boundary).
[[ - - ] [ x x ]] [[ x x ] [ x x ]]
^
e.g., in the diagram, the MAC address doesn't begin on a 4-byte boundary.


[[ - - ] [ x x ]] [[ x x ] [ x x ]]


so I'm guessing that in the diagram the first 4 bytes are padded on a 2 byte boundary?

I may need to backtrack a little on what it means to be aligned on a 2 byte boundary,

so here is my understanding of memory alignment so far. Let's say our processor has a 32 bit word size, so it will load 32 bits at a time from memory into cache lines on the chip.

struct {
    char A;  // 8 bits
    int B;   // 32 bits
    short C; // 16 bits
};


let's say we have the normal memory alignment, whatever the processor or compiler finds most efficient to work with: A will be padded with 3 more bytes, B is fine the way it is, and C is fine the way it is,

so each member will be read from memory in one access.

but if we were to let's say have a 2 byte boundary,

A would be 2 bytes, right? B would obviously still be 4 bytes, and lastly C would still be 2 bytes,

so we would access the first 4 bytes, which would take the 2 bytes of char A (1 byte padding) and the first 2 bytes of B. The next access would grab the final 2 bytes of B and the 2 bytes of short C.

we have fewer accesses but more overhead, I'm guessing, because the processor or OS would have to reassemble the integer B, which causes more machine instructions, hence slower (although we have saved space in memory).

would my understanding above be correct?

so if it is, going back to the diagram

[[ - - ] [ x x ]] [[ x x ] [ x x ]]

wouldn't this be a 4 byte boundary? as each chunk contains 4 bytes?

and also in the link (https://www.kernel.org/doc/htmldocs/networking/API-ether-addr-equal.html) it says "Please note: addr1 & addr2 must both be aligned to u16."

If the processor can (quickly) load misaligned data, then 6-byte MAC addresses can be compared by breaking each into one chunk of 4-bytes (which is potentially misaligned) and one chunk of 2 bytes (which is always aligned), and comparing the corresponding chunks.


how could the first chunk of 4 bytes be potentially misaligned and why would the final 2 bytes be aligned?


thanks for sticking with me, I think I'm on the cusp of understanding this.
Look, you're overthinking it. I'll reiterate: "alignment" means that the address of an object is a multiple of an integer, usually a power of 2. In the case of the above function, "aligned to 16 bits" means that this is true: (uintptr_t)addr1 % 2 == 0. Nothing more.

how could the first chunk of 4 bytes be potentially misaligned and why would the final 2 bytes be aligned?
Because it's the function that requires alignment to 16 bits. The CPU may require or prefer more strict alignments, such as to 32 or 64 bits.

Aligned to 32 bits:
DEADBEE0 80
DEADBEE1 00
DEADBEE2 10
DEADBEE3 57

DEADBEE4 F5
DEADBEE5 A0
DEADBEE6 ??
DEADBEE7 ??

Aligned to 16 bits:
DEADBEE0 ??
DEADBEE1 ??
DEADBEE2 80
DEADBEE3 00

DEADBEE4 10
DEADBEE5 57
DEADBEE6 F5
DEADBEE7 A0

In the first case, the CPU can read the first 32 bits of the MAC address in one operation, but in the second case the first 32 bits cross a 32-bit boundary, so the CPU needs to perform two reads, and then it may need yet another read to get the last 16 bits. In such an architecture, it may be quicker to compare the addresses in 16-bit chunks, even if it takes more bitwise operations.