Performance of memcpy() vs. "direct" memory access

Hello,

I am working on a project where we need to compute the MD5 hash of a large block of memory (e.g. several GB of data in RAM) on a regular basis to make sure that the data hasn't changed. This is to protect against "random" memory errors, so MD5 should be fine as a "checksum" here; probably a little better than CRC64.

Anyway, the "naive" approach would be to just call md5_update() once, with the base address and the size of the whole memory block. However, I found this to be rather slow. After a lot of testing, I figured out that "loading" small chunks (e.g. 8 bytes) of the big memory block into a local variable via the memcpy() function, and then passing them to md5_update() chunk-by-chunk, is about 2.5 times faster!

Of course I could just be happy with the speed-up. But I would like to understand why memcpy() is faster here, even though it has the "overhead" of copying all the data as well as the "overhead" of many more small-size md5_update() invocations. I thought this might be related to memcpy() doing some "smart" prefetching. But manually passing small chunks of the memory and calling _mm_prefetch() on each chunk before passing it to md5_update() did not make things faster at all. It appears an explicit memcpy() is needed to get the speed-up.

"Naive" (slow) version:
static void test(const uint8_t *const data, const size_t size, uint8_t *const digest)
{
	md5_ctx ctx;
	md5_init(&ctx);
	md5_update(&ctx, data, size);
	md5_final(&ctx, digest);
}


"Fast" version with memcpy():
static void test2(const uint8_t *const data, const size_t size, uint8_t* const digest)
{
	const uint8_t* p;
	uint64_t temp;
	md5_ctx ctx;
	const uint8_t* const end = data + size;
	md5_init(&ctx);
	/* note: assumes size is a multiple of sizeof(uint64_t) */
	for (p = data; p < end; p += sizeof(uint64_t))
	{
		memcpy(&temp, p, sizeof(uint64_t));
		md5_update(&ctx, (const uint8_t*)&temp, sizeof(uint64_t));
	}
	md5_final(&ctx, digest);
}


Explicit pre-fetching, not faster at all:
static void test3(const uint8_t *const data, const size_t size, uint8_t* const digest)
{
	const uint8_t* p;
	md5_ctx ctx;
	const uint8_t* const end = data + size;
	md5_init(&ctx);
	for (p = data; p < end; p += sizeof(uint64_t))
	{
		_mm_prefetch(p, _MM_HINT_T0);
		md5_update(&ctx, p, sizeof(uint64_t));
	}
	md5_final(&ctx, digest);
}


For the full example code, see here:
https://www.mediafire.com/file/4kc5oydui3kygtd/memcpy-test-v2.zip/file

I have posted this under "Windows Programming" because this is a Visual Studio 2019 project, currently running on Windows 11. But I have also tested on Windows 8.1 with similar result.
You'd have to dig into the assembler of memcpy() to find out if there were any pre-fetch or caching hints.

Then put those same hints into md5_update().
Or at least into your for loop calling md5_update() on hint-able sized blocks of memory.
You'd have to dig into the assembler of memcpy() to find out if there were any pre-fetch or caching hints.

How do I do this? The program is written in C, not assembler.

I know there are disassemblers, like IDA Pro, that can convert the EXE into assembler code. But I don't have a license for that. And I wouldn't really know what to look for, because I am not an expert in assembler code...
OP's program contains this function to allocate memory:

uint8_t* allocate_memory(const size_t size)
{
  if (SetProcessWorkingSetSize(GetCurrentProcess(), allocated + size, allocated + size))
  {
    const PVOID memory = VirtualAllocEx(GetCurrentProcess(), NULL, size, 0x3000, 0x204);
    if (memory)
    {
      if (VirtualLock(memory, size))
      {
        allocated += size;
        return (uint8_t*) memory;
      }
      else
      {
        VirtualFree(memory, 0U, MEM_RELEASE);
      }

    }
  }
  return NULL;
}


After decomposing the magic numbers in the call to VirtualAllocEx, we get:
VirtualAllocEx(GetCurrentProcess(), NULL, size, MEM_COMMIT | MEM_RESERVE, PAGE_NOCACHE | PAGE_READWRITE);
The key here is PAGE_NOCACHE:
https://docs.microsoft.com/en-us/windows/win32/memory/memory-protection-constants

Within the "slow" function, md5_update accesses a large buffer (obtained via the function above) many times, reading four bytes each time.
Within the "fast" function, md5_update accesses an eight-byte buffer that has automatic storage duration and is certainly in a speedy cache. Therefore relatively little time is spent inside md5_update waiting for the memory subsystem. Instead, it is memcpy which accesses the large buffer.

memcpy certainly copies eight bytes at once as opposed to md5_update; it makes fewer, wider memory accesses than would be required by the "slow" function. This explains the factor of 2.

Note that memcpy is usually a fairly adaptive function whose internal operation and performance characteristics change with the size of the memory block being copied. Therefore, try slightly larger blocks (16 or 32 bytes) and also much bigger ones (2048 or 4096 bytes). Make sure not to overrun the source buffer during testing.

memcpy certainly copies eight bytes at once as opposed to md5_update; it makes fewer, wider memory accesses than would be required by the "slow" function. This explains the factor of 2.


Thanks for the explanation. That makes a lot of sense.

Therefore, try slightly larger blocks (16 or 32 bytes) and also much bigger ones (2048 or 4096 bytes). Make sure not to overrun the source buffer during testing.


Indeed, using a 16-byte "local" buffer gives a 4.7x speed-up, compared to the 2.6x speed-up with the 8-byte buffer.
I don't know if this can be used to copy 16 bytes per cycle or not, as I have only scratched the surface of 64-bit asm (unfortunately, the compiler I use won't allow inline asm in 64-bit programs, one of the biggest drawbacks of Visual Studio):

With the Streaming SIMD Extensions (SSE), Intel added a completely new set of registers named XMM, which are twice as long (128 bits), so we can operate on 16 bytes at once. Presumably these have load and store instructions, so in theory you should be able to do that.
Okay, I now found how to enable assembler output in MSVC.

Again, I'm no expert in assembler, but it looks to me like memcpy() is translated into a single movups instruction:
; 325  : 		{
; 326  : 			memcpy(&temp, addr, 16U);

	movups	xmm0, XMMWORD PTR [r14]


https://www.felixcloutier.com/x86/movups

That should be SSE-optimized already, by the C compiler, loading the 16 bytes in a single instruction, right? At least this would explain why a 16-byte "local" buffer is even faster than 8 bytes!
That sounds right. These 'new' instructions are... really nice.