I wrote my own TCP/IP send/receive benchmark utility, and its results are consistent with iperf. One reason I wrote my own was to better understand why localhost performance is so much slower than RAM (especially given that localhost isn't even supposed to touch the NIC -- you can disable ALL your network adapters and still ping and send TCP over 127.0.0.1). I've assumed localhost is implemented by the host OS (Windows 10 Pro 64-bit in this case) and NOT by the NIC drivers or motherboard drivers. I noticed that localhost has an MTU of 4GB, but otherwise I'm not sure what size send/receive buffers are allocated to localhost.
I am seeing localhost performance of about 4000-8000 Mbps (or about 600-900 MBps) on i7/DDR3-based hardware (all less than 2 years old), while memtest86+ shows RAM speeds much faster than this (say 20k-40k Mbps). I assume the difference is due to TCP/IP packaging overhead (though the localhost MTU is 4GB, I assume the host OS still has to partition data into packets). Note that I am using boost 1.60 for the io service library, built with VS2015 Community (boost itself was also compiled with my VS2015).
My main question is this: I have tested this benchmark on 12 different systems (various laptops and desktops). One particular system consistently gets HALF the localhost network I/O performance of my other systems. I HAVE tested some older/slower 2007-2009 era DDR2-based systems, and CPU/memory performance does indeed affect the benchmark throughput (as expected!). But my question stands, because the unexpected performance is on a fairly modern system. So I suspect this is more of a motherboard/chipset architecture question. Here are the details:
SYSTEM A: (named OANH, average result is 880/940 MBps send/recv performance)
OS Win10 64-bit, 16GB DDR3, i7-4770/3.4GHz CPU, ASUS B85M-E/CSM mainboard
mainboard link: http://www.asus.com/Motherboards/B85ME/
These results are with both 125MB and 1024MB/1GB payloads. And again, I got similar results with iperf3 64-bit. I have several other i5/i7 DDR3-based systems that get performance similar to SYSTEM A/OANH. So that's my question -- memtest shows SYSTEM B/BLACKJACK has DDR3 main memory performance similar to all the other DDR3 machines, and they all run the same OS (Win10 64-bit). Architecturally, what could be causing localhost/127.0.0.1 traffic to be so much slower (almost half) on SYSTEM B? Does sending across localhost (under Win10) involve the north or south bridge, or is any part of the bus involved?
While this is not a specific C++ question, eventually I would like to share my C++/boost implementation of a network benchmark. But I'm hoping to come across someone with insight on localhost implementation under Windows. I suppose in addition I should try to run this under Linux. But again, all my other Win10 64-bit systems are getting over 600MBps send/receive performance -- it's just this one machine that is getting the 300MBps half-performance, and I haven't yet really come up with a rationale on why.
I use 131072000 since that is 125MB, which across an actual LAN Ethernet connection should take only 1 second to transfer (with gigabit components, as 1000Mbps = 125MBps). Across localhost, you'd expect the 125MB to transfer much faster than 1 second (so at 300MBps, that's pretty fast -- but with same OS and similar h/w, I'd expect the 600-900MBps performance that the other configurations are getting). I can change -n to 1024*1024*1024 (1GB) and the MBps throughput should be the same (which it is).
edit: also, to clarify, the averages are taken over 1000 iterations/runs, with Windows configured the same across the machines (e.g. Windows Search disabled, all applications closed, etc) -- to the extent possible, since there are variations in video drivers and motherboard drivers.
edit: here are iperf3 results and my ANT benchmark results for comparison (on SYSTEM B). Note I match the ~320MBps result, though my benchmark also shows main memory performance of ~3200MBps (which is one of my main questions: why is localhost TCP/IP performance so much slower than main memory performance?)