To speed up a piece of code, I'd like to try software prefetching (I don't want to use OpenMP at the moment). I'm using the MSVS 2010 C++ compiler. However, the code snippet below is slower than its non-prefetched version.
There are better forums for this question. _mm_prefetch() isn't a standard C++ function; it's an x86-specific intrinsic. Intel has forums dedicated to this sort of question.
That said, there is no guarantee that prefetching will help. LWN had a great series of articles on optimizing memory access: http://lwn.net/Articles/255364/
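For what it's worth, here is a rough sketch of the kind of loop where a software prefetch at least has a chance of paying off: an indirect (gather) access that the hardware prefetcher can't easily predict. On a plain sequential walk the hardware usually prefetches for you, and the extra instructions can make the loop slower. The names gather_sum, PREFETCH_T0 and the look-ahead distance of 16 are my own assumptions for illustration, not taken from your code; the distance in particular has to be tuned by measurement.

```cpp
#include <cassert>
#include <cstddef>

// Use the x86 intrinsic where it is available; otherwise compile to a no-op.
#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
  #include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0
  #define PREFETCH_T0(addr) \
      _mm_prefetch(reinterpret_cast<const char*>(addr), _MM_HINT_T0)
#else
  #define PREFETCH_T0(addr) ((void)0)
#endif

// Sums data[idx[i]] for i in [0, n).
// Every value in idx must be a valid index into data.
long long gather_sum(const int* data, const std::size_t* idx, std::size_t n)
{
    assert(n == 0 || (data != nullptr && idx != nullptr));

    // Assumed look-ahead distance (hypothetical value, tune by profiling).
    const std::size_t kDistance = 16;

    long long sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // Request the element needed kDistance iterations from now.
        // The bounds check keeps the read of idx[] inside the array.
        if (i + kDistance < n) {
            PREFETCH_T0(&data[idx[i + kDistance]]);
        }
        sum += data[idx[i]];
    }
    return sum;
}
```

The prefetch is only a hint and never changes the result, so you can time this against the same loop with the PREFETCH_T0 line removed and keep whichever is faster on your machine.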