Hi. I have a C++ program that reads a large amount of data from an input txt file, does some calculations on the data, and writes the result to another txt file.
I have about 300 input files, and the calculation time for each input file is pretty long (~4 days on a single CPU), so I would like to run the same code on multiple CPUs for different inputs.
Can you please tell me which is the most appropriate strategy in this case: multithreading, MPI, or something else?
Thank you.
Just so we're clear, by CPU you mean an actual processor core, right? I've noticed some people using it as a misnomer for full PC systems.
Usually it's best to let the system scheduler decide how to use its resources. If you let us know what OS you are running, I'm sure someone here can tell you how to optimize it better. If you have your mind set on micromanaging this, then we would need to know the platform you are working with, since this is (potentially) an OS-specific operation.
300 independent inputs, roughly the same size?
CPU-bound, i.e. running N copies of the program simultaneously on a machine that has N cores does not exhaust the RAM?
Files can be copied to a fast volume, i.e. simultaneous I/O is not a bottleneck?
Trivial parallelism. Start one instance for each available core, each with a different input. Start more jobs as previous ones complete, until all 300 have run. A job scheduling/queuing system can handle the latter part automatically, even on "clusters" that have thousands of computers.
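As a very rough sketch of that trivial-parallelism driver in C++ (assuming your existing program can be invoked as ./calc <input> <output>; the executable name, file names, and count here are placeholders, not your actual setup):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

int main() {
    // Placeholder list of the ~300 input files.
    std::vector<std::string> inputs;
    for (int i = 0; i < 300; ++i)
        inputs.push_back("input_" + std::to_string(i) + ".txt");

    const unsigned n_workers =
        std::max(1u, std::thread::hardware_concurrency());  // one job per core
    std::mutex m;
    std::size_t next = 0;

    auto worker = [&] {
        for (;;) {
            std::size_t job;
            {
                std::lock_guard<std::mutex> lock(m);
                if (next == inputs.size()) return;  // no jobs left, worker exits
                job = next++;
            }
            // Launch one instance of the unmodified program for this input.
            // std::system blocks until that instance finishes, so each worker
            // runs its jobs one after another.
            std::string cmd = "./calc " + inputs[job] + " " + inputs[job] + ".out";
            std::system(cmd.c_str());
        }
    };

    std::vector<std::thread> pool;
    for (unsigned i = 0; i < n_workers; ++i) pool.emplace_back(worker);
    for (auto& t : pool) t.join();
}
```

On a real cluster a batch scheduler (e.g. Slurm job arrays) or a tool like GNU parallel does the same bookkeeping for you without writing any code at all.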
Vectorization makes a process use one core more effectively (if the core supports it, i.e. SSE/AVX instructions). The compiler can autovectorize some code. Have you looked into this?
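To illustrate what the autovectorizer looks for, here is a toy loop (the function and data are made up, not taken from your program) that compilers will typically vectorize when optimization is enabled, e.g. -O3; with GCC the -fopt-info-vec option reports which loops it managed to vectorize:

```cpp
#include <cstddef>
#include <vector>

// Toy example: a contiguous, branch-free loop with no loop-carried
// dependency is the pattern autovectorizers (SSE/AVX) handle best.
void scale_and_add(std::vector<double>& y,
                   const std::vector<double>& x, double a) {
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] += a * x[i];
}
```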
Threading, MPI, GPGPU: well, if you have the hardware, your algorithm can clearly benefit from these techniques, and you are willing to modify the code, then maybe.
I would start with the trivial approach, though. If you lack local hardware and cannot afford to wait, you could rent enough cores from a cloud provider for a short period.
Most operating systems won't dedicate entire cores to specific processes on their own, and unless you fuss around with the priorities, it's going to build up some overhead over time switching between the individual tasks. In Windows, the second factor in determining which process gets the most attention is the number of threads that process has loaded into its image (the first is, of course, the priority). So on that platform, starting separate instances of the program would actually be counterproductive to your goal. The trick here is going to be managing the individual threads' stack sizes, since with this much data I'd imagine that balancing page faulting against disk caching is going to become a big issue.
@OP: Is this maybe a job for some kind of ASIC-type device?