I am currently taking a university course that asks us to improve on a non-multithreaded solution. The program, when presented with a list of numbers, counts those that are prime. Here is the original code:
/// counts number of primes from standard input
///
/// compile with:
/// $ $ gcc findPrimes.c -O3 -o count -lm
/// ./count < <textfile with different integers>
///
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
/// primality test, if n is prime, return 1, else return 0
int isPrime(int64_t n)
{
if( n <= 1) return 0; // small numbers are not primes
if( n <= 3) return 1; // 2 and 3 are prime
if( n % 2 == 0 || n % 3 == 0) return 0; // multiples of 2 and 3 are not primes
int64_t i = 5;
int64_t max = sqrt(n);
while( i <= max) {
if (n % i == 0 || n % (i+2) == 0) return 0;
i += 6;
}
return 1;
}
int main( int argc, char ** argv)
{
/// parse command line arguments
int nThreads = 1;
if( argc != 1 && argc != 2) {
printf("Uasge: countPrimes [nThreads]\n");
exit(-1);
}
if( argc == 2) nThreads = atoi( argv[1]);
/// handle invalid arguments
if( nThreads < 1 || nThreads > 256) {
printf("Bad arguments. 1 <= nThreads <= 256!\n");
}
if( nThreads != 1) {
printf("I am not multithreaded yet :-(\n");
exit(-1);
}
/// count the primes
printf("Counting primes using %d thread%s.\n",
nThreads, nThreads == 1 ? "s" : "");
int64_t count = 0;
while( 1) {
int64_t num;
if( 1 != scanf("%ld", & num)) break;
if( isPrime(num)) count ++;
}
/// report results
printf("Found %ld primes.\n", count);
return 0;
}
I am tasked with changing this so that it takes a command line argument that dictates the maximum amount of threads that can be used at any given point. My thoughts were to create the maximum possible number of threads for each number in the list, then as each thread finishes, let it process the next number. I am unsure why I am sometimes getting different answers. I believe I have properly locked the critical section in the code. Unfortunately, I am also not seeing any speedup. Some pointers in the right direction would really help.
generate 50 million random numbers b/w 1 and 10,000 and count how many are prime:
doing it in one thread uses at least 2.5X longer CPU time compared to doing the same task split across 50 threads, and that is AFTER taking into a/c the time spent constructing the threads
its "usually" faster if you can make totally independent threads so you don't need any mutex at all (each thread works a piece of the problem ignorant of the other threads). Its "usually" faster to have only a few threads per core -- the more threads you pile onto each cpu core, the more time you squander doing context switching instead of computation. Esp for windows but for busy unix machines also its "usually" better to leave one core free(or doing lighter duty) for the OS to use.
There are some problems where violating one or more of the above works better, so they are only rules of thumb. Some threaded problems simply require the mutex and working together approach due to shared memory and using results from each other. But all a mutex does is block a thread from doing work, which is defeating the purpose of the thread for some portion of your run-time and worse, mutex are objects that have to be managed. Threads cost memory and more also, so more of them is overhead.
Somewhere there is a sweet spot for the given hardware. It might be 50, it might be 10.
If this is just a thread-study, you can stop there. There are a couple dozen tricks you can play to make the code better in practice for the specific problem that have nothing to do with threads.