Best to disregard what I said in my first post. I didn't understand `#pragma omp parallel`. Sorry.
> I don't quite understand why I can't use a single shared `result` variable in the loop rather than the array.
Now that I know a little better: it's because this line of code is non-atomic, meaning it can be interrupted partway through its execution.

```c++
result += 4 / (1 + x_start * x_start) * dx;
```
To illustrate why this is a problem, imagine running this snippet of code:
```c++
int x = 0;
#pragma omp parallel num_threads(2)
{
    x = x + 1; // data race: two threads read-modify-write x concurrently
}
```
A possible flow of execution looks like this:
1. Thread A reads `x` and gets 0 (but does not write back to `x` yet).
2. Thread B preempts thread A.
3. Thread B reads `x` and gets 0.
4. Thread B writes 0 + 1 = 1 into `x`.
5. Thread A resumes.
6. Thread A has already read `x` as 0, so it computes 0 + 1.
7. Thread A writes 1 into `x`, silently overwriting thread B's update.
At the end of the process `x` contains 1, even though two increments ran. This is a symptom of a data race.
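For what it's worth, here is one way the original loop could avoid the race. This is a minimal sketch, assuming the loop integrates 4/(1+x²) over [0,1] with the midpoint rule; the loop bounds and the names other than `result` and `dx` are my guesses, not the original code:

```c++
#include <cstdio>

int main() {
    const int n = 1000000;       // number of slices (assumed)
    const double dx = 1.0 / n;   // slice width (assumed)
    double result = 0.0;

    // reduction(+:result) gives every thread its own private copy of
    // result and adds the copies together once at the end, so no two
    // threads ever write the shared variable at the same time.
    #pragma omp parallel for reduction(+:result)
    for (int i = 0; i < n; ++i) {
        double x = (i + 0.5) * dx;           // midpoint of the i-th slice
        result += 4.0 / (1.0 + x * x) * dx;  // integrate 4/(1+x^2) over [0,1]
    }

    std::printf("pi is approximately %.10f\n", result);
    return 0;
}
```

Putting `#pragma omp atomic` before the update would also remove the race, but the reduction is usually faster because the threads don't have to synchronize on every single iteration.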
> `x_start` should only ever be modified by its corresponding rank, no?
Yes. The race condition only affects `result`, not `x_start`.
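To see why, here's a hypothetical sketch of how a per-rank `x_start` might be set up; everything besides the name `x_start` is my assumption, not the original code:

```c++
#include <omp.h>

int main() {
    // Because x_start is declared inside the parallel region, every
    // thread gets its own private copy, so no rank can ever modify
    // another rank's x_start.
    #pragma omp parallel
    {
        int rank = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        double x_start = rank * (1.0 / nthreads); // this rank's sub-interval start
        (void)x_start; // ... this thread's share of the integral would go here ...
    }
    return 0;
}
```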
> What's the difference between "normal" C++ threading and OpenMP? I thought one needed OpenMP for multithreading, just like one would use MPI for multiprocessing.
Threads are an operating system feature, and the OS provides the interfaces to create and control them. Both the C++ standard library and OpenMP wrap these system-specific interfaces so you can use them from portable code.
Relative to the C++ standard library, OpenMP is a "high-level" API: it hides details such as thread creation, joining, and dividing up the work, and does a lot in the background. It may also be that OpenMP gives up some fine-grained control over the system's resources in exchange for that convenience, but I don't have enough info to say for sure.
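To make the contrast concrete, here's the earlier two-thread increment written against the C++ standard library. Note how much is explicit (creating the threads, joining them, and choosing `std::atomic` as the synchronization mechanism) compared to the one-pragma OpenMP version:

```c++
#include <atomic>
#include <cstdio>
#include <thread>

int main() {
    std::atomic<int> x{0};

    // Each thread performs an atomic read-modify-write, so the two
    // increments cannot interleave the way they did in the race above.
    auto work = [&x] { x.fetch_add(1); };

    // Thread creation, joining, and synchronization are all explicit.
    std::thread a(work);
    std::thread b(work);
    a.join();
    b.join();

    std::printf("x = %d\n", x.load()); // always prints x = 2
    return 0;
}
```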