#include <iostream>
#include <vector>
#include <mpi.h>
using namespace std;

double my_dotproduct( const vector<double> &x, const vector<double> &y );

int main( int argc, char *argv[] )
{
    int num_procs, myrank;
    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &num_procs );
    MPI_Comm_rank( MPI_COMM_WORLD, &myrank );
    bool root = ( myrank == 0 );

    const int N = 14;
    vector<double> a, b;
    if ( root )                 // Only the root processor knows the contents of a and b ...
    {
        a.resize( N ); b.resize( N );
        for ( int i = 0; i < N; i++ )
        {
            a[i] = i;
            b[i] = 10;
        }
    }

    double result = my_dotproduct( a, b );   // ... but all processors call the routine and receive the result
    if ( root ) cout << "Result is " << result << '\n';

    MPI_Finalize();
}
//=======================================================================
double my_dotproduct( const vector<double> &x, const vector<double> &y )
{
    int myrank, num_procs;
    MPI_Comm_size( MPI_COMM_WORLD, &num_procs );
    MPI_Comm_rank( MPI_COMM_WORLD, &myrank );

    int n = x.size();
    MPI_Bcast( &n, 1, MPI_INT, 0, MPI_COMM_WORLD );   // only the root processor knows the correct size

    int low      = n / num_procs;                     // minimum number of data points for each processor
    int leftOver = n % num_procs;                     // extra points assigned to the first few processors
    int local_n  = low + ( myrank < leftOver );

    vector<int> scounts( num_procs );
    vector<int> displs( num_procs, 0 );
    for ( int i = 0; i < num_procs; i++ ) scounts[i] = low + ( i < leftOver );
    for ( int i = 1; i < num_procs; i++ ) displs[i] = displs[i-1] + scounts[i-1];

    // Scatter data to all processors (unequal distribution, so MPI_Scatterv)
    vector<double> local_x( local_n ), local_y( local_n );
    MPI_Scatterv( x.data(), scounts.data(), displs.data(), MPI_DOUBLE,
                  local_x.data(), local_n, MPI_DOUBLE, 0, MPI_COMM_WORLD );
    MPI_Scatterv( y.data(), scounts.data(), displs.data(), MPI_DOUBLE,
                  local_y.data(), local_n, MPI_DOUBLE, 0, MPI_COMM_WORLD );

    // Calculate the LOCAL dot product on each processor
    double local_sum = 0.0;
    for ( int i = 0; i < local_n; i++ ) local_sum += local_x[i] * local_y[i];

    // Reduction operation; the result is broadcast to all processors using MPI_Allreduce
    double result;
    MPI_Allreduce( &local_sum, &result, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD );
    return result;
}
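
The block-partitioning arithmetic inside my_dotproduct (low, leftOver, scounts, displs) can be checked without launching MPI at all. Below is a minimal, MPI-free sketch of that same calculation; the process counts it loops over are arbitrary values chosen for illustration.

#include <iostream>
#include <vector>
using namespace std;

int main()
{
    const int n = 14;                     // same N as the driver above
    int proc_counts[] = { 1, 2, 3, 4 };   // hypothetical numbers of processors
    for ( int num_procs : proc_counts )
    {
        int low      = n / num_procs;     // minimum block size per processor
        int leftOver = n % num_procs;     // extra points go to the first few processors
        vector<int> scounts( num_procs ), displs( num_procs, 0 );
        for ( int i = 0; i < num_procs; i++ ) scounts[i] = low + ( i < leftOver );
        for ( int i = 1; i < num_procs; i++ ) displs[i] = displs[i-1] + scounts[i-1];

        cout << "num_procs = " << num_procs << ": counts =";
        for ( int c : scounts ) cout << ' ' << c;
        cout << "  displs =";
        for ( int d : displs ) cout << ' ' << d;
        cout << '\n';
    }
}

For num_procs = 4 this prints counts 4 4 3 3 with displacements 0 4 8 11, which sum back to N = 14. The full MPI program itself should report "Result is 910" on the root processor, since the dot product is 10 * (0 + 1 + ... + 13) = 910.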