First of all, I apologize if the question is not purely about programming.
Let's say I'm working with 2 MPI processes and I need to do a matrix-vector multiplication. The matrix is 4*4, stored in CSR format, and each process has part of it. Say each process gets a 3*3 submatrix (also stored in CSR format), but the split is neither row-wise nor column-wise. Before this, I replicated the whole vector on all processes, did the local submatrix-vector multiplication, used MPI_Reduce or MPI_Allreduce, and got the right result. Now each process is doing a matrix[3][3]*vector[3] operation. It's fine up to here, but when I collect the result I get the wrong answer.
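For reference, here is a minimal sketch of the version that worked, assuming the local CSR piece keeps global row/column indices and the whole vector x is replicated (all names are illustrative, not my actual code):

```c
#include <mpi.h>
#include <stdlib.h>

/* Every rank multiplies its local CSR piece into a full-length buffer,
 * then the buffers are summed across ranks. */
void partial_spmv(int n_global, int nrows,
                  const int *row_ptr, const int *col_idx, const double *val,
                  const int *row_gid,   /* global id of each local row */
                  const double *x,      /* whole vector, replicated    */
                  double *y)            /* length n_global, the sum    */
{
    double *y_part = calloc(n_global, sizeof *y_part);
    for (int i = 0; i < nrows; ++i)
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            y_part[row_gid[i]] += val[k] * x[col_idx[k]];
    /* Sum the overlapping contributions from all ranks. */
    MPI_Allreduce(y_part, y, n_global, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    free(y_part);
}
```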
No, this is a simple example, just to make the point clear. I'm actually working on a system of equations of order 1e6.
If you've worked with finite elements before, it may help to say that this comes from an FEM stiffness matrix. Imagine you have one million elements and you divide the elements between processors, so each processor has part of the stiffness matrix. Then in some places I need to do a matrix-vector multiplication (say, to get the residual in a CG solver). The matrix is already distributed between processors, but instead of replicating the whole vector, for efficiency I decided to assign each processor just those values of the vector it actually needs; the rest of the vector is not needed because it would only multiply zeros, since the stiffness matrix is distributed. That's why I made this simple example: a 4*4 matrix from 2 elements, with connectivity element 0: 0,1,2 and connectivity element 1: 0,2,3. Processor 0 receives the connectivity of element 0, and processor 1 that of element 1.
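In other words, the connectivity gives each rank a local-to-global node map, something like this (hypothetical layout; the mesh code would build these):

```c
/* Local-to-global maps implied by the connectivity above. */
int l2g_rank0[3] = { 0, 1, 2 };   /* element 0 on processor 0 */
int l2g_rank1[3] = { 0, 2, 3 };   /* element 1 on processor 1 */
/* Each rank then only needs the entries x_local[i] = x_global[l2g[i]]. */
```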
Matrix:
3 2 4 3
2 1 3 4
3 4 2 6
3 5 1 4
Vector:
2
1
3
4
Result:
32
30
40
30
Running with one processor, you get the result above.
Now each processor has part of the equations.
Proc 0 contains x0, x1 and x2, and its coefficient matrix is
3/2 2 4/2
2 1 3
3/2 4 2/2
And proc 1 contains x0, x2 and x3, and its coefficient matrix is
3/2 4/2 3
3/2 2/2 6
3/2 1/2 4
Now my question is: how do I do this sub-operation in a way that I still get the correct result?
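One way this kind of split can be made to add up, assuming the local matrices really are element contributions that sum (after scattering) to the global matrix, is to multiply locally, scatter-add the local product into a global-length buffer through the local-to-global map, and let MPI_Allreduce do the summation. A dense sketch with made-up names:

```c
#include <mpi.h>
#include <stdlib.h>

/* y_global = sum over ranks of P^T * (A_local * (P * x)), where the
 * restriction P is encoded by the local-to-global map l2g. */
void element_spmv(int n_global, int n_local,
                  const double *A_local,  /* n_local x n_local, row-major */
                  const int *l2g,         /* local-to-global node map     */
                  const double *x_local,  /* x restricted to local nodes  */
                  double *y_global)       /* length n_global, the sum     */
{
    double *y_part = calloc(n_global, sizeof *y_part);
    for (int i = 0; i < n_local; ++i) {
        double s = 0.0;
        for (int j = 0; j < n_local; ++j)
            s += A_local[i * n_local + j] * x_local[j];
        y_part[l2g[i]] += s;   /* scatter-add into the global slot */
    }
    MPI_Allreduce(y_part, y_global, n_global, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);
    free(y_part);
}
```

Note this reproduces the serial product only if every nonzero of the global matrix comes from some element containing both of its nodes; a coupling like the one between x1 and x3 above belongs to no single element, so no split of this kind can recover it.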
I suspect we're going to have to see some code. If it works on 1 thread and not split out, it's probably a code problem -- the answer may be a mutex somewhere, or factoring some of it out to do before the threading starts, or something like that. It could be as simple as overwriting the same result cell.
@lastchance Sorry for the typo. You're right. Is there any way to split a big system of linear equations into different sub-systems, solve them separately, and then somehow collect the result? I'm not talking about giving a bunch of equations to each processor; in my case the equations themselves change.
Yes, you can split a matrix multiply.
At the very least you can hand each thread one row and one column to crunch. For very large matrices that may be enough, but that much thread spam creates overhead.
I would probably hand one row and the whole other matrix to each thread for a good-sized multiply. You still want to zero-optimize, so that the loop immediately skips a zero and moves to the next value; that way 0*x + 0*y + 0*... skips an entire loop's worth of work.
You can write directly into the target cells, so that each thread needs to know which part of the result it is working on, or you can stuff the results into a vector and copy them over outside the thread -- slightly less efficient, but in terms of total work done the difference is small.
I highly advise you to transpose the second matrix first; then the threads iterate across it in rows, which is a good thing.
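Something like this sketch, using OpenMP for the threading (names and layout are illustrative):

```c
#include <omp.h>
#include <stdlib.h>

/* C = A * B for n x n row-major matrices: transpose B once, give each
 * thread whole rows of the result, and skip zero entries of A. */
void matmul_rowsplit(int n, const double *A, const double *B, double *C)
{
    double *Bt = malloc((size_t)n * n * sizeof *Bt);
    for (int i = 0; i < n; ++i)            /* transpose up front */
        for (int j = 0; j < n; ++j)
            Bt[j * n + i] = B[i * n + j];

    #pragma omp parallel for               /* one row of C per iteration */
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double s = 0.0;
            for (int k = 0; k < n; ++k) {
                double a = A[i * n + k];
                if (a != 0.0)              /* the zero-skip described above */
                    s += a * Bt[j * n + k];
            }
            C[i * n + j] = s;   /* each thread owns its own rows of C */
        }
    free(Bt);
}
```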
I can't do a row-wise or column-wise split; in that case I would know how to get it done. The problem is that I don't have any control over the splitting. Say there is a system of linear equations; in my case the system of equations changes from processor to processor.
Let's say this is the original system above. When the coefficients are distributed between processors, the equations change: for example, the first equation is no longer 3x0+2x1+4x2+3x3; it becomes something like ax0+bx1+cx2=something, and similarly for the others, and as you see there is no x3 in the first equation any more. The whole point is that I have a big linear system, and instead of solving it all at once I want to split it across processors, have each processor solve its own system, and collect the results at the end. The problem is that I'm effectively removing some unknowns from some equations without imposing any condition for that, and I don't know how to handle it.
Problem 1: Your matrix doesn't look anything like the global stiffness matrix in FEM, which is probably extremely sparse.
Problem 2: The vector has size N; the matrix has size NxN, which is much bigger. If it's simply a matter of doing a matrix-vector multiply, then let ALL processors know the WHOLE vector, but split the matrix row-wise. The first processor then ends up with the upper half of the solution and the second processor with the bottom half. MPI_Gatherv would then recover the whole.
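A minimal sketch of that row-block scheme, assuming contiguous row blocks and illustrative names:

```c
#include <mpi.h>
#include <stdlib.h>

/* Every rank knows the whole vector x, owns an nrows x n block of rows,
 * computes its slice of the result, and MPI_Gatherv reassembles the
 * full result on rank 0.  counts[r]/displs[r] give each rank's row
 * count and first-row offset. */
void rowblock_matvec(int n, int nrows,
                     const double *A_rows,   /* my block of rows          */
                     const double *x,        /* full vector on every rank */
                     double *y,              /* full result on rank 0     */
                     const int *counts, const int *displs)
{
    double *y_mine = malloc(nrows * sizeof *y_mine);
    for (int i = 0; i < nrows; ++i) {
        double s = 0.0;
        for (int j = 0; j < n; ++j)
            s += A_rows[i * n + j] * x[j];
        y_mine[i] = s;
    }
    MPI_Gatherv(y_mine, nrows, MPI_DOUBLE,
                y, counts, displs, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    free(y_mine);
}
```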
@lastchance This is just a simple case: say two elements and 4 nodes. Running with two processors, each processor receives one element containing 3 nodes, so the local stiffness matrix will be 3*3. Yes, the stiffness matrix is extremely sparse and I'm using CSR format to store it; this CSR storage is also local. I already tried sharing the whole vector and it worked well. I'm trying to optimize the code now, and for that I'm trying to avoid replicating the whole vector on all processors. The matrix is already split when I distribute the elements between processors; each processor has its own stiffness matrix. At this step I could share the whole vector on all processors, do a matrix-vector multiplication locally, and then reduce the result onto one processor. In that case, either one processor does the job while the other rests, or all processors do the same job (solving Ax=b). That's why I'm thinking about solving different systems of equations on different processors and then collecting the results. I hope I've made my point clear.
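For the "send each processor only the vector entries it needs" part, one minimal sketch, assuming a root rank that knows every rank's local-to-global map and packs the needed entries in rank order (names are hypothetical):

```c
#include <mpi.h>

/* Root packs, per rank, just the entries of x that the rank needs
 * (x_packed[displs[r] + i] == x_global[l2g_of_rank_r[i]]) and scatters
 * them, so no rank ever stores the whole vector. */
void scatter_needed_entries(const double *x_packed,   /* root only */
                            const int *counts, const int *displs,
                            double *x_local, int n_local)
{
    MPI_Scatterv(x_packed, counts, displs, MPI_DOUBLE,
                 x_local, n_local, MPI_DOUBLE, 0, MPI_COMM_WORLD);
}
```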
Would that be more of a workload on the machine, in terms of memory and CPU, if all processors do the same job, compared with the case where only one processor works and the others are stopped by a barrier?