MPI Blocking and Non-Blocking Communication

I'm currently working on a lattice Boltzmann code (D3Q27) parallelized with MPI. I've set up an MPI 3D topology for the communication, and the code snippet below handles the exchange as follows:

void Simulation::Communicate(int iter) {

    // One message tag per face direction: x/y/z, plus/minus.
    int tag_xp = 0;
    int tag_xm = 1;
    int tag_yp = 2;
    int tag_ym = 3;
    int tag_zp = 4;
    int tag_zm = 5;

    MPI_Status status;

    // Exchange with the right (+x) neighbour: pack the last interior
    // x-plane (x = Nx - 2), swap it, and unpack into the halo plane (x = Nx - 1).
    if (SubDomain_.my_right_ != MPI_PROC_NULL) {
        std::vector<double> send_data;
        for (int k = 0; k < SubDomain_.my_Nz_; k++) {
            for (int j = 0; j < SubDomain_.my_Ny_; j++) {

                if (SubDomain_.lattice_[SubDomain_.my_Nx_ - 2][j][k] == nullptr) {
                    // No fluid node at this site: send placeholder zeros.
                    for (int dir = 0; dir < _nLatNodes; dir++) {
                        send_data.push_back(0.0);
                    }
                }
                else {
                    for (int dir = 0; dir < _nLatNodes; dir++) {
                        send_data.push_back(SubDomain_.lattice_[SubDomain_.my_Nx_ - 2][j][k]->m_distributions[dir]);
                    }
                }

            }
        }

        std::vector<double> recv_data(send_data.size());

        // Blocking exchange with the same neighbour: send with tag_xp,
        // receive the neighbour's tag_xm message.
        MPI_Sendrecv(send_data.data(), send_data.size(), MPI_DOUBLE, SubDomain_.my_right_, tag_xp,
            recv_data.data(), recv_data.size(), MPI_DOUBLE, SubDomain_.my_right_, tag_xm,
            MPI_COMM_WORLD, &status);

        // Unpack into the right halo plane.
        int index = 0;
        for (int k = 0; k < SubDomain_.my_Nz_; k++) {
            for (int j = 0; j < SubDomain_.my_Ny_; j++) {
                for (int dir = 0; dir < _nLatNodes; dir++) {
                    SubDomain_.lattice_[SubDomain_.my_Nx_ - 1][j][k]->m_distributions[dir] = recv_data[index];
                    index++;
                }
            }
        }

    }

    // Exchange with the left (-x) neighbour: pack the first interior
    // x-plane (x = 1), swap it, and unpack into the halo plane (x = 0).
    if (SubDomain_.my_left_ != MPI_PROC_NULL) {
        std::vector<double> send_data;
        for (int k = 0; k < SubDomain_.my_Nz_; k++) {
            for (int j = 0; j < SubDomain_.my_Ny_; j++) {

                if (SubDomain_.lattice_[1][j][k] == nullptr) {
                    // No fluid node at this site: send placeholder zeros.
                    for (int dir = 0; dir < _nLatNodes; dir++) {
                        send_data.push_back(0.0);
                    }
                }
                else {
                    for (int dir = 0; dir < _nLatNodes; dir++) {
                        send_data.push_back(SubDomain_.lattice_[1][j][k]->m_distributions[dir]);
                    }
                }

            }
        }

        std::vector<double> recv_data(send_data.size());

        // Send with tag_xm, receive the neighbour's tag_xp message.
        MPI_Sendrecv(send_data.data(), send_data.size(), MPI_DOUBLE, SubDomain_.my_left_, tag_xm,
            recv_data.data(), recv_data.size(), MPI_DOUBLE, SubDomain_.my_left_, tag_xp,
            MPI_COMM_WORLD, &status);

        // Unpack into the left halo plane.
        int index = 0;
        for (int k = 0; k < SubDomain_.my_Nz_; k++) {
            for (int j = 0; j < SubDomain_.my_Ny_; j++) {
                for (int dir = 0; dir < _nLatNodes; dir++) {
                    SubDomain_.lattice_[0][j][k]->m_distributions[dir] = recv_data[index];
                    index++;
                }
            }
        }
    }
}


I use the same structure for the front-back and up-down communication.

While I can verify by printing the sent and received data that the communication itself occurs (the received values are not zeroed out), the visualization suggests that the data is not reaching the neighboring processors as expected. After each iteration I visualize the velocity components obtained via the Lattice Boltzmann Method (LBM). The fluid dynamics are only resolved on the processor that holds the inlet boundary condition, while all other processors show zero velocity, which suggests that data transfer to the neighboring processors is not happening as it should.

I have a couple of concerns:

Could data corruption arise from blocking communication?
Is diagonal communication necessary? My understanding is that if communication in the normal directions (x, y, and z) is established, diagonal communication implicitly occurs.
Additionally, I'm uncertain about the order of communication. Do all communications happen simultaneously, or is it sequential (e.g., right and left, then front and back, then up and down)? If they're not simultaneous, would diagonal communication be required?

I'd appreciate any insights to clarify these points of confusion. Thank you!
Any idea?
Consider this line of code:
MPI_Sendrecv(send_data.data(), send_data.size(), MPI_DOUBLE, SubDomain_.my_right_, tag_xp,
            recv_data.data(), recv_data.size(), MPI_DOUBLE, SubDomain_.my_right_, tag_xm,
            MPI_COMM_WORLD, &status);

The man page
https://www.open-mpi.org/doc/v4.0/man3/MPI_Sendrecv.3.php
indicates that the destination and source of the message are the same (the right node, SubDomain_.my_right_). Could that be the problem, or part of it?
The same pattern is repeated for the left side.

Note I have zero experience with MPI -- I just read about it for five minutes.
@mbozzi
Thank you. I think it's correct; see https://rookiehpc.org/mpi/docs/mpi_sendrecv/index.html
Isn't it the case that a process to the right of a sender should receive from its left?

For example if there's two processes L and R
L <---> R

Process L should send a message to its right;
Process R should receive that message from its left.

I think that, the way it's written, process R will try to receive from its right.
@mbozzi

that a process to the right of a sender should receive from its left
This means there is a processor which has a neighbor on its left, and that case is handled by SubDomain_.my_left_ != MPI_PROC_NULL.

I think this

Process L should send a message to its right;
Process R should receive that message from its left. 

will be handled by the condition SubDomain_.my_right_ != MPI_PROC_NULL: send the data to the right processor and also receive from it,

and SubDomain_.my_left_ != MPI_PROC_NULL: send the data to the left processor and also receive from it.

Look at it this way: each internal processor needs to send data to, and receive data from, all 6 of its neighboring processors.
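
To make the pairing concrete, here is a minimal two-rank sketch (a standalone toy program with the same guards and tags, not my actual code):

// halo_pair.cpp -- run with: mpirun -np 2 ./halo_pair
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int tag_xp = 0, tag_xm = 1;
    // Rank 0 plays the "left" process, rank 1 is its "right" neighbour.
    int right = (rank == 0) ? 1 : MPI_PROC_NULL;
    int left  = (rank == 1) ? 0 : MPI_PROC_NULL;

    double send_val = 100.0 + rank, recv_val = -1.0;

    if (right != MPI_PROC_NULL)   // same guard as the my_right_ block
        MPI_Sendrecv(&send_val, 1, MPI_DOUBLE, right, tag_xp,
                     &recv_val, 1, MPI_DOUBLE, right, tag_xm,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    if (left != MPI_PROC_NULL)    // same guard as the my_left_ block
        MPI_Sendrecv(&send_val, 1, MPI_DOUBLE, left, tag_xm,
                     &recv_val, 1, MPI_DOUBLE, left, tag_xp,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    std::printf("rank %d received %f\n", rank, recv_val);
    MPI_Finalize();
}

Rank 0's send (tag_xp, to rank 1) matches rank 1's receive (tag_xp, from rank 0), and rank 1's send (tag_xm) matches rank 0's receive (tag_xm). So having the same rank as both source and destination inside one MPI_Sendrecv is fine: each call talks to a single neighbor, and the tags keep the two directions apart.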
This will be handled by the condition SubDomain_.my_right_ != MPI_PROC_NULL: send the data to the right processor and also receive from it.

Ok, thank you. I think I understand now.

In your algorithm, I assume there's a set up stage where you distribute the initial state of the problem out to all the subprocesses. Then an iterative process where each iteration all subprocesses need to communicate only with their immediate neighbors. Then finally the results are collated.

Am I right at all? Could you explain a little how the algorithm works overall?

Have you checked for silly mistakes in the initialization & finalization steps? What about the place that you actually print out the results?

Could data corruption arise from blocking communication?

Do you mean corruption in the sense that the message is changed in transit? Unlikely, but possible. This issue is almost certainly a programming error.

I don't see anything wrong with the code you posted from a C++ perspective.

Is diagonal communication necessary? My understanding is that if communication in the normal directions (x, y, and z) is established, diagonal communication implicitly occurs.

Consider a flood-fill algorithm on a 2D grid:

X . .  -->  X X .  -->  X X X 
. . .       X . .       X X . 
. . .       . . .       X . .

It takes two iterations for the information from the upper left to propagate diagonally to the middle. Whether this is desirable depends on the algorithm.
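
If it helps, here is a tiny standalone C++ sketch of that same 4-neighbour propagation (just an illustration, nothing MPI-specific):

// flood2d.cpp -- 4-neighbour propagation on a 3x3 grid, as in the picture above.
#include <array>
#include <cstdio>

int main() {
    const int N = 3;
    std::array<std::array<int, N>, N> g{};   // 0 = '.', 1 = 'X'
    g[0][0] = 1;                             // seed in the upper-left corner

    for (int iter = 1; iter <= 2; ++iter) {
        auto next = g;
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j) {
                if (i > 0     && g[i - 1][j]) next[i][j] = 1;   // from above
                if (i < N - 1 && g[i + 1][j]) next[i][j] = 1;   // from below
                if (j > 0     && g[i][j - 1]) next[i][j] = 1;   // from the left
                if (j < N - 1 && g[i][j + 1]) next[i][j] = 1;   // from the right
            }
        g = next;
        std::printf("after iteration %d:\n", iter);
        for (int i = 0; i < N; ++i) {
            for (int j = 0; j < N; ++j) std::printf("%c ", g[i][j] ? 'X' : '.');
            std::printf("\n");
        }
    }
    // The centre cell (the diagonal neighbour of the seed) only becomes 'X'
    // on the second iteration: face-only exchange needs two steps per diagonal hop.
}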

Additionally, I'm uncertain about the order of communication. Do all communications happen simultaneously, or is it sequential (e.g., right and left, then front and back, then up and down)?

I think that communication is sequential. MPI is "Message Passing Interface" and at a basic level it just tells the computer to send memory (messages) between processes. I think that each MPI_Send call creates exactly one message.
@mbozzi thanks for your help.

In your algorithm, I assume there's a set up stage
where you distribute the initial state of the problem out 
to all the subprocesses. Then an iterative process where 
each iteration all subprocesses need to communicate only 
with their immediate neighbors. Then finally the results
are collated.

Am I right at all? Could you explain a little how the
algorithm works overall?


Yes, this is right. First the domain is decomposed among the MPI processes (3D topology). Each processor then owns its own part of the domain, knows its immediate neighbor in each direction, and can communicate with them.
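
For context, the neighbor ranks come from the Cartesian topology roughly like this (a sketch from memory with illustrative variable names, not the exact code):

// 3D Cartesian topology setup (sketch).
int nprocs, dims[3] = {0, 0, 0}, periods[3] = {0, 0, 0};
MPI_Comm cart_comm;
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Dims_create(nprocs, 3, dims);                        // factor nprocs into a 3D grid
MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, /*reorder=*/1, &cart_comm);

int my_left, my_right, my_front, my_back, my_down, my_up;
MPI_Cart_shift(cart_comm, 0, 1, &my_left,  &my_right);   // x neighbors
MPI_Cart_shift(cart_comm, 1, 1, &my_front, &my_back);    // y neighbors
MPI_Cart_shift(cart_comm, 2, 1, &my_down,  &my_up);      // z neighbors
// Ranks on the domain boundary get MPI_PROC_NULL, which is what the
// != MPI_PROC_NULL guards in Communicate() test for.

One thing I should double-check on my side: if the Cartesian communicator was created with reorder enabled, the neighbor ranks returned by MPI_Cart_shift are ranks in that communicator, so the MPI_Sendrecv calls should use it rather than MPI_COMM_WORLD.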

Have you checked for silly mistakes in the initialization
& finalization steps? What about the place that you actually print 
out the results?


I have been thinking the same, but I've checked the code many times and nothing seems to be wrong.


Consider a flood-fill algorithm on a 2D grid:

X . .  -->  X X .  -->  X X X 
. . .       X . .       X X . 
. . .       . . .       X . .

It takes two iterations for the information from the
upper left to propagate diagonally to the middle. 
Whether this is desirable depends on the algorithm.


This is actually an LBM code solving blood flow inside an aneurysm. It is a complex geometry; we build the fluid points inside it using a collision-detection algorithm, and for the LBM part the collision and streaming are done locally. Some processors do not contain any fluid points, but for now they still take part in the communication (I will fix this later). So the communication happens at the boundaries of all processors, and the boundary of each processor is also covered by halo lattices. Maybe this helps you get an idea of what is wrong in the code. I also think that without diagonal communication there will be a lag between the collision and streaming steps, and I have no idea whether that can cause any issue.