Scyld ClusterWare HPC: Programmer's Guide
The Message Passing Interface (MPI) is a standard interface for libraries that provide message passing services for parallel computing. There are multiple implementations of MPI for various parallel computer systems, including two widely used open source implementations, MPICH and LAM-MPI. The Scyld ClusterWare system software includes BeoMPI, our enhanced version of MPICH. For most operations, BeoMPI on a Scyld ClusterWare system is equivalent to running MPICH under Linux. There are a few places where ClusterWare's implementation of MPICH is different, and these will be pointed out in this chapter. This chapter also includes a basic introduction to MPI for new parallel programmers. Readers are referred to one of the standard texts on MPI or to the MPI Home Page: http://www.mcs.anl.gov/mpi
There are, in fact, two MPI standards: MPI-1 and MPI-2. MPI-2 is not a newer version of MPI, but rather a set of extensions to MPI-1. Most of what a typical parallel programmer needs is in MPI-1, with the possible exception of parallel I/O, which is covered in MPI-2. Many implementations cover all of MPI-1 and some of MPI-2. At the time of this writing, very few MPI implementations cover all of MPI-2, so a new programmer should be wary of seeking out MPI-2 features that, while interesting, may not be required for the problem at hand.
MPI-1 includes all of the basic MPI concepts and in particular:
Point-to-point communication
Collective communication
Communicators
MPI datatypes
MPI-2 covers a number of extensions and additional features, such as:
MPI-IO
One-sided communication
Connection-based communication
Issues such as threading
The MPI home page provides detailed specifications of all of the MPI features and references to a number of texts on the subject.
Every MPI program must call two MPI functions. Almost every MPI program will call four additional MPI functions. Many MPI programs will not need to call any more than these six. Beyond these six, there are another dozen or so functions that will satisfy most MPI programs. This introduction therefore presents the functions in order of need, so that the new programmer can progress through the text as the complexity of his or her programs increases.
The six most fundamental MPI function calls are as follows:
MPI_Init - start using MPI
MPI_Comm_size - get the number of tasks
MPI_Comm_rank - get the unique rank of this task
MPI_Send - send a message
MPI_Recv - receive a message
MPI_Finalize - stop using MPI
MPI_Init must be called before any other MPI function, and MPI_Finalize must be called before the program exits; every MPI program therefore has the following basic structure:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        // put program here
        MPI_Finalize();
        return 0;
    }
In fact, MPI includes a complex system for arbitrarily grouping sets of tasks together and giving each task within a group a rank that is unique relative to that group. This is an advanced concept that will not be covered here. Let it suffice to say that when an MPI job is created there is a group that consists of all of the tasks in that job. Furthermore, all functions that involve interaction among tasks in MPI do so through a structure called a "communicator" which can be thought of as a communications network between tasks in a group. The communicator associated with the global group that contains all processes in the job is known as:
    MPI_COMM_WORLD
Thus, most MPI programs need to know how many tasks there are in MPI_COMM_WORLD, and what their rank is within MPI_COMM_WORLD. The next two MPI function calls provide this information. They are often called early in the execution of a program, with the results saved in variables, but they can be called repeatedly as needed throughout the execution of the program.
    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
A task commonly uses its rank and the size to choose a communication partner, for example the next task around a ring (note the parentheses, which are needed because % binds more tightly than +):

    int size, rank, partner;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    partner = (rank + 1) % size;
The last two of the six fundamental calls, MPI_Send and MPI_Recv, actually transfer data. For example, to send a character buffer to the partner task and receive one from it:

    char sbuf[SCOUNT], rbuf[RCOUNT];
    MPI_Status status;
    MPI_Send(sbuf, SCOUNT, MPI_CHAR, (rank+1)%size, 99, MPI_COMM_WORLD);
    MPI_Recv(rbuf, RCOUNT, MPI_CHAR, (rank+1)%size, 99, MPI_COMM_WORLD, &status);
The arguments are identified as follows:
sbuf : pointer to send buffer
rbuf : pointer to receive buffer
SCOUNT : items in send buffer
RCOUNT : items in receive buffer
MPI_CHAR : MPI datatype
(rank+1)%size : source or destination task rank
99 : message tag
MPI_COMM_WORLD : communicator
&status : pointer to status struct
The first argument of each call is a pointer to the user's buffer. The next two arguments specify the number of data items stored contiguously in the buffer and the type of those items. MPI provides pre-defined MPI_Datatypes for the standard scalar types:
    MPI_CHAR, MPI_SHORT, MPI_INT, MPI_LONG, MPI_FLOAT,
    MPI_DOUBLE, MPI_LONG_DOUBLE, MPI_BYTE, MPI_PACKED
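For example, a buffer of doubles is described by its element count and MPI_DOUBLE. The sketch below is illustrative only; the buffer size, destination rank 0, and tag 42 are arbitrary choices:

    double results[100];
    /* ... fill results ... */
    /* The count is the number of elements, not the number of bytes. */
    MPI_Send(results, 100, MPI_DOUBLE, 0, 42, MPI_COMM_WORLD);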
The tag argument (99 in the example above) is an integer chosen by the programmer to identify the message; a receive matches only a send with the same tag (or MPI_ANY_TAG). The related call MPI_Probe checks whether a matching message is pending without actually receiving it:

    MPI_Probe((rank+1)%size, 99, MPI_COMM_WORLD, &status);  // src, tag, comm, stat
The MPI_Comm argument specifies the group of tasks that is communicating. Messages sent with a given communicator can only be received using the same communicator. The rank specified as the destination or source of the message is interpreted in terms of the communicator specified. Unless the programmer is working with multiple communicators (an advanced topic) this should always be MPI_COMM_WORLD.
MPI_Recv has one more argument than MPI_Send. The MPI_Status struct defines three integer fields that return information to the program about a prior call to MPI_Recv. The fields are:
    MPI_Status status;
    printf( ... , status.MPI_SOURCE);  // the source rank of the message
    printf( ... , status.MPI_TAG);     // the message tag
    printf( ... , status.MPI_ERROR);   // error codes
The status does not directly report how many items were actually received; that count is obtained from the status with MPI_Get_count:

    int count;
    MPI_Get_count(&status, MPI_CHAR, &count);
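Taken together, MPI_Probe, MPI_Get_count, and MPI_Recv can receive a message whose length is not known in advance. The following is a minimal sketch, assuming <stdlib.h> and <mpi.h> have been included, MPI_Init has been called, and the message uses tag 99 and type MPI_CHAR:

    MPI_Status status;
    int count;
    char *buf;

    /* Block until a matching message is pending, without receiving it. */
    MPI_Probe(MPI_ANY_SOURCE, 99, MPI_COMM_WORLD, &status);

    /* Find out how many MPI_CHAR items the pending message contains. */
    MPI_Get_count(&status, MPI_CHAR, &count);

    /* Allocate just enough space and receive from the probed source. */
    buf = malloc(count);
    MPI_Recv(buf, count, MPI_CHAR, status.MPI_SOURCE, 99,
             MPI_COMM_WORLD, &status);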
When a task calls either MPI_Send or MPI_Recv, the corresponding message is said to be "posted." In order for a message to be delivered, an MPI_Send must be matched with a corresponding MPI_Recv. Until a given message is matched, it is said to be "pending." There are a number of rules governing how MPI_Sends and MPI_Recvs can be matched in order to complete delivery of the messages. The first set of rules is the obvious one: the message must have been sent to the receiver and received from the sender, both calls must use the same communicator, and the tags must match. Of course, MPI_ANY_TAG matches any tag and MPI_ANY_SOURCE matches any sender. These rules, however, only form the basis for matching messages. It is possible to have many messages posted at the same time, and it is possible that some posted MPI_Sends may match many MPI_Recvs and that some MPI_Recvs may match many MPI_Sends.
This comes about for a variety of reasons. On the one hand, messages can be sent from multiple tasks to multiple tasks; for example, all tasks could post an MPI_Send to a single task. In this case MPI makes no assumptions or guarantees as to which posted message will match an MPI_Recv. This potentially creates non-determinism within the parallel program, since one generally cannot say in what order the messages will be received (assuming the MPI_Recv uses MPI_ANY_SOURCE to allow it to receive any of the posted messages). As stated in the MPI specification, fairness is not guaranteed. On the other hand, it is possible for more than one message to be sent from the same source task to a given destination task. In this case, MPI does guarantee that the first message sent will be the first message received, assuming both messages could potentially match the first receive. Thus MPI guarantees that messages are non-overtaking.
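The following sketch illustrates the first point: every non-zero rank sends one message to rank 0, which receives with MPI_ANY_SOURCE, and the order in which the sources appear is not defined. It assumes the usual headers and that MPI_Init has been called; the tag 7 and the message contents are arbitrary.

    int size, rank, i, data;
    MPI_Status status;

    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank != 0) {
        data = rank;
        MPI_Send(&data, 1, MPI_INT, 0, 7, MPI_COMM_WORLD);
    } else {
        for (i = 0; i < size - 1; i++) {
            /* Any sender may match; the arrival order is not guaranteed. */
            MPI_Recv(&data, 1, MPI_INT, MPI_ANY_SOURCE, 7,
                     MPI_COMM_WORLD, &status);
            printf("received %d from rank %d\n", data, status.MPI_SOURCE);
        }
    }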
A more subtle issue is that of guaranteeing progress. Under MPI, if one task posts an MPI_Send that matches an MPI_Recv posted by another task, then at least one of those two operations must complete. It is possible that the two operations will not be matched to each other — if, for example, the MPI_Recv matches a different MPI_Send or the MPI_Send matches a different MPI_Recv, but in each of those cases at least one of the original two calls will complete. Thus, it is not possible for the system to hang with potentially matching sends and receives pending.
Finally, MPI addresses the issue of buffering. One way that a task can have two different MPI_Sends posted at the same time (thus leading to the situations described above) is that the MPI_Send function returns to the program (thus allowing the program to continue and potentially make another such call) even though the send is still pending. In this case the MPI library or the underlying operating system may have copied the user's buffer into a system buffer — thus the data can be said to be "in" the communication system. Most implementations of MPI provide for a certain amount of buffering of messages and this is allowed for standard MPI_Send calls. However, MPI does not guarantee that there will be enough system buffers for the calls made, or any buffering at all. Programs that expect a certain amount of buffering for standard MPI_Send calls are considered incorrect. As buffering is an important part of many message passing systems, MPI does provide mechanisms to control how message buffering is handled, thus giving rise to the various semantic issues described here.
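As an illustration of why relying on buffering is unsafe, consider two tasks that each call MPI_Send to the other before posting their MPI_Recv: if the system cannot buffer the messages, both sends block and neither receive is ever posted. The sketch below shows the unsafe pattern and one safe alternative using MPI_Sendrecv, which pairs the two transfers without depending on buffering; the buffer names, count, and tag are illustrative.

    /* Unsafe: both tasks send first.  This completes only if the system
       buffers at least one of the messages, which MPI does not guarantee. */
    MPI_Send(sbuf, COUNT, MPI_CHAR, partner, 5, MPI_COMM_WORLD);
    MPI_Recv(rbuf, COUNT, MPI_CHAR, partner, 5, MPI_COMM_WORLD, &status);

    /* Safe: let the library handle the pairing. */
    MPI_Sendrecv(sbuf, COUNT, MPI_CHAR, partner, 5,
                 rbuf, COUNT, MPI_CHAR, partner, 5,
                 MPI_COMM_WORLD, &status);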
To summarize, MPI provides a set of semantic rules for point-to-point message passing as follows:
MPI_Send and MPI_Recv must match
Fairness is not guaranteed
Messages are non-overtaking
Progress is guaranteed
System buffering resources are not guaranteed
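Putting the six fundamental calls together, the following is a minimal sketch of a complete program that passes each task's rank one step around a ring. To respect the rules above and avoid depending on system buffering, even-numbered ranks send first and odd-numbered ranks receive first. The file name, tag, and output format are arbitrary; compile and launch with your system's MPI tools (typically mpicc and mpirun).

    /* ring.c - pass each task's rank one step around a ring. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int size, rank, next, prev, sendval, recvval;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        next = (rank + 1) % size;
        prev = (rank + size - 1) % size;
        sendval = rank;

        if (size > 1) {
            if (rank % 2 == 0) {
                /* Even ranks send first, then receive. */
                MPI_Send(&sendval, 1, MPI_INT, next, 99, MPI_COMM_WORLD);
                MPI_Recv(&recvval, 1, MPI_INT, prev, 99, MPI_COMM_WORLD, &status);
            } else {
                /* Odd ranks receive first, then send. */
                MPI_Recv(&recvval, 1, MPI_INT, prev, 99, MPI_COMM_WORLD, &status);
                MPI_Send(&sendval, 1, MPI_INT, next, 99, MPI_COMM_WORLD);
            }
            printf("task %d of %d received %d from task %d\n",
                   rank, size, recvval, status.MPI_SOURCE);
        }

        MPI_Finalize();
        return 0;
    }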