Message Passing Interface (MPI)

The Message Passing Interface (MPI) is a standard interface for libraries that provide message passing services for parallel computing. There are multiple implementations of MPI for various parallel computer systems, including two widely used open source implementations, MPICH and LAM-MPI. The Scyld ClusterWare system software includes BeoMPI, our enhanced version of MPICH. For most operations, BeoMPI on a Scyld ClusterWare system is equivalent to running MPICH under Linux. There are a few places where ClusterWare's implementation of MPICH differs, and these are pointed out in this chapter. This chapter also includes a basic introduction to MPI for new parallel programmers; for a more complete treatment, readers are referred to one of the standard texts on MPI or to the MPI Home Page: http://www.mcs.anl.gov/mpi

There are, in fact, two MPI standards: MPI-1 and MPI-2. MPI-2 is not actually a newer version of MPI, but a set of extensions to MPI-1. Most of what a typical parallel programmer needs is in MPI-1, with the possible exception of parallel I/O, which is covered in MPI-2. Many implementations cover all of MPI-1 and some of MPI-2. At the time of this writing, very few MPI implementations cover all of MPI-2, so a new programmer should be wary of seeking out features of MPI-2 that, while interesting, may not be required for their particular problem.

MPI-1 includes all of the basic MPI concepts, in particular point-to-point communication, collective operations, derived datatypes, groups and communicators, and process topologies.

MPI-2 covers a number of extensions and additional features, such as dynamic process management, one-sided communication, and parallel I/O.

The MPI home page provides detailed specifications of all of the MPI features and references to a number of texts on the subject.

Fundamental MPI

Every MPI program must call two MPI functions. Almost every MPI program will call four additional MPI functions, and many MPI programs will not need to call any more than these six. Beyond these six, there are another dozen or so functions that will satisfy the needs of most MPI programs. Thus, this introduction takes the approach of introducing these functions in order of need, so that the new programmer can progress through the text as the complexity of their programs increases.

The six most fundamental MPI function calls are as follows:
	MPI_Init, MPI_Finalize, MPI_Comm_size, MPI_Comm_rank, MPI_Send, MPI_Recv

The first MPI call in every MPI program must be to MPI_Init. This function initializes MPI for the program. Its arguments are the standard C command line arguments, which allows some implementations to check for MPI-specific arguments, process them, and remove them (so your program doesn't have to parse MPI arguments). The last call every MPI program must make is to MPI_Finalize. Failure to call MPI_Finalize can result in some rather interesting run-time errors that can be quite difficult to debug. Below is an example of the minimum possible MPI program.
	#include <mpi.h>

	int main(int argc, char **argv)
	{
		MPI_Init(&argc, &argv);
		// put program here
		MPI_Finalize();
		return 0;
	}
When an MPI program runs, there is not one, but many copies of exactly the same program. These copies are called "tasks" in this manual, and they are the means by which parallel processing is achieved. To get better performance, each task should run on a different processor, or compute node in a Scyld ClusterWare system. It is possible to write MPI programs that expect a fixed number of tasks to be running, but it is far better to write programs that can run correctly using any number of tasks. In this case, it is important to know how many tasks are running as part of the current collection of tasks, or "job." Furthermore, it is often important to know which of those tasks is the currently running task. In other words, if there are 16 tasks running, each task should have a unique number from 0 to 15 that identifies it. This number is called a "rank."

In fact, MPI includes a complex system for arbitrarily grouping sets of tasks together and giving each task within a group a rank that is unique relative to that group. This is an advanced concept that will not be covered here. Let it suffice to say that when an MPI job is created there is a group that consists of all of the tasks in that job. Furthermore, all functions that involve interaction among tasks in MPI do so through a structure called a "communicator" which can be thought of as a communications network between tasks in a group. The communicator associated with the global group that contains all processes in the job is known as:

	MPI_COMM_WORLD
This symbol appears in most of the function calls that follow. For the beginning MPI programmer, this argument is always the same. More advanced programmers can construct their own communicators to build more complex programs, but this feature is not discussed in detail here.

Thus, most MPI programs need to know how many tasks there are in MPI_COMM_WORLD and what their rank is within it. The next two MPI function calls provide this information. These will often be called early in the execution of a program and the information saved in a variable, but they can be called repeatedly as needed throughout the execution of the program.

	int size, rank;
	MPI_Comm_size(MPI_COMM_WORLD, &size);
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size returns the number of tasks in the job, while MPI_Comm_rank returns the rank of the current task (0 .. size-1). These values are typically used to determine what part of the computation each task will perform, and the rank of another task that the current task wants to communicate with. For example, if we wanted every task to communicate with the task whose rank is one greater than its own, and the task whose rank is size-1 to communicate with task 0, then the following could be used:
	int size, rank, partner;
	MPI_Comm_size(MPI_COMM_WORLD, &size);
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);
	partner = (rank + 1) % size;
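Putting the four calls introduced so far together with MPI_Init and MPI_Finalize, the following is a minimal sketch of a complete program (the printed message is purely illustrative) that reports each task's rank, the size of the job, and the partner computed above:
	#include <stdio.h>
	#include <mpi.h>

	int main(int argc, char **argv)
	{
		int size, rank, partner;

		MPI_Init(&argc, &argv);

		/* How many tasks are in the job, and which one is this? */
		MPI_Comm_size(MPI_COMM_WORLD, &size);
		MPI_Comm_rank(MPI_COMM_WORLD, &rank);

		/* Each task pairs with the next higher rank, wrapping around to 0. */
		partner = (rank + 1) % size;

		printf("Task %d of %d will communicate with task %d\n",
			rank, size, partner);

		MPI_Finalize();
		return 0;
	}
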
The most fundamental of all MPI functions are MPI_Send and MPI_Recv. These are the functions actually used to send and receive data between tasks. In both cases the functions include a pointer to a user buffer that contains the data to send (for MPI_Send) or will receive the data (for MPI_Recv), two arguments that identify the data, two arguments that identify the other task involved in the communication, and a special argument called a "tag." The following examples show these calls:
	char sbuf[SCOUNT], rbuf[RCOUNT];
	MPI_Status status;
	MPI_Send(sbuf, SCOUNT, MPI_CHAR, (rank+1)%size, 99, MPI_COMM_WORLD);
	MPI_Recv(rbuf, RCOUNT, MPI_CHAR, (rank+1)%size, 99, MPI_COMM_WORLD, &status);

The arguments are, in order: the pointer to the user's buffer, the count of items, the MPI_Datatype, the rank of the other task, the message tag, the communicator, and (for MPI_Recv only) a pointer to an MPI_Status structure. Each is described in more detail below.

The first argument of each call is the pointer to the user's buffer. The next two arguments specify the number of items stored contiguously in the buffer and the type of the data being sent or received. MPI provides pre-defined MPI_Datatypes for the standard scalar types:

	MPI_CHAR, MPI_SHORT, MPI_INT, MPI_LONG, MPI_FLOAT,
	MPI_DOUBLE, MPI_LONG_DOUBLE, MPI_BYTE, MPI_PACKED
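For example, the following sketch sends an array of 100 doubles to the partner task computed earlier (the count is in elements, not bytes, and the tag 99 is arbitrary):
	double values[100];
	/* The count (100) and the datatype (MPI_DOUBLE) describe the buffer;
	   the receiver must use a compatible count and datatype. */
	MPI_Send(values, 100, MPI_DOUBLE, partner, 99, MPI_COMM_WORLD);
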
MPI also provides methods for creating user-defined types that can represent all kinds of structures and records. These methods are not discussed here. The send buffer is assumed to be at least as large as SCOUNT times the size of the MPI_Datatype. When receiving a message, the task may not know exactly how large the incoming message is. The count and datatype arguments to MPI_Recv specify how large the receive buffer is, and the receive buffer must be large enough for the incoming message, or an error will occur and be reported through the MPI_Status struct. If the message is smaller than the buffer, then only part of the buffer will be filled in. MPI provides a function that allows the programmer to check the size of the next incoming message before receiving it:
	MPI_Probe((rank+1)%size, 99, MPI_COMM_WORLD, &status); // src, tag, comm, stat
The next argument to MPI_Send and MPI_Recv (after the MPI_Datatype) is the rank of the task that the program wants to send to or receive from. For MPI_Recv this can be MPI_ANY_SOURCE, which allows messages to be received from any other task; the source task can then be determined through the MPI_Status argument. After the rank argument is the message tag argument. All messages are sent with an integer tag, and MPI_Recv will only receive a message sent with a tag that matches the tag specified in the MPI_Recv call. There is no predefined meaning to message tags. Some programs can be written with all messages using the same tag; in more complex programs the message tag allows the receiving task to control the order in which messages are received. The choice of how the tags are used is entirely up to the programmer. MPI_Recv can receive a message with any tag by specifying MPI_ANY_TAG for the tag field.
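
For example, the following sketch accepts a message from any task with any tag, using the same buffer size as above; the actual sender and tag can then be read from the status argument described below:
	char rbuf[RCOUNT];
	MPI_Status status;
	/* Accept a message from any sender, with any tag. */
	MPI_Recv(rbuf, RCOUNT, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG,
		MPI_COMM_WORLD, &status);
	/* status.MPI_SOURCE and status.MPI_TAG identify the sender and the tag. */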

The MPI_Comm argument specifies the group of tasks that is communicating. Messages sent with a given communicator can only be received using the same communicator. The rank specified as the destination or source of the message is interpreted in terms of the communicator specified. Unless the programmer is working with multiple communicators (an advanced topic) this should always be MPI_COMM_WORLD.

MPI_Recv has one more argument than MPI_Send. The MPI_Status struct defines three integer fields that return information to the program about a prior call to MPI_Recv. The fields are:

	MPI_Status status;
	printf( ... , status.MPI_SOURCE); // the source rank of the message
	printf( ... , status.MPI_TAG);    // the message tag
	printf( ... , status.MPI_ERROR);  // error codes
These are particularly useful if the program uses MPI_ANY_SOURCE or MPI_ANY_TAG. In addition, the status can be used to get a count of the actual number of items received using another MPI function call:
	int count;
	MPI_Get_count(&status, MPI_CHAR, &count);
This call is useful when the receiver doesn't know the actual size of the message being received.
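
Combining MPI_Probe and MPI_Get_count, a receiver can size its buffer to fit the incoming message exactly. The following is a sketch of that pattern; the source rank (the partner task from the earlier examples) and the tag 99 are just illustrative:
	int count;
	char *buf;
	MPI_Status status;

	/* Wait until a message from the partner is available, without receiving it. */
	MPI_Probe(partner, 99, MPI_COMM_WORLD, &status);

	/* Find out how many MPI_CHAR items the pending message contains. */
	MPI_Get_count(&status, MPI_CHAR, &count);

	/* Allocate exactly enough space and receive the message.
	   (malloc requires <stdlib.h>; free the buffer when done with it.) */
	buf = malloc(count);
	MPI_Recv(buf, count, MPI_CHAR, status.MPI_SOURCE, status.MPI_TAG,
		MPI_COMM_WORLD, &status);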

When a task calls either MPI_Send or MPI_Recv, the corresponding message is said to be "posted." In order for a message to be delivered, an MPI_Send must be matched with a corresponding MPI_Recv. Until a given message is matched, it is said to be "pending." There are a number of rules governing how MPI_Sends and MPI_Recvs are matched in order to complete delivery of the messages. The first rules are the obvious ones: the message must have been sent to the receiver and received from the sender, both calls must use the same communicator, and the tags must match. Of course, MPI_ANY_TAG matches any tag and MPI_ANY_SOURCE matches any sender. These rules, however, only form the basis for matching messages. Many messages may be posted at the same time, and some posted MPI_Sends may match many MPI_Recvs while some MPI_Recvs may match many MPI_Sends.

This comes about for a variety of reasons. On the one hand, messages can be sent to a given task from many different tasks; for example, all tasks could post an MPI_Send to a single task. In this case MPI makes no assumptions or guarantees as to which posted message will match an MPI_Recv. This, in fact, potentially creates non-determinism within the parallel program, as one generally cannot say in what order the messages will be received (assuming the MPI_Recv uses MPI_ANY_SOURCE to allow it to receive any of the posted messages). As stated in the MPI specification, fairness is not guaranteed. On the other hand, it is possible for more than one message to be sent from the same source task to a given destination task. In this case, MPI does guarantee that the first message sent will be the first message received, assuming both messages could potentially match the first receive. Thus MPI guarantees that messages are non-overtaking.
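
The non-overtaking guarantee can be illustrated with the following sketch, in which task 0 sends two messages with the same tag to task 1; the first MPI_Recv on task 1 is guaranteed to match the first MPI_Send:
	int first = 1, second = 2, a, b;
	MPI_Status status;

	if (rank == 0) {
		MPI_Send(&first, 1, MPI_INT, 1, 99, MPI_COMM_WORLD);
		MPI_Send(&second, 1, MPI_INT, 1, 99, MPI_COMM_WORLD);
	} else if (rank == 1) {
		/* Non-overtaking: a receives 1 and b receives 2, in that order. */
		MPI_Recv(&a, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);
		MPI_Recv(&b, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);
	}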

A more subtle issue is that of guaranteeing progress. Under MPI, if one task posts an MPI_Send that matches an MPI_Recv posted by another task, then at least one of those two operations must complete. It is possible that the two operations will not be matched to each other — if, for example, the MPI_Recv matches a different MPI_Send or the MPI_Send matches a different MPI_Recv, but in each of those cases at least one of the original two calls will complete. Thus, it is not possible for the system to hang with potentially matching sends and receives pending.

Finally, MPI addresses the issue of buffering. One way that a task can have two different MPI_Sends posted at the same time (thus leading to the situations described above) is that the MPI_Send function may return to the program, allowing the program to continue and potentially make another such call, even though the send is still pending. In this case the MPI library or the underlying operating system may have copied the user's buffer into a system buffer, so the data can be said to be "in" the communication system. Most implementations of MPI provide for a certain amount of buffering of messages, and this is allowed for standard MPI_Send calls. However, MPI does not guarantee that there will be enough system buffers for the calls made, or any buffering at all. Programs that expect a certain amount of buffering for standard MPI_Send calls are considered incorrect. As buffering is an important part of many message passing systems, MPI does provide mechanisms to control how message buffering is handled, giving rise to the various semantic issues described here.
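
The following sketch shows why relying on buffering is unsafe and how a program can avoid the dependence. Here each task exchanges messages with its neighbors in a ring (the right and left variables, and the SCOUNT and RCOUNT sizes, follow the earlier examples). If every task called MPI_Send before MPI_Recv, completion would depend on the system buffering the messages, so the order is alternated instead:
	char sbuf[SCOUNT], rbuf[RCOUNT];
	MPI_Status status;
	int right = (rank + 1) % size;        /* send to the next rank */
	int left = (rank + size - 1) % size;  /* receive from the previous rank */

	/* Alternate the order of send and receive so that a matching receive
	   is always posted for each pending send, regardless of buffering. */
	if (rank % 2 == 0) {
		MPI_Send(sbuf, SCOUNT, MPI_CHAR, right, 99, MPI_COMM_WORLD);
		MPI_Recv(rbuf, RCOUNT, MPI_CHAR, left, 99, MPI_COMM_WORLD, &status);
	} else {
		MPI_Recv(rbuf, RCOUNT, MPI_CHAR, left, 99, MPI_COMM_WORLD, &status);
		MPI_Send(sbuf, SCOUNT, MPI_CHAR, right, 99, MPI_COMM_WORLD);
	}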

To summarize, MPI's rules for point-to-point message passing guarantee that messages between a given pair of tasks are non-overtaking and that progress is made whenever a matching send and receive are posted, but they do not guarantee fairness among competing senders or any particular amount of system buffering.

These semantic issues play an even bigger role when asynchronous send and receive are employed. This is a more advanced form of message passing that is very powerful and important for many applications. Asynchronous send and receive can quickly lead to all of the semantic issues listed above. The details of this are not discussed here.
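
Although the details are beyond the scope of this chapter, the following sketch shows the general shape of such an asynchronous (nonblocking) exchange using the standard MPI_Isend, MPI_Irecv, and MPI_Waitall calls, with the right and left neighbors from the previous sketch; the matching, ordering, and buffering rules described above still apply:
	char sbuf[SCOUNT], rbuf[RCOUNT];
	MPI_Request requests[2];
	MPI_Status statuses[2];

	/* Post the send and the receive without waiting for either to complete. */
	MPI_Isend(sbuf, SCOUNT, MPI_CHAR, right, 99, MPI_COMM_WORLD, &requests[0]);
	MPI_Irecv(rbuf, RCOUNT, MPI_CHAR, left, 99, MPI_COMM_WORLD, &requests[1]);

	/* Useful computation can overlap the communication here. */

	/* Wait for both operations before reusing either buffer. */
	MPI_Waitall(2, requests, statuses);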