Scyld ClusterWare HPC: Programmer's Guide

More MPI Point-to-Point Features

MPI provides several communication modes that let the programmer control the semantics of point-to-point communication. Specifically, the modes control how and when message buffering is done, and in some cases allow the underlying network transport to optimize message delivery. A mode is selected by calling one of four variations of the MPI_Send function. All four take the same argument list and are called the same way; they differ only in their message delivery semantics. Each of them matches the same standard MPI_Recv function call.

	// Standard Mode
	MPI_Send(buf, count, datatype, dest, tag, MPI_COMM_WORLD);
	// Buffered Mode
	MPI_Bsend(buf, count, datatype, dest, tag, MPI_COMM_WORLD);
	// Synchronous Mode
	MPI_Ssend(buf, count, datatype, dest, tag, MPI_COMM_WORLD);
	// Ready Mode
	MPI_Rsend(buf, count, datatype, dest, tag, MPI_COMM_WORLD);

In standard mode, messages are buffered if the operating system has buffer space available. Otherwise, the call to MPI_Send blocks until the message is delivered or system buffer space becomes available. Completion of the MPI_Send call therefore does not imply that the message has been delivered, because it may still be held in a system buffer.

In buffered mode the programmer provides buffer space for messages by allocating user space memory and providing it to the MPI library for use in buffering messages. MPI_Bsend returns as soon as the message is copied into the buffer. If insufficient buffer space exists for the message, the call returns with an error. Thus, MPI_Bsend will not block as MPI_Send can. Completion of MPI_Bsend does not imply that the message has been delivered, because it can be buffered in the space provided.

MPI provides two functions for managing this buffer space: one gives a buffer to the MPI library, and the other takes it back.

	MPI_Buffer_attach(void *buffer, int size);
	MPI_Buffer_detach(void **buffer_addr, int *size);

MPI_Buffer_detach waits until all buffered messages are sent before it returns. After it returns, the buffer can be reused or freed as needed.

In synchronous mode, MPI_Ssend blocks until a matching receive is posted and the data is either buffered by the operating system or delivered to the destination task. Completion of MPI_Ssend does not imply that the corresponding MPI_Recv has completed, but it does imply that the MPI_Recv has started.

Ready mode is a special communication mode for systems whose network transports can optimize throughput when sender and receiver are well synchronized. In this case, if the receiver is ready for the message (has already called MPI_Recv) when the send is made, data can be transferred directly between user memory and the network device on both the sending and receiving ends. MPI_Rsend can only be called correctly if the matching MPI_Recv has already been posted. MPI_Rsend completes when the message is either received or copied to a system buffer. On ClusterWare systems using traditional networking such as Ethernet, this call is no different from standard mode MPI_Send. Some advanced network technologies such as Myrinet may be able to make use of MPI_Rsend.

Many MPI applications work by having all of the tasks in the job exchange data, either in a ring of some sort or with a specific partner task. In these cases each task must both send and receive data. Depending on the communication mode and the composition of the messages, this can result in deadlock, for example when every task blocks in a send and none reaches its receive. MPI provides a special send/recv primitive that both sends and receives a message, potentially with different tasks, and guarantees that the semantics will not result in a deadlock.

	MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype sendtype,
		int dest, int sendtag,
		void *recvbuf, int recvcount, MPI_Datatype recvtype,
		int source, int recvtag,
		MPI_Comm comm, MPI_Status *status);
	MPI_Sendrecv_replace(void* buf, int count, MPI_Datatype datatype,
		int dest, int sendtag,
		int source, int recvtag,
		MPI_Comm comm, MPI_Status *status);

In MPI_Sendrecv, the programmer specifies a buffer, count, type, and tag for each of the two messages, a destination for the outgoing message and a source for the incoming one, and a common communicator and status. With MPI_Sendrecv_replace, a single buffer, count, and type are specified, together with a destination and send tag and a source and receive tag. This allows data of the same type and size to be sent from and received into a single buffer.
