Scyld ClusterWare HPC: Programmer's Guide
<< Previous		Next >>

BProc Overview

The heart of the Scyld ClusterWare single system image is the BProc system. BProc consists of kernel modules and daemons that work together to create a unified process space and implement directed process migration. BProc is used to create a single cluster process space, with processes anywhere on the cluster appearing on the master as if they were running there. This allows users to monitor and control processes using standard, unmodified Unix tools.

The BProc system provides a library API that is used to create and manage processes on the compute nodes, as well as provide cluster membership state and limit node access. These calls are used by libraries such as MPI and PVM to create and manage tasks. They may also be used directly by the programmer to implement parallel software that is less complex and more functional than most other cluster programming interfaces.

For applications that have intense, complex communication patterns working on a single end goal, it is normally advisable to use message passing libraries such as MPI or PVM. The MPI and PVM interfaces are portable, provide a rich set of communication operations, and are widely understood among application programmers. The Scyld ClusterWare system provides specialized implementations of PVM and MPI that internally use the BProc system to provide cleaner semantics and higher performance.

There are many applications where it is not appropriate to use these message passing libraries. Directly using the native BProc API can often result in a much less complex implementation or provide unique functionality difficult to achieve with a message passing system. Most programming libraries, environments, tools, and services are better implemented directly with the BProc API.

A master node in a Scyld ClusterWare system manages a collection of processes that exist in a distinct process space. Each process in this process space has a unique PID, and can be controlled using Unix signals. New processes created by a process in a given process space remain in the process space as children of the process that created them. When processes terminate, they notify the parent process through a signal. BProc allows a process within a master node's process space to execute on a compute node. This is accomplished by starting a process on the compute node — which technically is outside the master node's process space — and binding it to a "ghost" process on the master node. This ghost process represents the process on the master node, and any operations performed on the ghost process are, in turn, performed on the remote process. Further, the remote node's process table must be annotated to indicate that the process is actually bound to the process space of the master. Thus, if the process performs a system call relative to its process space, such as creating a new process, or calling the kill() system call, that system call is performed in the context of the master's process space.

A process running on a compute node has a PID in the process space of the compute node, and an independent process running on the compute node can see that process with that PID. From the perspective of the process itself, it appears to be executing on the master node and has a PID within the master node's process space.

The BProc API is divided into three groups of functions. First, the machine information functions provide a means of discovering what compute nodes are available to the program, what the status of nodes is, what the current node is, etc. Second, the process migration functions provide for starting processes, moving processes, and running programs on compute nodes. Finally, a set of process management functions allows programs to manage how processes are run on compute nodes.

Starting a Process with BProc

The most fundamental task one performs with BProc is to start a new process. For traditional Linux programs the mechanism for starting a new process is fork(). Using fork(), a Linux process creates an exact duplicate of itself, right up to and including the contents of memory and the context of the program. The new process is a child of the original process and has a new process ID which is returned to the original process. The fork() call returns zero to the new process.

BProc provides a variation on the fork() system call:

	int bproc_rfork(int node);

This function behaves like Linux fork(), except that the process is started on a remote node specified by the node argument, and has a ghost process created on the master node. The function returns the PID of the ghost process to the original process and zero to the new process.

Note that there are important differences between Linux fork() and bproc_rfork(). BProc has a limited ability to deal with file descriptors when a process forks. Under Scyld ClusterWare, files are opened directly on the compute node, thus a file opened on one node before a call to bproc_rfork() does not translate cleanly into an open file on the remote node. Similarly, open sockets do not move when processes move and should not be expected to.

A process started with bproc_rfork does have stdout and stderr connected to the stdout and stderr of the original process and IO is automatically forwarded back to those sockets. Variations on the bproc_rfork() call allow the programmer to control how IO is forwarded for stdout, stderr, and stdin, and can also control how shared libraries are moved for the process. See the BProc reference manual for more details.

The bproc_rfork() call is a powerful mechanism for starting tasks that will act as part of a parallel application. Most applications that are written for Beowulf class computers consist of a single program that is replicated to all of the tasks that will run as part of that job. Each task identifies itself with a unique number, and in doing so each selects part of the total computation to perform. The bproc_rfork() call works well in this scenario because a program can readily duplicate itself using the bproc_rfork() system call. In order to coordinate the tasks, it is necessary to establish communication between the tasks and between each task and the original process. Again, the semantics of bproc_rfork() are ideal because the original task can set up a communication endpoint (such as a socket) before it calls bproc_rfork() to start the tasks at which point each task will automatically know the address of the original process and can establish communication. Once this is accomplished, the original process can exchange information between the tasks so that they can communicate among themselves.

	/* Create running processes on nodes 1, 5, 24, and 65. */
	int target_nodes[] = { 1, 5, 24, 65};
	for (i = 0; i < tasks; i++) 
	{
		int pid = bproc_rfork(target_nodes[i]);
		if (pid == 0)	/* I'm the child */
			break;	/* only the parent does bproc_rfork(). */
	}

Thus the bproc_rfork() call forms the core of parallel processing libraries such as MPI or PVM by providing a convenient means for starting tasks and establishing communication. This same facility can be used to start parallel tasks when a non-MPI (or non-PVM) program is desired. While this is not as portable and convenient — and thus not well suited to most applications — it may be appropriate for some system services, library development, and toolkits.

BProc does not limit the programmer to using fork() semantics for starting processes. Some situations call for starting a new program from an existing process much like the Linux system call execve(). For this, BProc provides two different calls:

	int bproc_rexec(int node, char *cmd, char **argv, char **envp);
	int bproc_execmove(int node, char *cmd, char **argv, char **envp);

These two functions are almost identical in that they cause the current process to be overwritten by the contents of the executable file specified by cmd which is then executed from the beginning with the arguments and environment given in argv and envp. In this the semantics are very similar to the Linux execve() system call. The difference is that the process moves to the node specified, leaving a ghost process on the master node. All of the same caveats that apply to bproc_rfork() apply to these calls as well, and there are other variations that allow additional control of IO forwarding and library migration.

The difference between these two calls is where the arguments are evaluated. With bproc_rexec(), the executable file is expected to reside on the compute node where the process will run. Thus with this call, the process first moves to the new node, and then loads the new program. With the bproc_execmove() call, the executable file is expected to reside on the node where the original process resides. Thus, this call loads the new program first, and then moves the process to the new node. In the case where the executable file resides on a network file system that is visible from all nodes, the effect of these two calls is essentially the same, though the bproc_execmove() call may be more efficient if the original process exists on the node where the executable file actually resides.

Finally, BProc provides a means to simply move a process from one node to another:

	int bproc_move(int node);

This call does not create a new process, other than possibly creating a ghost process, but simply causes the original process to move to a different node. Again, all caveats of bproc_fork() apply. This call is probably of most value in load balancing mechanisms but may be useful in situations where each task in a program needs to perform some IO on the master node to load its memory before moving out to a remote node.

<< Previous	Home	Next >>
Introduction		Getting Information About the Parallel Machine