Running Parallel Programs

An Introduction to Parallel Programming APIs

Programmers are generally familiar with serial, or sequential, programs. Simple programs — like "Hello World" and the basic suite of searching and sorting programs — are typical of sequential programs. They have a beginning, an execution sequence, and an end; at any time during the run, the program is executing only at a single point.

A thread is similar to a sequential program, in that it also has a beginning, an execution sequence, and an end. At any time while a thread is running, there is a single point of execution. A thread differs in that it isn't a stand-alone program; it runs within a program. The concept of threads becomes important when a program has multiple threads running at the same time and performing different tasks.

To run in parallel means that more than one thread of execution is running at the same time, often on different processors of one computer; in the case of a cluster, the threads are running on different computers. Two things are required to make parallelism work and be useful: the program must migrate to the other computer or computers and get started, and at some point the data upon which the program is working must be exchanged between the processes.

The simplest case is when the same single-process program is run with different input parameters on all the nodes, and the results are gathered at the end of the run. Using a cluster this way, to run the same non-parallel program over many different inputs in less time, is called parametric execution.
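As a simple sketch of parametric execution (the program name and input files here are hypothetical), the same serial program can be started on several compute nodes using the bpsh command discussed later in this chapter, with each copy reading a different input:

[user@cluster user] $ bpsh 0 ./my-serial-prog input-0 > output-0 &
[user@cluster user] $ bpsh 1 ./my-serial-prog input-1 > output-1 &
[user@cluster user] $ wait

Each copy runs independently on its own node, and the output files are collected on the master node when all copies finish.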

A much more complicated example is a simulation, where each process represents some number of elements in the system. Every few time steps, all the elements need to exchange data across boundaries to synchronize the simulation. This situation requires a message passing interface or MPI.

To solve these two problems — program startup and message passing — you can develop your own code using POSIX interfaces. Alternatively, you could utilize an existing parallel application programming interface (API), such as the Message Passing Interface (MPI) or the Parallel Virtual Machine (PVM). These are discussed in the sections that follow.

MPI

The Message Passing Interface (MPI) is currently the most popular API for writing parallel programs. The MPI standard leaves implementation details to the system vendors (like Scyld), which is useful because they can make appropriate implementation choices without adversely affecting the output of the program.

A program that uses MPI is automatically started a number of times and is allowed to ask two questions: How many of us (size) are there, and which one am I (rank)? Then a number of conditionals are evaluated to determine the actions of each process. Messages may be sent and received between processes.
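As a minimal sketch (the program and its messages are illustrative, not part of any Scyld example code), an MPI program that asks both questions and branches on the answers looks like this:

/* Minimal MPI sketch: ask for size and rank, then branch on rank */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
        int size, rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many of us are there? */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which one am I? */

        if (rank == 0)
                printf("Process 0 of %d will coordinate\n", size);
        else
                printf("Process %d of %d will compute\n", rank, size);

        MPI_Finalize();
        return 0;
}

Messages between the processes are then sent and received with calls such as MPI_Send and MPI_Recv.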

The advantages of MPI are that the programmer:

  • Doesn't have to worry about how the program gets started on all the machines

  • Has a simplified interface for inter-process messages

  • Doesn't have to worry about mapping processes to nodes

  • Is insulated from the network details, resulting in more portable, hardware-agnostic software

Also see the section on running MPI-aware programs later in this chapter. Scyld ClusterWare includes several implementations of MPI, including the following:

MPICH. Scyld ClusterWare includes MPICH, a freely available implementation of the MPI standard. MPICH is a project managed by Argonne National Laboratory and Mississippi State University. For more information on MPICH, visit the MPICH web site at http://www-Unix.mcs.anl.gov/mpi/mpich. Scyld MPICH is modified to use BProc and Scyld job mapping support; see the section on job mapping later in this chapter.

MVAPICH. MVAPICH is an implementation of MPICH for Infiniband interconnects. As part of our Infiniband product, Scyld ClusterWare provides a version of MVAPICH, modified to use BProc and Scyld job mapping support; see the section on job mapping later in this chapter. For information on MVAPICH, visit the MVAPICH web site at http://nowlab.cse.ohio-state.edu/projects/mpi-iba.

OpenMPI. OpenMPI is an open-source implementation of the Message Passing Interface 2 (MPI-2) specification. The OpenMPI implementation is an optimized combination of several other MPI implementations, and is likely to perform better than MPICH. For more information on OpenMPI, visit the OpenMPI web site at http://www.open-mpi.org. Also see the section on running OpenMPI programs later in this chapter.

Other MPI Implementations. Various commercial MPI implementations run on Scyld ClusterWare. See the Scyld MasterLink site at http://www.penguincomputing.com/support/masterlink for more information. You can also download and build your own version of MPI, and configure it to run on Scyld ClusterWare.

PVM

Parallel Virtual Machine (PVM) is an earlier parallel programming interface. Unlike MPI, it is not a specification but a single set of source code distributed on the Internet. PVM reveals much more about the details of starting your job on remote nodes, but it does not abstract the implementation details as well as MPI does.

PVM is deprecated, but is still in use by legacy code. We generally advise against writing new programs in PVM, though some of its unique features may still justify its use in particular cases.

Also see the section on running PVM-aware programs later in this chapter.

Custom APIs

As mentioned earlier, you can develop your own parallel API using various Unix and TCP/IP standards. To start a program on remote machines, programs have been written:

  • To use the rexec function call

  • To use the rexec or rsh program to invoke a sub-program

  • To use Remote Procedure Call (RPC)

  • To invoke another sub-program using the inetd super server

These solutions come with their own problems, particularly in the implementation details. What are the network addresses? What is the path to the program? What is the account name on each of the computers? How is one going to load-balance the cluster?

Scyld ClusterWare, which doesn't have binaries installed on the cluster nodes, may not lend itself to these techniques, so we recommend that you write your parallel code in MPI. That said, Scyld does have some experience getting rexec() calls to work, and calls to rsh can simply be replaced with calls to the more cluster-friendly bpsh.
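For example, where a script on a conventional cluster might call rsh with a hostname, the Scyld equivalent uses bpsh with a node number (the program name and node number below are hypothetical):

[user@cluster user] $ bpsh 1 ./my-prog arg1 arg2

This runs ./my-prog on compute node 1 and forwards its standard input and output back to your shell, much as rsh would.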

Mapping Jobs to Compute Nodes

Running programs specifically designed to execute in parallel across a cluster requires at least the knowledge of the number of processes to be used. Scyld ClusterWare uses the NP environment variable to determine this. The following example will use 4 processes to run an MPI-aware program called a.out, which is located in the current directory.

[user@cluster user] $ NP=4 ./a.out

Note that each kind of shell has its own syntax for setting environment variables; the example above uses the syntax of the Bourne shell (/bin/sh or /bin/bash).

What the example above does not specify is which specific nodes will execute the processes; this is the job of the mapper. Mapping determines which node will execute each process. While this seems simple, it can get complex as various requirements are added. The mapper scans available resources at the time of job submission to decide which processors to use.

Scyld ClusterWare includes beomap, a mapping API (documented in the Programmer's Guide with details for writing your own mapper). The mapper's default behavior is controlled by environment variables such as NP, NO_LOCAL, and EXCLUDE, which are illustrated in the examples later in this chapter.

You can use the beomap program to display the current mapping for the current user in the current environment with the current resources at the current time. See the Reference Guide for a detailed description of beomap and its options, as well as examples for using it.
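For example, to preview the mapping that would be chosen for a 4-process job that excludes the master node, you can combine the mapping environment variables with beomap; the exact output depends on the nodes available at the time you run it:

[user@cluster user] $ NP=4 NO_LOCAL=1 beomap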

Running MPICH Programs

MPI-aware programs are those written to the MPI specification and linked with the Scyld MPICH library. This section discusses how to run MPI-aware programs and how to set mapping parameters from within such programs. Examples of running MPI-aware programs are also provided.

For information on building MPI programs, see the Programmer's Guide.

mpirun

Almost all implementations of MPI have an mpirun program, which shares the syntax of mpprun but offers additional features for MPI-aware programs.

In the Scyld implementation of mpirun, all of the options available via environment variables or flags through directed execution are available as flags to mpirun, and can be used with properly compiled MPI jobs. For example, the command for running a hypothetical program named my-mpi-prog with 16 processes:

[user@cluster user] $ mpirun -np 16 my-mpi-prog arg1 arg2

is equivalent to running the following commands in the Bourne shell:

[user@cluster user] $ export NP=16
[user@cluster user] $ my-mpi-prog arg1 arg2

Setting Mapping Parameters from Within a Program

A program can be designed to set all the required parameters itself. This makes it possible to create programs in which the parallel execution is completely transparent. However, it should be noted that this will work only with Scyld ClusterWare, while the rest of your MPI program should work on any MPI platform.

Use of this feature differs from the command line approach in that all options that would otherwise be set on the command line can be set from within the program. It may be used only with programs specifically designed to take advantage of it, rather than with any arbitrary MPI program. However, it makes it possible to produce turn-key applications and parallel library functions in which the parallelism is completely hidden.

Following is a brief example of the necessary source code to invoke mpirun with the -np 16 option from within a program, to run the program with 16 processes:

/* Standard MPI include file */
#include <mpi.h>
/* Needed for setenv() */
#include <stdlib.h>

int main(int argc, char **argv)
{
        setenv("NP", "16", 1);          /* set up the mpirun environment variables */
        MPI_Init(&argc, &argv);
        /* ... the parallel work of the program goes here ... */
        MPI_Finalize();
        return 0;
}

More details for setting mapping parameters within a program are provided in the Programmer's Guide.

Examples

The examples in this section illustrate certain aspects of running a hypothetical MPI-aware program named my-mpi-prog.

Example 7. Specifying the Number of Processes

This example shows a cluster execution of a hypothetical program named my-mpi-prog run with 4 processes:

[user@cluster user] $ NP=4 ./my-mpi-prog

An alternative syntax is as follows:

[user@cluster user] $ NP=4
[user@cluster user] $ export NP
[user@cluster user] $ ./my-mpi-prog

Note that the user specified neither the nodes to be used nor a mechanism for migrating the program to the nodes. The mapper does these tasks, and jobs are run on the nodes with the lowest CPU utilization.

Example 8. Excluding Specific Resources

In addition to specifying the number of processes to create, you can also exclude specific nodes as computing resources. In this example, we run my-mpi-prog again, but this time we not only specify the number of processes to be used (NP=6), but we also exclude the master node (NO_LOCAL=1) and some cluster nodes (EXCLUDE=2:4:5) as computing resources.

[user@cluster user] $ NP=6 NO_LOCAL=1 EXCLUDE=2:4:5 ./my-mpi-prog

Running OpenMPI Programs

OpenMPI programs are those written to the MPI-2 specification. This section provides the information needed to run programs with the OpenMPI implementation included in Scyld ClusterWare.

Prerequisites to Running OpenMPI

A number of commands, such as mpirun, are duplicated between OpenMPI and other MPI implementations. Scyld ClusterWare provides the env-modules package, which gives users a convenient way to switch between the various implementations. Be sure to load an OpenMPI module so that the OpenMPI commands, located in /usr/openmpi/, take precedence over the MPICH commands, which are located in /usr/. Each module bundles together the compiler-specific environment variables that configure your shell for building and running your application, and for accessing compiler-specific manpages. Be sure to load the module that matches the compiler used to build the application you wish to run. For example, to load the OpenMPI module for use with the Intel compiler, do the following:

[user@cluster user] $ module load openmpi/intel

Currently, there are modules for the GNU, Intel, PGI, and Pathscale compilers. To see a list of all of the available modules:

[user@cluster user] $ module avail openmpi
------------------------------- /opt/modulefiles -------------------------------
openmpi/gnu(default) openmpi/path
openmpi/intel        openmpi/pgi

For more information about creating your own modules, see http://modules.sourceforge.net and the manpages man module and man modulefile.

Using OpenMPI

Unlike the Scyld ClusterWare MPICH implementation, OpenMPI does not honor the Scyld Beowulf job mapping environment variables. You must specify the list of hosts either on the command line or in a hostfile. To specify the list of hosts on the command line, use the -H option; the argument following -H is a comma-separated list of hostnames, not node numbers. For example, to run a two-process job, with one process running on node 0 and one on node 1:

[user@cluster user] $ mpirun -H n0,n1 -np 2 ./mpiprog
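Alternatively, the same hosts can be listed in a hostfile, one per line, optionally with a slot count. For example, a hostfile named my-hosts (the file name and slot counts here are arbitrary) containing:

n0 slots=1
n1 slots=1

can be passed to mpirun as follows:

[user@cluster user] $ mpirun --hostfile my-hosts -np 2 ./mpiprog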

Support for running jobs over Infiniband using the OpenIB transport is included with OpenMPI distributed with Scyld ClusterWare. Much like running a job with MPICH over Infiniband, one must specifically request the use of OpenIB. For example:

[user@cluster user] $ mpirun --mca btl openib,sm,self -H n0,n1 -np 2 ./myprog

Read the OpenMPI mpirun man page for more information about using a hostfile and about the other tunable options available through mpirun.

Running PVM-Aware Programs

Parallel Virtual Machine (PVM) is an application programming interface for writing parallel applications, enabling a collection of heterogeneous computers to be used as a coherent and flexible concurrent computational resource. Scyld has developed the Scyld PVM library, specifically tailored to allow PVM to take advantage of the technologies used in Scyld ClusterWare. A PVM-aware program is one that has been written to the PVM specification and linked against the Scyld PVM library.

A complete discussion of cluster configuration for PVM is beyond the scope of this document. However, a brief introduction is provided here, with the assumption that the reader has some background knowledge on using PVM.

You can start the master PVM daemon on the master node using the PVM console, pvm. To add a compute node to the virtual machine, issue an add .# command, where # is replaced by a node's assigned number in the cluster.

Tip

You can generate a list of node numbers using either the beosetup or the bpstat command.

Alternatively, you can start the PVM console with a hostfile filename on the command line. The hostfile should contain a .# entry for each compute node you want as part of the virtual machine. As with standard PVM, this method automatically spawns PVM slave daemons on the specified compute nodes in the cluster. From within the PVM console, use the conf command to list your virtual machine's configuration; the output will include a separate line for each node being used. Once your virtual machine has been configured, you can run your PVM applications as you normally would.
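For example, a short console session that builds a two-node virtual machine by hand and then inspects it might look like the following (the node numbers are illustrative):

[user@cluster user] $ pvm
pvm> add .0
pvm> add .1
pvm> conf
pvm> quit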

Porting Other Parallelized Programs

Programs written for use on other types of clusters may require various levels of change to function with Scyld ClusterWare.

For more information on porting applications, see the Programmer's Guide.