BeoMPI, ClusterWare's implementation of MPI, is based on MPICH, a standard, portable MPI implementation developed at Argonne National Laboratory. The most common configuration for MPICH under Scyld ClusterWare uses a message-passing library named "P4." The P4 library provides message transport over TCP/IP connections and is thus suitable for use with Ethernet and Ethernet-like networks. (An alternative is GM, which works only with Myricom's Myrinet hardware.)
The most striking difference in how BeoMPI operates under Scyld ClusterWare, versus MPICH on other clusters, is how MPI tasks are started. In the basic MPICH implementation, MPI programs are started using a special script, "mpirun." This script takes a number of arguments that specify, among other details, how many tasks to create and where to run those tasks on the parallel machine. The script then uses some mechanism to start the tasks on each node; often utilities such as "rsh" or "ssh" are used. Thus a typical execution of an MPI program would look like:
mpirun -np 16 mpiapp inf outf
which runs 16 tasks of the program "mpiapp" with arguments "inf" and "outf."
The Scyld ClusterWare system software provides an "mpirun" script that works just like the standard script, except in two respects. First, the Scyld ClusterWare script accepts a few options that are unique to Scyld ClusterWare:
-np <int>                 run <int> tasks
--all_cpus                run a task on each CPU
--all_local               run all tasks on the master node
--no_local                run no tasks on the master node
--exclude <int>[:<int>]   run on any nodes not listed
--map <int>[:<int>]       run on the nodes listed
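For example, a hypothetical invocation combining these options (the flag order and the choice of 16 tasks are illustrative) would keep every task off the master node:

mpirun --no_local -np 16 mpiapp inf outf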
Second, the way MPI jobs are run under Scyld ClusterWare is semantically cleaner than on other distributed or older-generation cluster systems. By using the internal BProc mechanism, processes are started on remote machines more quickly and can be directly monitored and terminated.
Rather than start independent copies of the program that later coordinate, the Scyld ClusterWare MPI implementation starts a single program. When MPI_Init() is called, bproc_rfork() is used to create multiple copies of the program on the compute nodes. Thus, the Scyld ClusterWare implementation of MPI has semantics similar to those of shared-memory supercomputers such as the Cray T3E, rather than the semantics of a collection of independent machines. The initialization code is run only once, which allows configuration files or locks to be updated without a distributed file system or explicit file locking. MPI_Init determines how many tasks to create by checking an environment variable; the Scyld ClusterWare implementation of mpirun therefore just parses the mpirun flags, sets the appropriate environment variables, and then runs the program. The environment variables recognized by MPI_Init are as follows:
NP=<int>
ALL_CPUS=Y
ALL_LOCAL=Y
NO_LOCAL=Y
EXCLUDE=<int>[:<int>]
BEOWULF_JOB_MAP=<int>[:<int>]
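To make the single-program semantics concrete, here is a minimal C sketch (the messages are illustrative, and the comments assume the BeoMPI behavior described above): code placed before MPI_Init() runs exactly once on the master node, while code after it runs in every task.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    /* Under BeoMPI this section runs exactly once, on the master node,
       before the program is replicated to the compute nodes.  It is a
       convenient place to read or update a configuration file without
       distributed file locking. */
    printf("one-time setup before MPI_Init\n");

    /* MPI_Init() creates the remaining task copies (via bproc_rfork()),
       using the environment variables listed above to decide how many
       tasks to create and where to place them. */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("task %d of %d running\n", rank, size);

    MPI_Finalize();
    return 0;
}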
Using these environment variables, an MPI program can be run on a Scyld ClusterWare system without any special scripts or flags. Thus, if the environment variable NP has been set to 16, the example listed above could be run as:
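NP=16 mpiapp inf outf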
Further, turn-key applications can be developed that set these environment variables from inside the program, before MPI_Init is called, so that the user doesn't even have to be aware that the program runs in parallel.
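A minimal sketch of this idea, assuming the application simply sets NP with setenv() before initializing MPI (the task count of 16 is arbitrary):

#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    /* Choose the degree of parallelism on the user's behalf, before
       MPI_Init() examines the environment and creates the tasks.
       The final argument of 0 means an existing NP setting is kept. */
    setenv("NP", "16", 0);

    MPI_Init(&argc, &argv);
    /* ... the rest of the application ... */
    MPI_Finalize();
    return 0;
}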
A number of the flags provided under Scyld ClusterWare are useful for scheduling and debugging. The --all_local flag (ALL_LOCAL variable) is important for debugging, because it is much simpler to debug a program with all of its tasks running on the master node than with the tasks spread across the compute nodes. The --no_local flag (NO_LOCAL variable) is usually used to keep all tasks on the compute nodes, so that the master node is not loaded with processes that compete with interactive work. The --exclude flag (EXCLUDE variable) can be used to keep tasks off of nodes that are reserved for special purposes such as I/O, visualization, or debugging. The --map flag (BEOWULF_JOB_MAP variable) can be used to run tasks on specific nodes; a scheduler can use it to control where processes run in a multi-user, multi-job environment.
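For instance, a scheduler might place a four-task job on nodes 0 and 1, two tasks on each. The invocations below are hypothetical; they assume the colon-separated node list shown above and that the task count is given separately with -np or NP:

mpirun -np 4 --map 0:0:1:1 mpiapp inf outf

or, setting the environment variables directly:

NP=4 BEOWULF_JOB_MAP=0:0:1:1 mpiapp inf outf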