Scyld ClusterWare HPC: Programmer's Guide
Scyld ClusterWare system software has been designed to make it easier and more intuitive to write, debug, and tune cluster applications. On the surface, programming a Scyld ClusterWare computer is much like programming any other message passing computer system: the programmer uses a message passing library such as MPI or PVM, or TCP/IP sockets directly, to pass data between processes running on different nodes. The resulting programs may be run interactively, queued with the Scyld BBQ queuing system, or submitted through site-wide job schedulers such as PBS or LSF. Programs are debugged using tools such as gdb or TotalView. While the Scyld ClusterWare programming environment utilizes all of these software components, there is a subtle difference in the machine model that affects how these components are used and makes program development activities flow more naturally.
Two fundamental concepts underlying the Scyld ClusterWare software design are the ability to deploy packaged applications and making those applications easy for end users to run. The Scyld ClusterWare system provides new interfaces that allow an application to dynamically link against the communication library that matches the underlying communication hardware, determine the cluster size and structure, query a user-provided process-to-node mapping function, and present the resulting set of processes as a single locally controlled job. All of these interfaces are designed to let developers create effective applications that run across a cluster without the end user needing to know the details of the machine, or even that the system is composed of multiple independent machines.
One subsystem that the Scyld ClusterWare system software uses to provide this functionality is a unified process space. This is implemented with an in-kernel mechanism named BProc, the Beowulf Process space extension. In the simplest case, BProc is a mechanism for starting processes on remote nodes, much like rsh or ssh. The semantics go considerably beyond merely starting remote processes: BProc implements a global Process ID (PID) space and remote signaling. Using a global PID space across the cluster allows users on the master node to monitor and control remote processes exactly as if they were running locally.
A second subsystem, closely related to BProc, is VMA-migrate. Programmers use this mechanism to create processes on remote machines, controlled through the BProc global PID space. The mechanism is invisible to end users and has a familiar programming interface. Remote processes are created with new variants of the fork() and exec() system calls, and controlled with the existing process control system calls such as kill() and wait(), whose semantics are unchanged from the local POSIX process semantics. An additional system call, bproc_move(), provides the ability to migrate the entire set of Virtual Memory Areas (VMAs) of a process to a remote machine, hence the name VMA-migrate.
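A minimal sketch of this flow is shown below. It assumes a remote fork call named bproc_rfork() declared in a header such as <sys/bproc.h>; only bproc_move() is named explicitly above, so check the BProc header and documentation on your installation for the exact names and prototypes. Everything else in the sketch is ordinary POSIX process control.

    /* Sketch of BProc-style remote process creation and control.
     * bproc_rfork() and the <sys/bproc.h> header name are assumptions;
     * only bproc_move() is named explicitly in this chapter. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <signal.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <sys/bproc.h>      /* BProc interface (assumed header name) */

    int main(int argc, char *argv[])
    {
        int node = (argc > 1) ? atoi(argv[1]) : 0;   /* target compute node */

        pid_t pid = bproc_rfork(node);  /* like fork(), but the child runs on 'node' */
        if (pid < 0) {
            perror("bproc_rfork");
            return 1;
        }
        if (pid == 0) {
            /* Child: running on the remote node, yet still visible in the
             * master's global PID space. */
            printf("child pid %d running remotely\n", (int)getpid());
            _exit(0);
        }

        /* Parent on the master node: standard POSIX calls control the remote child. */
        kill(pid, SIGCONT);            /* signals are forwarded to the remote process */
        int status;
        waitpid(pid, &status, 0);      /* wait() semantics are unchanged */
        printf("remote child %d exited with status %d\n", (int)pid, status);
        return 0;
    }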
The VMA-migrate system is implemented with a transparent system-wide mechanism for library and executable management. This subsystem isolates requests for libraries and executables from other filesystem transactions. The Scyld ClusterWare system takes advantage of the special semantics of libraries and executables to implement a highly effective whole-file-caching system invisible to applications and end users. The end result is a highly efficient process creation and management system that requires little explicit configuration by programmers or administrators.
A third subsystem is the integrated scheduling and process mapping subsystem, named Beomap. This subsystem allows an application to request a set of remote nodes available for use. Either the system administrator or end user may provide alternate scheduling functions. Those functions typically use information from the BeoStat subsystem or a system network topology table to suggest an installation-specific mapping of processes to nodes.
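As an illustration of how an application might consume such a mapping, the following sketch parses a colon-separated list of node numbers from an environment variable. The variable name BEOWULF_JOB_MAP and the colon-separated format are assumptions made for this example only; consult the Beomap documentation for the interface (command-line tool, library call, or environment variable) that your installation actually provides.

    /* Hypothetical sketch: read a Beomap-style process-to-node mapping.
     * The name BEOWULF_JOB_MAP and the "node:node:..." format are assumed
     * for illustration; check your Beomap documentation for the real interface. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        const char *map = getenv("BEOWULF_JOB_MAP");   /* e.g. "0:0:1:1:2:2" */
        if (!map) {
            fprintf(stderr, "no job map in environment\n");
            return 1;
        }

        /* Each colon-separated field is the node assigned to one process rank. */
        char *copy = strdup(map);
        int rank = 0;
        for (char *tok = strtok(copy, ":"); tok; tok = strtok(NULL, ":"), rank++)
            printf("rank %d -> node %d\n", rank, atoi(tok));

        free(copy);
        return 0;
    }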
In this document we detail the important interfaces, services, and mechanisms made available to the programmer by the Scyld ClusterWare system software. In many cases the relevant interfaces are the same as on any standard parallel computing platform. For example, the MPI communication interface provided by BeoMPI on Scyld ClusterWare is identical to that of any cluster using MPICH. In these cases, this document refers to the relevant documentation for the details. In other cases the interface may be the same, yet the behavior differs. For instance, MPI programs run on a single processor until they call MPI_Init(), which allows the application to modify or specify how it should be mapped over the cluster.
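The following standard MPI program illustrates this behavior; nothing Scyld-specific is assumed. Under BeoMPI, the code before MPI_Init() executes as a single local process, and the mapped processes exist on the compute nodes only after MPI_Init() returns.

    /* Standard MPI program. Under Scyld's BeoMPI, code before MPI_Init()
     * runs as a single local process; the job is mapped across the cluster
     * when MPI_Init() is called. All calls here are ordinary MPI API. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        /* Code here runs in a single process. */
        MPI_Init(&argc, &argv);        /* job is mapped onto the cluster here */

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("process %d of %d\n", rank, size);

        MPI_Finalize();
        return 0;
    }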
The interfaces discussed in this document include those that might be pertinent to application programs in addition to those that might be used by system programmers developing servers or environments. In all cases, it is assumed that the readers of this document are intent on developing programs. Most information of interest to more casual users or administrators may be found in the User's Guide or the Administrator's Guide.
The design cycle for developing parallel software is similar to most other software development environments. First a program is designed, then coded using some kind of text editor, then compiled and linked, and finally executed. Debugging consists of iterating through this cycle.
Developing parallel code on a Scyld ClusterWare computer is very similar. All of these activities, except execution, are typically performed on a single machine, just as in traditional software development. Execution occurs on multiple nodes, and thus is somewhat different. However, with the Scyld ClusterWare system, execution of the program is managed from a single machine, so this distinction is less dramatic than with earlier cluster system designs.
The development cycle for a parallel machine differs in many of the details of the basic development steps. In particular, the programmer must design the program for parallel execution, code it using libraries for parallel processing, compile and link against those libraries, and then run it as a group of processes on multiple nodes. Debugging a parallel program is complicated by the fact that multiple processes run as part of the program. The things a programmer does to debug code, such as inserting a print statement, must take into account that they may affect all of the processes; a common pattern is shown below. Accordingly, this document focuses on issues related to the design of parallel programs, the use of parallel programming libraries, compiling and linking, and debugging.
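For instance, a debugging print statement in an MPI program executes in every process, so it helps to tag each message with the process rank and flush the stream so that interleaved output from different nodes remains attributable. The helper name debug_msg below is purely illustrative; the MPI calls are standard.

    /* Debug printing in a parallel program: every process executes this code,
     * so tag output with the MPI rank and flush so interleaved output from
     * different nodes can be told apart. */
    #include <stdio.h>
    #include <mpi.h>

    static void debug_msg(const char *msg)   /* illustrative helper */
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        fprintf(stderr, "[rank %d] %s\n", rank, msg);
        fflush(stderr);
    }

    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);
        debug_msg("reached checkpoint A");   /* printed once per process */
        MPI_Finalize();
        return 0;
    }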
Another important activity in developing software for parallel computers is porting code from other parallel systems to Scyld ClusterWare. The few differences in the Scyld ClusterWare system are usually the result of cleaner semantics or additional features. This document lists known issues so that porting software is an easier task.