Porting Applications to Scyld ClusterWare

Porting applications to Scyld ClusterWare is generally straightforward. Scyld ClusterWare is based on Linux and provides the same set of APIs and services as typical Linux distributions, along with enhanced implementations of cluster-specific tools such as MPI. There are, however, a number of areas where Scyld ClusterWare differs, or more precisely where certain services are not provided or behave in unexpected ways. Of course, most applications for Scyld ClusterWare will already need to be parallelized using MPI, PVM, or something similar. This chapter does not deal with parallelization issues, but rather with places where the standard operating system environment might not be the same as on other Linux systems.

The first concern is system and library calls that are either supported differently or not at all under Scyld ClusterWare. The master node is an extended standard Linux installation, so all non-parallel applications work as expected on the master node. On compute nodes, however, several things are different. In general, compute nodes do not have local databases, so the standard "lookup" name services may be configured differently. In particular, some of the name services that normally use static local files, such as gethostbyname(), getprotobyname(), and getrpcbyname(), do not have local databases by default. Under Scyld ClusterWare, compute nodes are numbered from 0 to N and are identified by their number with a leading "dot": node 4 is ".4", node 12 is ".12", and so on. The bproc library provides calls to look up the current host number and to obtain addresses for other nodes. On compute nodes, host name lookups for these node names are handled by bproc, so gethostname() will typically return the node number and gethostbyname() can be used to get the IP address of a particular node. Note, however, that the compute nodes are not aware of the master node, its host name, or its network name. Generally, these calls should not be used for anything beyond node-number lookups.
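As an illustration, the following minimal C sketch shows how a program running on a compute node might discover its own node name with gethostname() and resolve another node's address with gethostbyname(), as described above. The node name ".12" is only an example, and the exact string returned by gethostname() may vary with your installation.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netdb.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int main(void)
    {
        char name[64];
        struct hostent *he;
        struct in_addr addr;

        /* On a compute node this typically returns the node name, e.g. ".4" */
        if (gethostname(name, sizeof(name)) != 0) {
            perror("gethostname");
            return 1;
        }
        printf("running on node %s\n", name);

        /* Resolve another compute node by its dotted node name, e.g. ".12" */
        he = gethostbyname(".12");
        if (he == NULL) {
            fprintf(stderr, "lookup of .12 failed\n");
            return 1;
        }
        memcpy(&addr, he->h_addr_list[0], sizeof(addr));
        printf("node .12 has address %s\n", inet_ntoa(addr));
        return 0;
    }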

Another major difference is that compute nodes typically do not run many of the standard services, including sendmail, ftpd, telnetd, rexecd, and rshd. Thus, the common technique of running a program on a remote machine with rsh does not work on a compute node unless the rexec daemon is explicitly started there, nor can one get a shell on a compute node using rlogin or telnet. Instead, bpsh provides a comparable means of accessing nodes in the cluster, including running remote programs.
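For example, a program on the master node can launch a job on a compute node by invoking bpsh itself, as in the sketch below. The node number 4 and the path /usr/local/bin/myprog are placeholders for illustration; consult the bpsh documentation on your system for the exact options it supports.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    /* Run a program on compute node .4 by invoking bpsh from the master. */
    int main(void)
    {
        pid_t pid = fork();

        if (pid < 0) {
            perror("fork");
            return 1;
        }
        if (pid == 0) {
            /* bpsh <node> <command> [args...] runs the command on that node */
            execlp("bpsh", "bpsh", "4", "/usr/local/bin/myprog", (char *)NULL);
            perror("execlp");   /* reached only if bpsh could not be started */
            _exit(127);
        }

        int status;
        waitpid(pid, &status, 0);
        return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
    }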

If desired, a user can start a shell on a compute node via bpsh, but the default compute node configuration does not have access to utility programs such as cp, mv, or even ls, so executing a shell on a compute node is not particularly useful. This is especially problematic when a shell on a compute node tries to run a shell script: unless the script has access to the binaries it needs, it will fail. The proper approach is to run scripts on the master node and use bpsh to run programs on the remote nodes.

In most respects, Scyld ClusterWare works much like standard Linux, but some details that are not an issue on a standard system must be considered on a Scyld system. For example, standard output will not automatically flush as often as on a native system, so interactive programs need to call fflush(stdout) to ensure that prompts are actually written to the screen. License keys may also present an issue when running on a Scyld ClusterWare system. Depending on the details of the license server your software uses, problems may arise in distributing and verifying keys for the compute nodes.
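The short C fragment below illustrates the flushing issue: an interactive prompt is followed by an explicit fflush(stdout) so that the prompt is visible before the program blocks waiting for input. The prompt text is, of course, only an example.

    #include <stdio.h>

    int main(void)
    {
        char answer[32];

        /* Without the explicit flush, the prompt may not appear before
         * the program blocks waiting for input. */
        printf("Enter the number of iterations: ");
        fflush(stdout);

        if (fgets(answer, sizeof(answer), stdin) == NULL)
            return 1;

        printf("read: %s", answer);
        fflush(stdout);
        return 0;
    }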

In summary, there are a number of issues that need to be considered when porting an application to Scyld ClusterWare. Among these are:

- Name service lookups such as gethostbyname() behave differently on compute nodes, which know about node numbers but not about the master node's host name.
- Compute nodes do not run standard services such as rshd, telnetd, or sendmail, so programs must be started on them with bpsh rather than rsh.
- Compute nodes lack the common utilities needed by shell scripts, so scripts should run on the master node.
- Standard output is not flushed as often as on a native system, so interactive programs should call fflush(stdout).
- License keys may be difficult to distribute and verify on compute nodes, depending on the license server used.

In general, the best solution to these issues is to have code on the master node do lookups and pass the results to compute nodes, use bpsh for running programs on nodes, and consider issues such as scripts, standard output, and licenses.