Failover

Failover is an important concept in computing today: when one node is lost, other nodes compensate by taking over its workload. It is commonly used for servers, such as web servers.

When Compute Nodes Fail

When a compute node fails, all jobs running on that node will fail. If an MPI job was using that node, the entire job will fail on every node on which the MPI program was running.

Although the jobs running on the failed node are lost, jobs on other nodes that were not communicating with jobs on the failed node will continue to run without a problem.
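The rule above can be illustrated with a minimal Python sketch. It is not part of the cluster software: the job names, node numbers, and the function jobs_failed_by_node are all hypothetical, and each job is modeled simply as the set of node numbers it uses. A job fails if the failed node is in its set, which covers both a single-node job and an MPI job spanning several nodes.

```python
def jobs_failed_by_node(jobs, failed_node):
    """Return the sorted names of jobs that fail when failed_node goes down.

    jobs maps a (hypothetical) job name to the set of node numbers it runs on.
    A job that uses the failed node fails as a whole, even if it also runs
    on healthy nodes, as an MPI job would.
    """
    return sorted(name for name, nodes in jobs.items() if failed_node in nodes)


# Hypothetical workload: one MPI job across three nodes, two single-node jobs.
jobs = {
    "mpi_sim": {0, 1, 2},
    "serial_a": {3},
    "serial_b": {1},
}

failed = jobs_failed_by_node(jobs, failed_node=1)
survivors = sorted(set(jobs) - set(failed))

print("failed:", failed)        # failed: ['mpi_sim', 'serial_b']
print("survivors:", survivors)  # survivors: ['serial_a']
```

Here the loss of node 1 takes down the whole MPI job, including its processes on nodes 0 and 2, while the unrelated job on node 3 keeps running.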

If the problem with the node is easily fixed and you want to bring the node back into the cluster, you need only plug it back into the cluster and turn it on. It will boot immediately, and as soon as it reaches the up state, new jobs can be spawned that will use it.

If you wish to replace the node with a new machine, you need only go into BeoSetup, delete the node that went down, then bring the new node up and move it to the same node number as the node that went down. (See the Chapter called Configuring the Cluster with BeoSetup for more information on how to do this.) Replacing the node in this way will not cause any problems for jobs that are running on other nodes.

Compute Node Data

What happens to data on a compute node after the node goes down depends on how you have the file system on the compute node set up. If you are only using a RAMdisk on your compute nodes, then all data stored on your compute node will be lost when it goes down.

If you are using the hard disk on your compute nodes, there are a few more variables to take into account. If you have your cluster configured to run mke2fs on every compute node boot, then all data that was stored on ext2 file systems on the compute nodes will be destroyed. If you do not have it set to run mke2fs on every compute node boot, then it will try to recover the ext2 file systems with fsck; however, there are no guarantees that the file system will be recoverable.

Note that even if fsck is able to recover the file system, there is a chance that files you were writing to may be in a corrupt or unstable state.
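The cases described in this section can be summarized as a small decision function. This is only a sketch in Python for illustration: the function name data_after_node_failure and its two parameters are hypothetical and do not correspond to any real configuration option or tool in the cluster software.

```python
def data_after_node_failure(uses_hard_disk, mke2fs_on_boot):
    """Describe what happens to compute node data after the node goes down.

    uses_hard_disk: False if the node stores data only on a RAMdisk.
    mke2fs_on_boot: True if the cluster runs mke2fs on every compute node boot.
    Both parameters are illustrative names, not real settings.
    """
    if not uses_hard_disk:
        # RAMdisk contents do not survive the node going down.
        return "lost"
    if mke2fs_on_boot:
        # The ext2 file systems are recreated on boot, destroying old data.
        return "destroyed"
    # fsck is attempted, but recovery is not guaranteed, and files that
    # were being written may be left in a corrupt state.
    return "fsck attempted, recovery not guaranteed"


print(data_after_node_failure(uses_hard_disk=False, mke2fs_on_boot=False))
print(data_after_node_failure(uses_hard_disk=True, mke2fs_on_boot=True))
print(data_after_node_failure(uses_hard_disk=True, mke2fs_on_boot=False))
```

Only the last case leaves any chance of keeping the data, so anything that must survive a node failure should not be stored solely on the compute node.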