Scyld ClusterWare HPC: Administrator's Guide

When Master Nodes Fail

When a master node fails, all jobs running on the compute nodes it controls also fail. The bpslave daemon on each compute node times out its connection to the failed master; the node then performs a cold reboot and attempts a PXE boot. If the system has only one master and that master has not rebooted and restarted the beowulf service in time, the compute nodes time out their PXE boot requests and move on to the next boot method specified in each node's BIOS.
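The recovery sequence described above can be sketched as a small model. This is only an illustration for a single-master system, not ClusterWare code: the function name, the `Outcome` values, and both timing parameters are hypothetical, and the real timeouts depend on the bpslave configuration and on the node's PXE firmware and BIOS.

```python
from enum import Enum, auto


class Outcome(Enum):
    REJOINED_MASTER = auto()        # master answered the PXE request in time
    NEXT_BIOS_BOOT_METHOD = auto()  # PXE timed out; BIOS tries its next boot method


def compute_node_recovery(master_recovery_s, pxe_timeout_s):
    """Model what a compute node does after its master fails.

    master_recovery_s: hypothetical time (seconds, measured from the start
        of the node's PXE attempt) until the master has rebooted and
        restarted the beowulf service.
    pxe_timeout_s: hypothetical time (seconds) the node waits for a PXE reply.
    Returns the outcome and the ordered list of events leading to it.
    """
    events = [
        "bpslave connection to master timed out",
        "cold reboot",
        "PXE boot attempt",
    ]
    if master_recovery_s <= pxe_timeout_s:
        events.append("PXE request answered by master")
        return Outcome.REJOINED_MASTER, events
    events.append("PXE request timed out")
    return Outcome.NEXT_BIOS_BOOT_METHOD, events


if __name__ == "__main__":
    # Assumed example values: the master needs 120 s, the node waits 60 s.
    outcome, events = compute_node_recovery(master_recovery_s=120, pxe_timeout_s=60)
    for event in events:
        print(event)
    print(outcome.name)
```

In this model the only thing that decides whether a node rejoins the cluster is whether the master is serving boot requests again before the node's PXE attempt gives up.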

Scyld currently offers only a "Cold Re-parent" multi-master system, in which another master (whether a primary or secondary) is configured as a "fail-over master" for a known set of nodes. See the section "Cold Re-parenting of Compute Nodes" in the chapter "Supporting Multiple Master Nodes" for details.
