Scyld ClusterWare HPC: Administrator's Guide

Failover
When a master node fails, every job running on the compute nodes it controls also fails. The bpslave daemon on each of those compute nodes times out its connection to the failed master, and the node then performs a cold reboot and attempts to PXE-boot. If the system has only one master and that master has not rebooted and restarted the beowulf service in time, the compute nodes' PXE-boot requests time out, and each node moves on to the next boot method specified in its BIOS.
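The sequence above can be summarized as a small state walk. The following is only an illustrative sketch in Python, not ClusterWare code: the state names, the `failover_sequence` function, and the `master_back_in_time` flag are hypothetical, and the "PXE boot answered" branch assumes that a master which restarts the beowulf service in time will answer the node's PXE request.

```python
from enum import Enum, auto
from typing import List


class NodeState(Enum):
    """Hypothetical labels for the stages a compute node passes through."""
    JOBS_RUNNING = auto()
    BPSLAVE_TIMEOUT = auto()        # bpslave gives up on the failed master
    COLD_REBOOT = auto()
    PXE_BOOT_ATTEMPT = auto()
    PXE_BOOT_ANSWERED = auto()      # assumption: master came back in time
    NEXT_BIOS_BOOT_METHOD = auto()  # PXE requests timed out


def failover_sequence(master_back_in_time: bool) -> List[NodeState]:
    """Return the ordered stages a compute node goes through after its master fails."""
    stages = [
        NodeState.JOBS_RUNNING,
        NodeState.BPSLAVE_TIMEOUT,
        NodeState.COLD_REBOOT,
        NodeState.PXE_BOOT_ATTEMPT,
    ]
    if master_back_in_time:
        stages.append(NodeState.PXE_BOOT_ANSWERED)
    else:
        stages.append(NodeState.NEXT_BIOS_BOOT_METHOD)
    return stages


if __name__ == "__main__":
    # Single-master system whose master did not come back in time.
    for stage in failover_sequence(master_back_in_time=False):
        print(stage.name)
```

The only branch point is whether a master is serving PXE requests again before the node's requests time out; everything before that point happens regardless.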
Scyld currently offers only a "Cold Re-parent" multi-master system, in which another master (either a primary or a secondary) is configured as the "fail-over master" for a known set of nodes. For details, see the section "Cold Re-parenting of Compute Nodes" in the chapter "Supporting Multiple Master Nodes".