Cold Re-parenting of Compute Nodes

If a master node dies or otherwise becomes unresponsive, each of its compute nodes detects that the master is unresponsive and reboots. In the previous section, each compute node was statically partitioned to one and only one master node. In that configuration, compute nodes that are partitioned to the now-unresponsive master node can be expected to fail to reboot successfully, because they have no other master to connect to.

An alternative cluster configuration allows for "cold re-parenting" of compute nodes. Each masterorder entry may declare multiple IP addresses instead of just one, forming an ordered list of master nodes that may manage the specified set of compute nodes. When a compute node reboots, it attempts to connect to each listed master node in turn until one of them responds; that master then serves as the compute node's master.

Example 2. Configuring Compute Nodes with Two Masters

Extending the previous example, set the masterorder entries to read as follows:

	masterorder  0-19 10.1.1.1 10.1.1.2
	masterorder 20-39 10.1.1.2 10.1.1.1

If the master at 10.1.1.1 fails, then compute nodes 0-19 automatically reboot and are re-parented to the master at 10.1.1.2. Compute nodes 20-39 are unaffected, because 10.1.1.2 is already their first-listed master. If and when the master at 10.1.1.1 becomes available again, the compute nodes can be re-parented back to their original primary master by running bpctl -S 0-19 -R on the master at 10.1.1.2.
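
The ordered-list behavior described above can be illustrated with a short Python sketch. This is not how the cluster software itself is implemented; it only models the selection rule "try each listed master in order and take the first that responds" against the two entries from this example. The names parse_masterorder, pick_master, and is_responsive are hypothetical, and the sketch assumes entries of the form shown here (a node range followed by one or more addresses); accepting a single node number in place of a range is likewise an assumption.

```python
def parse_masterorder(lines):
    """Return a list of (first_node, last_node, [master_ips]) tuples.

    Hypothetical helper: assumes each entry looks like
    'masterorder <first>-<last> <ip> [<ip> ...]'.
    """
    entries = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or fields[0] != "masterorder":
            continue  # skip blank lines and unrelated directives
        first, _, last = fields[1].partition("-")
        # Assumption: a bare node number means a one-node range.
        entries.append((int(first), int(last or first), fields[2:]))
    return entries


def pick_master(entries, node, is_responsive):
    """Return the first responsive master listed for `node`, or None."""
    for first, last, masters in entries:
        if first <= node <= last:
            for ip in masters:  # listed order is the preference order
                if is_responsive(ip):
                    return ip
            return None  # node is listed, but no master answered
    return None  # node is not covered by any entry


CONFIG = [
    "masterorder  0-19 10.1.1.1 10.1.1.2",
    "masterorder 20-39 10.1.1.2 10.1.1.1",
]

entries = parse_masterorder(CONFIG)

# Simulate the failure of the master at 10.1.1.1.
down = {"10.1.1.1"}
for node in (5, 25):
    master = pick_master(entries, node, lambda ip: ip not in down)
    print(f"node {node} -> {master}")
```

With 10.1.1.1 marked as down, both node 5 and node 25 resolve to 10.1.1.2. With no master down, node 5 resolves to 10.1.1.1, its first-listed master.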