Scyld ClusterWare HPC: Installation Guide

Troubleshooting ClusterWare

Failing PXE Network Boot

If a compute node fails to join the cluster when booted via PXE network boot, there are several places to look, as discussed below.

Rule out physical problems. Check for disconnected Ethernet cables, malfunctioning network equipment, etc.

Check the compute node's system log. There are several ways to do this:

Open the BeoSetup window, select the node in question, and right-click to access the pop-up menu. Select View Syslog to see the master node's /var/log/messages file, filtered for messages about the selected node. Alternatively, select View BeoBoot Log from the pop-up menu to view the selected node N's boot/status log, /var/log/beowulf/node.N.
Run the standard Linux System Logs tool by selecting System Tools -> System Logs from the desktop menu to open the System Logs window. Select System Log from the list of logs in the left panel, then scroll near the end to see errors that may have been reported while the node was booting.
View /var/log/messages directly. Viewing the file with an editor provides a static display of the error messages, whereas the other methods for accessing the system logs may update the display in real time as messages are written to the log.

The advantage of using BeoSetup's View Syslog option is that it extracts all the node-specific information from the syslog into a single view.
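
A similar node-specific view can be approximated from a terminal. The following is a minimal sketch, not a description of how BeoSetup filters the log: it assumes that the relevant messages mention the node's IP address, and the helper name node_syslog is hypothetical.

```shell
# Hypothetical helper: print the syslog lines that mention a given node.
# Assumption: the messages of interest contain the node's IP address.
# -F treats the address as a fixed string (so "." is not a wildcard);
# -w keeps 10.10.100.10 from also matching 10.10.100.100.
node_syslog() {
    node_ip=$1
    logfile=${2:-/var/log/messages}
    grep -F -w -- "$node_ip" "$logfile"
}

# Example usage (run as root on the master node):
#   node_syslog 10.10.100.100
#   tail /var/log/beowulf/node.0     # boot/status log for node 0
```

For a display that updates as messages arrive, tail -f /var/log/messages can be used in place of an editor.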

Check for the correct DHCP server. If a node fails to appear initially (on power-up), subsequently disappears, or fails to appear in either the Configured Nodes or Unknown panels of the BeoSetup window, then the node may be unable to find the master node's DHCP server. Another DHCP server may be answering and supplying IP addresses.

To check whether the master is seeing the compute node's DHCP requests, or whether another server is answering, use the Linux tcpdump utility. The following example shows a correct dialog between compute node 0 (10.10.100.100) and the master node.

[root@cluster ~]# tcpdump -i eth1 -c 10
Listening on eth1, link-type EN10MB (Ethernet), 
		capture size 96 bytes
18:22:07.901571 IP master.bootpc > 255.255.255.255.bootps: 
		BOOTP/DHCP, Request from .0, length: 548
18:22:07.902579 IP .-1.bootps > 255.255.255.255.bootpc: 
		BOOTP/DHCP, Reply, length: 430
18:22:09.974536 IP master.bootpc > 255.255.255.255.bootps: 
		BOOTP/DHCP, Request from .0, length: 548
18:22:09.974882 IP .-1.bootps > 255.255.255.255.bootpc: 
		BOOTP/DHCP, Reply, length: 430
18:22:09.977268 arp who-has .-1 tell 10.10.100.100
18:22:09.977285 arp reply .-1 is-at 00:0c:29:3b:4e:50
18:22:09.977565 IP 10.10.100.100.2070 > .-1.tftp:  32 RRQ 
		"bootimg::loader" octet tsize 0
18:22:09.978299 IP .-1.32772 > 10.10.100.100.2070: 
		UDP, length 14
10 packets captured
32 packets received by filter
0 packets dropped by kernel
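
On a busy network it can help to restrict the capture to DHCP traffic and to list which hosts are sending replies. The sketch below assumes tcpdump's default one-line-per-packet text output (no -e option) and the eth1 interface from the example above; the helper name dhcp_servers is hypothetical.

```shell
# Hypothetical helper: read tcpdump text output on stdin and print each
# distinct host that sent traffic from the DHCP server port.
# The source is the third field; it ends in ".bootps" when ports are shown
# by name, or ".67" when tcpdump is run with -n.
dhcp_servers() {
    awk '$2 == "IP" && $3 ~ /\.(bootps|67)$/ { print $3 }' | sort -u
}

# Example usage (run as root on the master node). The filter expression
# limits the capture to the BOOTP/DHCP ports:
#   tcpdump -n -i eth1 -c 20 port 67 or port 68 | dhcp_servers
```

If the list contains any source other than the master node, another DHCP server is answering the compute node's requests.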

Check the network interface. Verify that the master node's network interface is properly set up. From the BeoSetup window, click the Configuration button to open the Cluster Configuration window, and check the network interface settings in the Network Properties tab. Then start or reconfigure cluster services again.

Figure 1. BeoSetup Network Properties
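
The interface can also be inspected from a terminal. This is a minimal sketch, assuming the cluster-facing interface is eth1 (as in the tcpdump example) and that the iproute ip utility is installed; the helper name iface_ipv4 is hypothetical.

```shell
# Hypothetical helper: print the IPv4 address/prefix assigned to an
# interface, or nothing if the interface is missing or has no address.
# "ip -o" prints one line per address; the fourth field is address/prefix.
iface_ipv4() {
    ip -o -4 addr show dev "$1" 2>/dev/null | awk '{ print $4 }'
}

# Example usage:
#   addr=$(iface_ipv4 eth1)
#   if [ -n "$addr" ]; then
#       echo "eth1 is configured as $addr"
#   else
#       echo "eth1 has no IPv4 address"
#   fi
```

An empty result suggests that the interface is down or unconfigured, which would prevent compute nodes from reaching the master node's DHCP server.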

Verify that ClusterWare services are running. Choose System Settings -> Server Settings -> Services from the desktop menu to open the Service Configuration applet. Make sure that the beowulf checkbox is checked in the left panel of this window.

Alternatively, you can check the status of ClusterWare services by entering the following command in a terminal window:

[root@cluster ~]# service beowulf status

To restart ClusterWare services from the Service Configuration applet, click the beowulf entry to highlight it, and then click the Restart button in the icon bar. Be sure to click the Save button before exiting the applet.

You can also restart ClusterWare services from the BeoSetup window by choosing File -> Start Cluster or File -> Service Reconfigure from the menu options at the top of the window. Then power-cycle a compute node to see if it now joins the cluster.

Alternatively, restart ClusterWare services from the command line using either this command:

[root@cluster ~]# /etc/init.d/beowulf restart

or this command:

[root@cluster ~]# service beowulf restart

Figure 2. Service Configuration Applet
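
The status check and restart can be combined into one step from the command line. The sketch below assumes that the beowulf init script follows the usual convention of returning exit status 0 from "status" when the service is running; that behavior is not confirmed here, and the helper name ensure_service is hypothetical.

```shell
# Hypothetical helper: restart a service only if its status check fails.
# Assumption: "service NAME status" exits 0 when the service is running.
ensure_service() {
    svc=$1
    if service "$svc" status >/dev/null 2>&1; then
        echo "$svc: running"
    else
        echo "$svc: not running, restarting"
        service "$svc" restart
    fi
}

# Example usage (run as root on the master node):
#   ensure_service beowulf
```

After a restart, power-cycle a compute node to see whether it now joins the cluster.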

Check the switch configuration. If the compute nodes fail to boot immediately on power-up but successfully boot later, the problem may lie with the configuration of a managed switch.

Some Ethernet switches delay forwarding packets for approximately one minute after a link is established while they verify that no network loop has been created (the "spanning tree" check). This delay is longer than the PXE boot timeout on some servers.

Disable the spanning tree check on the switch; the parameter is typically named "fast link enable".
