Notable Feature Enhancements And Bug Fixes

New in ClusterWare 5.3.0

  1. The base kernel is upgraded to 2.6.18-128.1.1. See https://rhn.redhat.com/errata/RHSA-2009-0225.html and https://rhn.redhat.com/errata/RHSA-2009-0264.html for details. This kernel requires a base distribution of RHEL5-U3 or CentOS 5.3.

  2. Includes the RHEL5-U2 tg3 driver (version 3.86) to supersede the flawed RHEL5-U3 tg3 driver (version 3.93), which occasionally misbehaves with Broadcom NetXtreme or NetLink controllers. The misbehavior consists of initially linking at the expected 1000 Mbps/Full, then disconnecting and relinking at a much slower 10 Mbps/Half.

  3. Includes the new ClusterAdmin Web Interface, a cluster monitoring tool that leverages the Scyld ClusterWare Integrated Management Framework (IMF). See the Section called Optionally enable IMF ClusterAdmin Web Interface and the Administrator's Guide for details.

  4. Includes ENV Modules, which allow a user to easily switch between applications with a simple module switch command that resets environment variables like PATH and LD_LIBRARY_PATH. See the Programmer's Guide for details, and see the Section called Issues with ENV Modules and TORQUE for a necessary workaround.

  5. Includes a simplified method to enable cluster-wide NFS locking. See the Section called Optionally enable NFS locking and the Administrator's Guide for details.

  6. Includes compiler-specific FFTW libraries in /usr/lib64/FFTW/.

  7. OpenMPI is upgraded to 1.2.9. This new package also includes per-compiler binaries, which required relocating some OpenMPI files. Previously, there was only one set of OpenMPI binaries, located in /usr/openmpi/bin/, regardless of the compiler toolchain used. Now the binaries reside in /usr/openmpi/bin/compilerName/, named for the compiler that built them. The OpenMPI include files have also moved from /usr/openmpi/include to /usr/include/openmpi/. The library locations have not changed.

  8. TORQUE is upgraded to version 2.3.6.

  9. Ganglia is upgraded to version 3.0.7.

  10. Fixes a bug in the ClusterWare Name Service that leaked a file descriptor in a software thread's compute node environment on each Name Service request. The thread eventually became stuck in a loop, consuming 100% of its CPU and making no forward progress. This was observed with the TORQUE pbs_mom daemon and in any other user program that issues more than 1024 Name Service requests.

  11. Fixes a bug where /usr/bin/qdel could not delete jobs running on a compute node if Sun Grid Engine (SGE) was configured with an admin_user other than root.

  12. Fixes a shortcoming in Ganglia: it was not displaying pie chart node metrics.

  13. Fixes a bug where BProc gets confused about a process' true node residency, incorrectly believing the process is executing on the master node, even though the process is in fact correctly executing on the intended compute node. This bug is due to a timing window and tends to occur (if at all) in circumstances of high rates of process creation and/or destruction. The effect of this bug is that these "bad bookkeeping" processes are outside BProc's unified process space and are thus immune to normal process management performed on the master node, e.g., /bin/kill or signal().
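
As an illustration of the OpenMPI relocation described in item 7 above, the following minimal Python sketch builds the per-compiler binary directory and places it first on PATH. The directory layout (/usr/openmpi/bin/compilerName/ and /usr/include/openmpi/) comes from the release note. The function names and the compiler name "gnu" are hypothetical placeholders; the actual subdirectory names are whatever is installed under /usr/openmpi/bin/ on your system.

```python
import posixpath

# Locations taken from the release note for ClusterWare 5.3.0.
OPENMPI_ROOT = "/usr/openmpi"
OPENMPI_INCLUDE_DIR = "/usr/include/openmpi"


def openmpi_bin_dir(compiler_name):
    """Return the per-compiler binary directory, /usr/openmpi/bin/compilerName."""
    return posixpath.join(OPENMPI_ROOT, "bin", compiler_name)


def prepend_openmpi_to_path(env, compiler_name):
    """Return a copy of env whose PATH starts with the per-compiler bin directory.

    Hypothetical helper: it does not check that the directory exists.
    """
    new_env = dict(env)
    bin_dir = openmpi_bin_dir(compiler_name)
    # Keep the existing entries, dropping empties and any earlier copy of bin_dir.
    old_parts = [p for p in new_env.get("PATH", "").split(":") if p and p != bin_dir]
    new_env["PATH"] = ":".join([bin_dir] + old_parts)
    return new_env


if __name__ == "__main__":
    # "gnu" is an assumed compiler name used only for illustration.
    env = prepend_openmpi_to_path({"PATH": "/usr/bin:/bin"}, "gnu")
    print(env["PATH"])
    print("include files:", OPENMPI_INCLUDE_DIR)
```

Build scripts that hard-coded /usr/openmpi/bin/ or /usr/openmpi/include need a comparable adjustment. ENV Modules (item 4) may be a more convenient way to make the same switch.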

New in ClusterWare 5.2.0

  1. The base kernel is 2.6.18-92.1.13. See https://rhn.redhat.com/errata/RHSA-2008-0612.html for details. This kernel requires a base distribution of RHEL5-U2 or CentOS 5.2.

  2. ClusterWare 5 has dropped support for a Penguin-customized lam and has added support for a Penguin-customized OpenMPI version 1.2.8, called openmpi-scyld. Users who still require lam and who do not want to convert to openmpi-scyld can use the base distribution version of lam.

  3. RHEL5-U2 includes gcc version 4, which imposes more stringent type-checking and syntax enforcement than the earlier gcc version 3.

  4. RHEL5 includes the OpenFabrics Enterprise Distribution (OFED) InfiniBand software stack version 1.3. OFED 1.3 (which includes DAPL 2.0) may be incompatible with MPI stacks supplied by certain ISV applications that are compatible with OFED 1.2 (which includes DAPL 1.0). No OFED 1.3 problems have been observed for applications based upon MVAPICH or OpenMPI.

  5. Adds support for Nvidia CUDA.

  6. TORQUE is upgraded to version 2.3.3.

  7. ClusterWare distributes a custom forcedeth kernel driver for Nvidia Ethernet hardware that is a backport from the linux-2.6.26.5 kernel and appears to be more reliable than the driver found in the RHEL5-U2 base distribution.