NFS

NFS is the standard way to have files stored on one machine, yet be able to access them from other machines on the network as if they were stored locally.

NFS on Clusters

In clusters, NFS is typically used so that when all the nodes need the same file or set of files, they can access a single shared copy through NFS. This way, if one node changes a file, every other node sees the change, and there is only one copy of the file that needs to be backed up.

Configuration of NFS

The Network File System (NFS) is what Scyld ClusterWare uses to allow users to access their home directories and other remote directories from compute nodes. (The User's Guide has a small discussion on good and bad ways to use NFS.) Two files control which directories are NFS mounted on the compute nodes. The first is /etc/exports. This tells the NFS daemon on the master node which directories it should allow to be mounted and who can access them. Scyld ClusterWare adds various commonly useful entries to /etc/exports. For example:

/home  @cluster(rw)

The /home says that /home can be NFS mounted, and @cluster(rw) says who can mount it and what forms of access are allowed. @cluster is a netgroup, a single name that represents several machines. cluster is a special netgroup, set up by BeoNSS, that automatically maps to all of your compute nodes, which makes it easy to specify that a directory can be mounted by any compute node. The (rw) part specifies what permissions the compute node has when it mounts /home; in this case, all user processes on the compute nodes have read-write access to /home. There are more options that can go here, and you can find them detailed in man exports.
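
As a purely illustrative sketch (the /scratch path is hypothetical), an additional read-only export using options documented in man exports could look like this:

/scratch  @cluster(ro,sync)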

The second file is /etc/beowulf/fstab. (Note that it is possible to set up an individual fstab.N for a node. For this discussion, we will assume that you are using a global fstab for all nodes.) For example, one line in the default /etc/beowulf/fstab is the following:

$MASTER:/home  /home  nfs  nolock,nonfatal  0 0

This is the line that tells the compute nodes to try to mount /home when they boot.
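
The per-node fstab.N mentioned above works the same way. As a sketch (assuming N is the node number, so node 3 would use /etc/beowulf/fstab.3, and the /scratch3 export is hypothetical), a node-specific file could add a mount that only that node performs:

$MASTER:/scratch3  /scratch3  nfs  nolock,nonfatal  0 0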

To add an NFS mount of /foo to all your compute nodes, first add the following line to the end of the /etc/exports file:

/foo  @cluster(rw)

Then execute /usr/sbin/exportfs -a as root. For the mount to take place the next time your compute nodes reboot, you must add the following line to the end of /etc/beowulf/fstab:

$MASTER:/foo  /foo  nfs  nolock  0 0

You can then reboot all your nodes to make the NFS mount happen. If you wish to mount the newly exported filesystem without rebooting the compute nodes, you can instead issue the following two commands:

[root@cluster ~] # bpsh -a mkdir -p /foo
[root@cluster ~] # bpsh -a mount -t nfs -o nolock master:/foo /foo

Note that /foo will need to be replaced with the directory you actually want to mount.
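
To verify the result, one possible check (using standard commands, with /foo adjusted as above) is to confirm the export on the master node and then confirm the mount on the compute nodes:

[root@cluster ~] # /usr/sbin/exportfs -v
[root@cluster ~] # bpsh -a df -h /foo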

If you wish to stop mounting a certain directory on the compute nodes, you can either remove the line from /etc/beowulf/fstab or just comment it out by inserting a '#' at the beginning of the line. You can leave untouched the entry referring to the filesystem in /etc/exports, or you can delete the reference, whichever you feel more comfortable with.
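
For example, commenting out the /foo entry added earlier would look like this:

#$MASTER:/foo  /foo  nfs  nolock  0 0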

If you wish to unmount that directory on all the compute nodes without rebooting them, you can then run the following:

[root@cluster ~] # bpsh -a umount /foo

where /foo is the directory you no longer wish to have NFS mounted.

Caution

On compute nodes, NFS directories must be mounted using either a specific IP address or the $MASTER keyword; the hostname cannot be used. This is because fstab is evaluated before node name resolution is available.
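
For example (the 10.54.0.1 address and /data directory below are purely illustrative), the first two /etc/beowulf/fstab entries use acceptable server specifications, while the commented entry shows the hostname form that will not work:

$MASTER:/home    /home  nfs  nolock,nonfatal  0 0
10.54.0.1:/data  /data  nfs  nolock,nonfatal  0 0
# The hostname form below will NOT work, because name resolution
# is not yet available when fstab is evaluated:
# nfsserver:/data  /data  nfs  nolock,nonfatal  0 0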

File Locking Over NFS

By default, the compute nodes mount NFSv3 filesystems with locking turned off. If you have a program that requires locking, first ensure that the nfslock service is enabled on the master node and is executing:

[root@cluster ~] # /sbin/chkconfig nfslock on
[root@cluster ~] # service nfslock start

Next, edit /etc/beowulf/fstab to remove the nolock keyword from the NFS mount entries.
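
For example, the default /home entry shown earlier would become:

$MASTER:/home  /home  nfs  nonfatal  0 0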

Finally, reboot the cluster nodes to effect the NFS remounting with locking enabled.
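
One common way to reboot all compute nodes from the master node is with bpctl; this is an assumption about your tooling, so verify the exact options in the bpctl man page on your system:

[root@cluster ~] # bpctl -S all -R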

NFSD Configuration

By default, when the master node reboots, the /etc/init.d/nfs script launches 8 NFS daemon threads to service client NFS requests. For large clusters this count may be insufficient. One symptom of an insufficient thread count is a syslog message, most commonly seen when you boot all the cluster nodes:

nfsd: too many open TCP sockets, consider increasing the number of nfsd threads

To increase the thread count (e.g., to 16):

[root@cluster ~] # echo 16 > /proc/fs/nfsd/threads

Ideally, the chosen thread count should be sufficient to eliminate the syslog complaints, but not significantly higher, as that would unnecessarily consume system resources. To make the new value persistent across master node reboots, create the file /etc/sysconfig/nfs, if it does not already exist, and add to it an entry of the form:

RPCNFSDCOUNT=16

A thread count of 1.5x to 2x the number of nodes is generally adequate, though it may be more than strictly necessary.
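
As a worked example for a hypothetical 64-node cluster, 1.5x to 2x the node count suggests a value between 96 and 128 threads, which could be applied immediately and made persistent like so (edit the existing RPCNFSDCOUNT entry instead, if one is already present):

[root@cluster ~] # echo 96 > /proc/fs/nfsd/threads
[root@cluster ~] # echo 'RPCNFSDCOUNT=96' >> /etc/sysconfig/nfs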

A more refined analysis starts with examining NFSD statistics:

[root@cluster ~] # grep th /proc/net/rpc/nfsd

which outputs thread statistics of the form:

th 16 10 26.774 5.801 0.035 0.000 0.019 0.008 0.003 0.011 0.000 0.040
From left to right, the 16 is the current number of NFSD threads, and the 10 is the number of times that all threads have been simultaneously busy. (Not every instance of all threads being busy results in that syslog message, but a high all-busy count does suggest that adding more threads may be beneficial.)

The remaining 10 numbers are histogram buckets showing how many accumulated seconds a given percentage of the total number of threads have been simultaneously busy. In this example, 0-10% of the threads were busy for 26.774 seconds, 10-20% of the threads were busy for 5.801 seconds, and 90-100% of the threads were busy for 0.040 seconds. High values in the last buckets indicate that most or all of the threads are simultaneously busy for significant periods of time, which suggests that adding more threads may be beneficial.
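
As a rough monitoring aid (a sketch only, assuming the th line format shown above), the following awk one-liner prints the current thread count, the all-busy count, and the seconds accumulated in the top 90-100% bucket:

[root@cluster ~] # awk '/^th/ {printf "threads=%s all-busy=%s top-bucket-seconds=%s\n", $2, $3, $NF}' /proc/net/rpc/nfsd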