For Scyld ClusterWare HPC, the default installation includes the TORQUE resource manager, which provides users an intuitive interface for remotely initiating and managing batch jobs on distributed compute nodes. TORQUE is an open source tool based on the standard OpenPBS. Basic instructions for using TORQUE are provided in the next section. For more general product information, see the TORQUE information page sponsored by Cluster Resources, Inc. (CRI). (Note that TORQUE is not included in the default installation of Scyld Beowulf Series 30 at this time.)
Scyld also offers the Scyld TaskMaster Suite for clusters running Scyld Beowulf Series 30, Scyld ClusterWare HPC, and upgrades to these products. TaskMaster is a Scyld-branded and supported commercial scheduler and resource manager, developed jointly with Cluster Resources. For information on TaskMaster, see the Scyld TaskMaster Suite page in the HPC Clustering area of the Penguin web site, or contact Scyld Customer Support.
In addition, Scyld provides support for most popular open source and commercial schedulers and resource managers, including SGE, LSF, PBSPro, Maui and MOAB. For the latest information, see the Scyld MasterLink™ support site.
The default installation is configured as a simple job serializer with a single queue named batch.
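Assuming a standard TORQUE installation, you can confirm this queue configuration with the usual TORQUE query commands (these must be run on a host where TORQUE is installed):

```shell
# List all queues and their limits; a default install shows one queue, "batch"
qstat -Q

# Show the full server and queue configuration
qmgr -c "print server"
```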
You can use the TORQUE resource manager to run jobs, check job status, find out which nodes are running your job, and find job output.
To run a job with TORQUE, put the commands you would normally use into a job script, and then submit the job script to the cluster using qsub. The qsub program has a number of options that may be supplied on the command line or as special directives inside the job script. For the most part, these options behave the same in a job script as on the command line, but job scripts make it easier to manage your actions and their results.
Following are some examples of running a job using qsub. For more detailed information on qsub, see the qsub man page.
Example 9. Starting a Job with a Job Script Using One Node
The following script declares a job with the name "myjob", to be run using one node. The script uses the PBS -N directive, launches the job, and finally sends the current date and working directory to standard output.
#!/bin/sh
## Set the job name
#PBS -N myjob
#PBS -l nodes=1

# Run my job
/path/to/myjob
echo Date: $(date)
echo Dir: $PWD
You would submit "myjob" as follows:
[bjosh@iceberg]$ qsub -l nodes=1 myjob
15.iceberg
Example 10. Starting a Job from the Command Line
This example provides the command line equivalent of the job run in the example above. We enter all of the qsub options on the initial command line. Then qsub reads the job commands line by line until we type ^D, the end-of-file character. At that point, qsub queues the job and returns the Job ID.
[bjosh@iceberg]$ qsub -N myjob -l nodes=1:ppn=1 -j oe
cd $PBS_O_WORKDIR
echo Date: $(date)
echo Dir: $PWD
^D
16.iceberg
Example 11. Starting an MPI Job with a Job Script
The following script declares an MPI job named "mpijob". The script uses the PBS -N directive, prints out the nodes that will run the job, launches the job using mpirun, and finally prints out the current date and working directory. When submitting MPI jobs through TORQUE, it is recommended to invoke mpirun without host or machinefile arguments: mpirun detects that it is being launched from within TORQUE and starts the job on the nodes TORQUE has assigned. In this case, TORQUE will properly manage and track the resources used by the job.
#!/bin/sh
## Set the job name
#PBS -N mpijob

# Print the nodes assigned to this job
cat $PBS_NODEFILE

# Run my job
mpirun /path/to/mpijob
echo Date: $(date)
echo Dir: $PWD
To request 8 total processors to run "mpijob", you would submit the job as follows:
[bjosh@iceberg]$ qsub -l nodes=8 mpijob
17.iceberg
To request 8 total processors, using 4 nodes, each with 2 processors per node, you would submit the job as follows:
[bjosh@iceberg]$ qsub -l nodes=4:ppn=2 mpijob
18.iceberg
You can check the status of your job using qstat. The command line option qstat -n displays the status of queued jobs. To watch the progression of events, use the watch command to execute qstat -n repeatedly (every 2 seconds by default); type [CTRL]-C to interrupt watch when needed.
Example 12. Checking Job Status
This example shows how to check the status of the job named "myjob", which we ran on 1 node in the first example above, using the option to watch the progression of events.
[bjosh@iceberg]$ qsub myjob && watch qstat -n

iceberg:
Job ID      Username Queue   Jobname SessID NDS TSK ReqdMemory ReqdTime S ElapTime
----------- -------- ------- ------- ------ --- --- ---------- -------- - --------
15.iceberg  bjosh    default myjob   --     1   --  --         00:01    Q --
To find out which nodes are running your job, use the following commands:
To find your Job IDs: qstat -an
To find the Process IDs of your jobs: qstat -f <jobid>
To find the number of the node running your job: ps -ef | bpstat -P | grep <yourname>
The number of the node running your job will be displayed in the first column of output.
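The last pipeline above can be wrapped in a small helper function. The sample lines below are illustrative bpstat -P style output, not captured from a real cluster; the awk step simply extracts the node number from the first column:

```shell
#!/bin/sh
# On a live cluster you would feed in real data:
#   ps -ef | bpstat -P | find_my_nodes <yourname>
# bpstat -P prefixes each process line with the number of the node it runs on.
find_my_nodes() {
    grep "$1" | awk '{print $1}' | sort -u
}

# Illustrative sample of "ps -ef | bpstat -P" output (node number first):
printf '%s\n' \
  "0    bjosh  1234  1  0 10:00 ?  00:00:01 /path/to/myjob" \
  "1    other  5678  1  0 10:00 ?  00:00:02 /bin/sleep" \
  | find_my_nodes bjosh
# prints: 0
```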
When your job terminates, TORQUE will store its output and error streams in files in the script's working directory.
Default output file: <jobname>.o<jobid>
You can override the default using qsub with the -o <path> option on the command line, or use the #PBS -o <path> directive in your job script.
Default error file: <jobname>.e<jobid>
You can override the default using qsub with the -e <path> option on the command line, or use the #PBS -e <path> directive in your job script.
To join the output and error streams into a single file, use qsub with the -j oe option on the command line, or use the #PBS -j oe directive in your job script.
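Putting these directives together, a job script that names the job and joins its output and error streams into a single file might look like the sketch below (the payload command is a placeholder):

```shell
#!/bin/sh
## Set the job name
#PBS -N myjob
## Join the output and error streams into a single file (<jobname>.o<jobid>)
#PBS -j oe
## Alternatively, direct each stream to an explicit path instead:
##PBS -o /path/to/myjob.out
##PBS -e /path/to/myjob.err

# Placeholder payload; replace with your actual job command
echo "Job started: $(date)"
```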