Job Batching

Job Batching Options for ClusterWare

For Scyld ClusterWare HPC, the default installation includes the TORQUE resource manager, which gives users an intuitive interface for remotely initiating and managing batch jobs on distributed compute nodes. TORQUE is an open source tool based on standard OpenPBS. Basic instructions for using TORQUE are provided in the next section. For more general product information, see the TORQUE information page sponsored by Cluster Resources, Inc. (CRI). (Note that TORQUE is not included in the default installation of Scyld Beowulf Series 30 at this time.)

Scyld also offers the Scyld TaskMaster Suite for clusters running Scyld Beowulf Series 30, Scyld ClusterWare HPC, and upgrades to these products. TaskMaster is a Scyld-branded and supported commercial scheduler and resource manager, developed jointly with Cluster Resources. For information on TaskMaster, see the Scyld TaskMaster Suite page in the HPC Clustering area of the Penguin web site, or contact Scyld Customer Support.

In addition, Scyld provides support for most popular open source and commercial schedulers and resource managers, including SGE, LSF, PBSPro, Maui and MOAB. For the latest information, see the Scyld MasterLink™ support site.

Job Batching with TORQUE

The default installation is configured as a simple job serializer with a single queue named batch.

You can use the TORQUE resource manager to run jobs, check job status, find out which nodes are running your job, and find job output.

Running a Job

To run a job with TORQUE, you can put the commands you would normally use into a job script, and then submit the job script to the cluster using qsub. The qsub program has a number of options that may be supplied on the command line or as special directives inside the job script. For the most part, these options should behave exactly the same in a job script or via the command line, but job scripts make it easier to manage your actions and their results.

Following are some examples of running a job using qsub. For more detailed information on qsub, see the qsub man page.

Example 9. Starting a Job with a Job Script Using One Node

The following script declares a job with the name "myjob", to be run using one node. The script sets the job name with the #PBS -N directive, requests one node with the #PBS -l nodes=1 directive, and then sends the current date and working directory to standard output.


## Set the job name
#PBS -N myjob
#PBS -l nodes=1

# Run my job

echo Date:  $(date)
echo Dir:  $PWD

You would submit "myjob" as follows:

[bjosh@iceberg]$ qsub -l nodes=1 myjob
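
When the submission is part of a larger script, it can be convenient to keep the Job ID that qsub prints, for later use with commands such as qstat -f. The following is a minimal sketch, assuming qsub writes only the Job ID to standard output; the function name submit_job is hypothetical and not part of TORQUE.

```shell
# submit_job: hypothetical helper, not part of TORQUE.
# Submits a job script on one node and prints the Job ID returned by qsub.
submit_job() {
    script="$1"
    # Fail early with a readable message if qsub is not on the PATH.
    if ! command -v qsub >/dev/null 2>&1; then
        echo "qsub not found; is TORQUE installed?" >&2
        return 1
    fi
    qsub -l nodes=1 "$script"
}

# Usage on a host with TORQUE:
#   jobid=$(submit_job myjob) && echo "Submitted $jobid"
```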

Example 10. Starting a Job from the Command Line

This example provides the command line equivalent of the job run in the example above. We enter all of the qsub options on the initial command line. Then qsub reads the job commands line-by-line until we type ^D, the end-of-file character. At that point, qsub queues the job and returns the Job ID.

[bjosh@iceberg]$ qsub -N myjob -l nodes=1:ppn=1 -j oe
echo Date:  $(date)
echo Dir:  $PWD
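
Because qsub reads the job commands from standard input when no script file is named, the same job can also be submitted without typing ^D by hand. The following sketch wraps the example above in a shell here-document; the function name submit_inline is hypothetical.

```shell
# submit_inline: hypothetical helper, not part of TORQUE.
# Feeds the job commands to qsub on standard input using a here-document.
submit_inline() {
    # Quoting 'EOF' keeps $(date) and $PWD unexpanded here, so they are
    # evaluated when the job runs rather than when it is submitted.
    qsub -N myjob -l nodes=1:ppn=1 -j oe <<'EOF'
echo Date:  $(date)
echo Dir:  $PWD
EOF
}

# Usage on a host with TORQUE:
#   submit_inline
```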

Example 11. Starting an MPI Job with a Job Script

The following script declares an MPI job named "mpijob". The script sets the job name with the #PBS -N directive, launches the job using mpirun, and finally prints out the current date and working directory. When submitting MPI jobs using TORQUE, it is recommended to simply call mpirun without any arguments other than the program to run. mpirun will detect that it is being launched from within TORQUE and ensure that the job is properly started on the nodes TORQUE has assigned to it. In this case, TORQUE will properly manage and track resources used by the job.

## Set the job name
#PBS -N mpijob

# RUN my job
mpirun /path/to/mpijob

echo Date:  $(date)
echo Dir:  $PWD

To request 8 total processors to run "mpijob", you would submit the job as follows:

[bjosh@iceberg]$ qsub -l nodes=8 mpijob

To request 8 total processors, using 4 nodes, each with 2 processors per node, you would submit the job as follows:

[bjosh@iceberg]$ qsub -l nodes=4:ppn=2 mpijob
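
The processor total in these requests is the number of nodes multiplied by the processors per node. As a small illustration, the hypothetical helper below (not part of TORQUE) builds the resource string for qsub -l and reports the resulting total; it assumes ppn defaults to 1 when omitted.

```shell
# resource_request: hypothetical helper, not part of TORQUE.
# Prints the qsub -l resource string and the total processor count for a
# given number of nodes and processors per node (ppn, assumed to default to 1).
resource_request() {
    nodes="$1"
    ppn="${2:-1}"
    echo "nodes=${nodes}:ppn=${ppn} total=$((nodes * ppn))"
}

resource_request 4 2   # prints: nodes=4:ppn=2 total=8
```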

Checking Job Status

You can check the status of your job using qstat. The command qstat -n displays the status of queued jobs. To watch the progression of events, run qstat -n under the watch command, which re-executes it every 2 seconds by default; type [CTRL]-C to interrupt watch when needed.

Example 12. Checking Job Status

This example shows how to check the status of the job named "myjob", which we ran on 1 node in the first example above, using the option to watch the progression of events.

[bjosh@iceberg]$ qsub myjob && watch qstat -n

JobID	Username	Queue	Jobname	SessID	NDS	TSK	ReqdMemory	ReqdTime	S	ElapTime
15.iceberg	bjosh	default	myjob	--	1	--	--	00:01	Q	--
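
To change the refresh interval, watch accepts the -n option, for example watch -n 5 qstat -n. In a script, where an interactive display is not useful, a polling loop can serve the same purpose. The following is a sketch only: wait_for_job is a hypothetical helper, and it assumes that qstat exits with a non-zero status once the Job ID is no longer known to the server (a site that keeps completed jobs listed for a while will delay that point).

```shell
# wait_for_job: hypothetical helper, not part of TORQUE.
# Polls qstat until the given Job ID is no longer reported, then returns.
# Assumes qstat returns a non-zero exit status for an unknown Job ID.
wait_for_job() {
    jobid="$1"
    interval="${2:-2}"   # seconds between polls, matching the watch default
    while qstat "$jobid" >/dev/null 2>&1; do
        sleep "$interval"
    done
    echo "Job $jobid is no longer in the queue"
}

# Usage on a host with TORQUE:
#   wait_for_job 15.iceberg 5
```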

Table 1. Useful Job Status Commands

Command	Description
ps -ef | bpstat -P	Display all running jobs, with node number for each
qstat -Q	Display status of all queues
qstat -n	Display status of queued jobs
qstat -f JOBID	Display very detailed information about Job ID
pbsnodes -a	Display status of all nodes

Finding Out Which Nodes Are Running a Job

To find out which nodes are running your job, use the following commands:

  • To find your Job IDs: qstat -an

  • To find the Process IDs of your jobs: qstat -f <jobid>

  • To find the number of the node running your job: ps -ef | bpstat -P | grep <yourname>

    The number of the node running your job will be displayed in the first column of output.
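
The detailed listing from qstat -f can also be filtered for the hosts assigned to a running job. The sketch below assumes the listing contains an exec_host line of the form "exec_host = host/cpu+host/cpu", written on a single line; the helper name job_nodes and the host names in the usage comment are illustrative only.

```shell
# job_nodes: hypothetical helper, not part of TORQUE.
# Reads "qstat -f" output on standard input and prints each distinct host
# named on the exec_host line, one host per line.
# Assumes exec_host entries look like host/cpu joined by "+".
job_nodes() {
    awk -F' = ' '/^[[:space:]]*exec_host = / { print $2 }' |
        tr '+' '\n' |      # one host/cpu entry per line
        sed 's|/.*||' |    # drop the /cpu suffix
        sort -u            # keep each host once
}

# Usage on a host with TORQUE:
#   qstat -f <jobid> | job_nodes
```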

Finding Job Output

When your job terminates, TORQUE will store its output and error streams in files in the script's working directory.

  • Default output file: <jobname>.o<jobid>

    You can override the default using qsub with the -o <path> option on the command line, or use the #PBS -o <path> directive in your job script.

  • Default error file: <jobname>.e<jobid>

    You can override the default using qsub with the -e <path> option on the command line, or use the #PBS -e <path> directive in your job script.

  • To join the output and error streams into a single file, use qsub with the -j oe option on the command line, or use the #PBS -j oe directive in your job script.
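
As an illustration of the default naming, the hypothetical helper below (not part of TORQUE) prints the output and error file names for a given job name and Job ID. It assumes that only the numeric part of the Job ID, before the first ".", appears in the file name.

```shell
# default_output_files: hypothetical helper, not part of TORQUE.
# Prints the default output and error file names for a job name and Job ID.
# Assumes the file names use the part of the Job ID before the first ".".
default_output_files() {
    name="$1"
    jobid="$2"
    seq="${jobid%%.*}"   # e.g. 15.iceberg becomes 15
    echo "${name}.o${seq} ${name}.e${seq}"
}

default_output_files myjob 15.iceberg   # prints: myjob.o15 myjob.e15
```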