

Scheduling

The cluster is running SLURM's scheduler with:

  • “Fair share” job priorities.
  • Simple round-robin node selection.

The “Fair Share” algorithm assigns a priority to each submitted job that is inversely related to the amount of cluster CPU time the submitting user has consumed in recent days. These priorities apply only to jobs waiting in the queue; they do not affect jobs which are already running.
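
If you want to see how fair share is affecting your own jobs, SLURM's sprio and sshare utilities report job priority factors and recent usage. A minimal sketch, assuming both tools are installed and enabled on this cluster:

sprio -u $USER     # priority factors (including fair-share) for your pending jobs
sshare -u $USER    # your recent usage and current fair-share factor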

The scheduler does not check which nodes are busy and try to avoid them. This has an advantage in that it tends to leave some nodes empty for people who need a whole node.

Limiting the Number of Nodes Your Jobs Use

If you submit many jobs to the cluster (perhaps using individual sbatch commands), they will use any of the nodes/cores available in the partition to which you submitted them. If you submit, say, 1000 jobs you may well fill all available cores on the cluster and leave no resources for other users, so it is a good idea to limit the number of jobs running at any one time. An array job is an ideal way of doing this: it makes only a single entry in the job queue (rather than 1000 in our example) and lets you specify the maximum number of sub-tasks that should run at any one time.

Array jobs are described in more detail on the sbatch page.
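
As a brief, hedged sketch (the script name and the limit of 50 are just examples, not site policy), a 1000-element array that never runs more than 50 sub-tasks at once can be submitted with the %-limit syntax:

sbatch --array=1-1000%50 my_script

Inside my_script, the environment variable SLURM_ARRAY_TASK_ID tells each sub-task which piece of work it should do.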

Some other options that may be useful are described below.

Sending a Job to a Specific Node

You can use the -w option to send a job to a specific node. Note, however, that it asks for “at least” the nodes in the node list you specify, so a command like:

sbatch -n 20 -w node2 my_script

would get you some cores on node2 and some on another node (since node2 has only 16 cores in total). If no cores were free on node2, the job would be queued until some became available.

Note that using the -w option with multiple nodes is not a way of queueing jobs on just those nodes: it will actually allocate cores across all the nodes you specify and run the job script on just the first of them. For example:

sbatch -w node[2-4] my_script

would allocate one core on each of nodes 2, 3 and 4 and run my_script on node2. You would then use srun from within your script to run job steps within this allocation of cores (see “Using Job Steps” below). Don't use this option to try to limit the nodes your jobs run on.

Excluding Some Nodes

You can use the -x option to avoid specific nodes. A list of node names looks like this:

node[1-4,7,11]

This is read as “nodes 1 to 4, 7 and 11”, i.e. nodes 1, 2, 3, 4, 7 and 11.
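
For example, to keep a job away from that set of nodes (the node names here are purely illustrative):

sbatch -x node[1-4,7,11] my_script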

You can use -c 16 to request all cores on a (standard) node.

You can use the --exclusive option to ask for exclusive access to all the nodes your job is allocated. This is especially useful if you have a program which attempts to use all the cores it finds. Please only use it if you need it.
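
As a quick sketch (my_script is a placeholder name), these two requests look like this on the command line:

sbatch -c 16 my_script          # ask for 16 cores, i.e. all cores on a standard node
sbatch --exclusive my_script    # ask for exclusive access to the allocated node(s)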

You can also use an array job to limit the number of sub-tasks running at any one time; this is a very good choice, and is highly recommended, if you are submitting a lot of jobs. An array job occupies just one slot in the “pending jobs” section of the squeue output.

Using Job Steps

If you use srun within an sbatch script, the cores used by each srun job step can be sub-allocated from the cores allotted to the sbatch script. For example, the following script, test-multi.sh, which is submitted with sbatch, requests 3 tasks (each defaulting to one core):

#!/bin/bash
#SBATCH --ntasks 3

# Launch 7 job steps, each using 1 task from the 3-task allocation; with
# --exclusive, srun defers a step until a task (core) in the allocation is free.
for i in {1..7}; do
    srun --ntasks 1 --exclusive ./pause.sh "$i" &
done

# Wait for all background job steps to finish before the batch script exits.
wait

Note that the “--exclusive” flag to srun here has a different meaning than it does when submitting a job from a terminal with srun or sbatch. From “man srun”:

This option can also be used when initiating more than one job step within an existing resource allocation, where you want separate processors to be dedicated to each job step. If sufficient processors are not available to initiate the job step, it will be deferred. This can be thought of as providing a mechanism for resource management to the job within its allocation.

The “&” at the end of the srun line is important: without it each srun blocks, causing the job steps to be executed consecutively rather than concurrently.

Similarly, the “wait” at the end of the script is important: it stops the script from exiting before all the (now background) srun jobs have finished. Without the wait, some of the jobs would likely still be running when the script ends and would be terminated along with it.

Some example code for pause.sh could be as follows.

#!/bin/bash
# Print start/end markers so you can see which job step ran where and when.
echo START $1 $(hostname) SLURM_STEP_ID = $SLURM_STEP_ID SLURM_PROCID = $SLURM_PROCID
# env    # uncomment to dump the full environment for debugging
sleep 60
echo END $1 $(hostname) SLURM_STEP_ID = $SLURM_STEP_ID SLURM_PROCID = $SLURM_PROCID

Now, if you run:

sbatch test-multi.sh

you should see at most three pause.sh steps running at any one time; the remaining srun steps are deferred until one of the three tasks in the allocation becomes free.
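
One hedged way to watch this in action (assuming the default slurm-<jobid>.out output file and that squeue's step display is available on this cluster):

squeue -u $USER -s            # list your running job steps rather than whole jobs
tail -f slurm-<jobid>.out     # follow the START/END lines printed by pause.sh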