


Running Jobs

You should submit all long-running or computationally intensive jobs to one
of the compute nodes. You can write and test your code, make trial runs
of programs, format data, etc., on the head node, but please don't bog it down.

In general, the command you should use to run jobs on the compute nodes is sbatch.

sbatch

sbatch submits a script to a partition (standard by default) and exits immediately. Your script is executed when resources become available (often immediately). Output from the script (stdout and stderr) is redirected to a file.

Use the -p option to specify a partition.

sbatch -p bigmem huge_alignment
sbatch -p gpu fancy-gpu-code
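As a rough illustration of what such a script might look like, here is a minimal sketch. The file name hello.sh, the job name, and the output file name are made up for the example; the #SBATCH lines are ordinary comments to bash, so you can also try the script directly on the head node before submitting it.

```shell
#!/bin/bash
# Minimal job script sketch. Save as hello.sh (hypothetical name) and submit with:
#   sbatch hello.sh
#SBATCH --partition=standard        # same effect as "sbatch -p standard"
#SBATCH --job-name=hello            # hypothetical job name
#SBATCH --output=hello-%j.out       # %j is replaced by the job ID

# SLURM_JOB_ID is set by SLURM inside a job; fall back to a placeholder otherwise.
job_id="${SLURM_JOB_ID:-none}"
echo "Running on $(hostname) as job ${job_id}"
```

If you leave out --output, sbatch writes to slurm-<jobid>.out in the directory you submitted from.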

You can also submit an “array job”. This submits the same job multiple times. You can use a SLURM-defined environment variable to distinguish between copies of the job from within the job script.
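A sketch of an array job script, assuming you have ten input files named sample_1.txt to sample_10.txt (hypothetical names). The --array option and the SLURM_ARRAY_TASK_ID variable are standard SLURM; everything else here is illustrative.

```shell
#!/bin/bash
# Array job sketch: SLURM runs this script once per index in the range.
#SBATCH --array=1-10
#SBATCH --output=array-%A_%a.out    # %A = array job ID, %a = array index

# SLURM_ARRAY_TASK_ID distinguishes the copies; default to 1 when run by hand.
task_id="${SLURM_ARRAY_TASK_ID:-1}"

# Hypothetical input naming scheme: one file per array index.
input_file="sample_${task_id}.txt"
echo "Array task ${task_id} would process ${input_file}"
```

Each copy is scheduled as a separate job, so each gets its own allocation and its own output file.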

You can run programs using srun, but you need to take action to prevent your program from stopping on a “hang-up” (loss of network connection). (We'll see more on the purpose of srun later.)

You can also run programs using salloc. This is useful because it allows you to make an allocation of resources (nodes and cores) and then run programs within that allocation interactively using srun. It is subject to the same “hang-up” caveat as srun. By default salloc will run bash.

salloc -n 5 
srun hostname

By default, when you submit a job it will be allocated one core and 8GB of RAM on a compute node. This core and memory are then unavailable to other cluster users. The cluster currently has about 900 cores in total.

Files: ex1.bash

To alter your allocation you can use the -c, -n, -N, and --exclusive options (the long versions of the first three are --cpus-per-task, --ntasks, and --nodes).


-c, --cpus-per-task

cpus==cores in our configuration.

This option sets how many cores each of your tasks will use. (By default you are running one task.) Since the nodes in the standard partition have a maximum of 24 cores, you can't get an allocation of more than 24 cores per task in that partition; a job asking for more would be rejected.

To get an entire node to yourself, regardless of the number of cores on the node, you can use the --exclusive option.

You should use the -c option if you are running a program that lets you specify how many threads to run (or cores to use). So, if the program has an option to say “use 8 cores”, you should also tell SLURM that your program is using 8 cores. If you don't, the node you are running on may end up running many more threads than the 16 (or 24) it would normally be limited to, and could get bogged down.
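One way to keep the two numbers in step is to read the core count back from SLURM inside the job script instead of typing it twice. This is only a sketch: my_aligner, its --threads option, and input.fa are hypothetical stand-ins for whatever program you actually run. SLURM_CPUS_PER_TASK is set by SLURM when you request -c.

```shell
#!/bin/bash
# Sketch: ask SLURM for 8 cores and pass the same number to the program.
#SBATCH --cpus-per-task=8

# Set by SLURM when -c / --cpus-per-task is given; default to 1 otherwise.
threads="${SLURM_CPUS_PER_TASK:-1}"

# my_aligner and --threads are hypothetical; substitute your program's own option.
# The command is only printed here; remove the echo to actually run it.
echo my_aligner --threads "${threads}" input.fa
```

If you later change the -c value, the thread count passed to the program follows automatically.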


-n, --ntasks

This option is useful in very specific circumstances, and is not typically what users of the BRC cluster want. Avoid it unless you are sure it is what you want!

This option specifies the maximum number of tasks you will be running. This can spread your allocation across multiple nodes: for example, -n 40 at one core per task needs more than one node in the standard partition, since no node there has more than 24 cores.
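The arithmetic behind that example can be sketched as follows, assuming one core per task and the 24-core maximum quoted above. The variable names are just for illustration, and SLURM may well spread the tasks over more nodes than this minimum.

```shell
#!/bin/bash
# Worked example: minimum number of nodes for -n 40 at one core per task.
ntasks=40            # from the "-n 40" example
cores_per_node=24    # largest node size quoted for the standard partition

# Integer ceiling division: round up ntasks / cores_per_node.
min_nodes=$(( (ntasks + cores_per_node - 1) / cores_per_node ))
echo "At least ${min_nodes} nodes are needed"
```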

It behaves a little differently between sbatch and srun.

For sbatch, it will allocate the number of cores specified, possibly across multiple nodes, and run your script once on the first of those nodes. It is then up to that script to make use of the other allocated cores by using srun from within itself. When you use srun from within the script, it will start one copy of the srun task on each of the allocated cores, i.e., it will run multiple copies of the same task for each srun command.
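A small sketch of that behaviour, assuming you ask for four tasks. The check for srun is there only so the script can also be tried outside the cluster; in a real job script you would just call srun.

```shell
#!/bin/bash
# Sketch: the batch script runs once; srun inside it runs once per task.
#SBATCH --ntasks=4

# SLURM_NTASKS is set by SLURM when -n is given; default to 1 otherwise.
ntasks="${SLURM_NTASKS:-1}"
echo "Batch script runs once, on $(hostname), with ${ntasks} task(s) allocated"

if command -v srun >/dev/null 2>&1; then
    # Inside the allocation this starts one copy of hostname per task.
    srun hostname
else
    echo "srun not found; submit this script with sbatch on the cluster"
fi
```

The first echo should appear once in the output file, while srun hostname should produce one line per task.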

For srun, it will cause one copy of your task to be run on each allocated core, i.e., it will run multiple copies of the same task for the one srun command.


-N, --nodes

This option has the same caveats as “-n” (see above): it behaves differently between sbatch and srun, and may well not be what you want.

This option specifies how many nodes you want your tasks to run on. It does not allocate whole nodes to you. Using -N 2 with none of the other relevant options would give you 1 core on each of two different nodes.


--exclusive

Request all the cores on a node. This will set the allocation of cores on that node to however many cores there are in total on the node. Since you probably want all the memory on the node as well, you should usually specify “--mem=0” along with “--exclusive”.
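Put together in a job script, the two options might look like this sketch. SLURM_CPUS_ON_NODE is a standard SLURM variable giving the number of cores allocated to the job on the current node; the fallback value is only there so the script can be tried outside a job.

```shell
#!/bin/bash
# Sketch: take a whole node, cores and memory.
#SBATCH --exclusive
#SBATCH --mem=0        # request all of the memory on the node

# Set by SLURM inside a job; placeholder when the script is run by hand.
cores="${SLURM_CPUS_ON_NODE:-unknown}"
echo "Cores allocated on this node: ${cores}"
```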


sbatch
srun
salloc

When using sbatch, if you want to see what output your job is writing, you can either view or tail the output file, or you can (sometimes) use sattach. You can only sattach to a “job step”, which is created when you use srun from within a script executed by sbatch.

sattach

running_jobs.1634914952.txt.gz · Last modified: 2021/10/22 11:02 by root