Running Jobs

You should submit all long-running or computationally intensive jobs to the compute nodes. 
You can write and test your code, make trial runs of programs, format data, etc. 
on the head node, but please don't bog it down.

In general, the command you should use to run jobs on the compute nodes is sbatch.

sbatch

sbatch submits a script to a partition (standard by default) and exits immediately. Your script will be executed when resources are available (often immediately). Output from the script (both stdout and stderr) is redirected to a file, by default slurm-<jobid>.out in the directory you submitted from.

Use the -p option to specify a partition.

sbatch -p bigmem huge_alignment
sbatch -p gpu fancy-gpu-code

You can also submit an “array job”. This submits the same job multiple times. You can use a SLURM-defined environment variable to distinguish between copies of the job from within the job script.
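As a minimal sketch of an array job script: the --array option and the SLURM_ARRAY_TASK_ID variable are standard SLURM, but the index range and the file naming scheme below are assumptions made up for illustration.

```shell
#!/bin/bash
#SBATCH --array=1-10          # run ten copies of this script, indexed 1..10
#SBATCH -p standard           # the default partition, stated explicitly

# SLURM sets SLURM_ARRAY_TASK_ID in each copy of the job. Default to 1 so
# the script can also be tried by hand outside SLURM.
task_id="${SLURM_ARRAY_TASK_ID:-1}"

# Hypothetical naming scheme: one input and one output file per array index.
input_file="sample_${task_id}.fastq"
output_file="result_${task_id}.txt"

echo "Array task ${task_id}: would process ${input_file} into ${output_file}"
```

Each copy sees a different index, so each one works on a different file. Replace the echo with the real command once the naming scheme matches your data.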

You can run programs using srun but you need to take action to prevent your program stopping on “hang-up” (loss of network connection).

By default, when you submit a job it will be allocated one core and 8GB of RAM on a compute node. That core and memory are then unavailable to other cluster users. The cluster currently has about 1000 cores in total. Compute nodes have at least 8GB of memory per core.

Files: ex1.bash

To alter your allocation you can use the -c, -n, -N, and --exclusive options (the long versions of the first three are --cpus-per-task, --ntasks, and --nodes). But please see the notes about the -n and -N options below: they are probably not what you want.


-c, --cpus-per-task

cpus==cores in our configuration.

This option sets how many cores each of your tasks will use. (By default you are running one task.) Since the nodes in the standard partition have a maximum of 32 cores, you can't get an allocation of more than 32 cores per task in that partition; a job asking for more would be rejected.

To get an entire node to yourself, regardless of the number of cores on the node, you can use the --exclusive option.

You should use the -c option if you are running a job that allows you to specify how many threads to run (or cores to use). So, if you have an option on the program you are running to say “use 8 cores” you should also tell SLURM that your program is using 8 cores. If you don't do this, your job will only be allowed to use 1 core (and the threads started by your program will be time-sliced on that one core, probably making it run much slower than you expected).
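The matching of the two core counts can be sketched as follows. SLURM_CPUS_PER_TASK is a standard variable that SLURM sets when -c is given; the program name my_aligner, its --threads flag and the input file are hypothetical placeholders.

```shell
#!/bin/bash
#SBATCH -c 8                  # ask SLURM for 8 cores for this job's single task

# SLURM sets SLURM_CPUS_PER_TASK when -c is given. Fall back to 1 so the
# script can also be tried by hand outside SLURM.
threads="${SLURM_CPUS_PER_TASK:-1}"

# my_aligner and its --threads flag are hypothetical; substitute your own
# program and its thread-count option.
command_line="my_aligner --threads ${threads} reads.fastq"

# Printed rather than executed here, since my_aligner does not exist.
echo "Would run: ${command_line}"
```

Taking the thread count from SLURM_CPUS_PER_TASK keeps the program and the allocation in step: changing the -c line is then the only edit needed.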


--exclusive

Request all the cores on a node. This will set the allocation of cores on that node to however many cores there are in total on the node. Since you probably want all the memory on the node as well, you should usually specify “--mem=0” along with “--exclusive”.
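A whole-node request might look like the sketch below. The two #SBATCH options are the ones described above; SLURM_CPUS_ON_NODE is a standard SLURM variable, and the fallback to nproc is only an assumption so that the script can be tried outside a job.

```shell
#!/bin/bash
#SBATCH --exclusive           # all cores on whichever node is allocated
#SBATCH --mem=0               # all memory on that node

# Inside a job, SLURM_CPUS_ON_NODE holds the number of cores allocated on
# this node. Fall back to nproc when the script is run by hand.
cores="${SLURM_CPUS_ON_NODE:-$(nproc)}"

echo "Running on $(uname -n) with ${cores} cores"
```

Because node sizes differ, reading the core count at run time is safer than hard-coding 32.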


--mem

Use the --mem option to request memory for your job, e.g. --mem=25G will request 25GB of RAM. You can, and should, request less than your default allocation of 8GB if you don't need 8GB. This frees up that memory for other users.

“--mem=0” requests all memory on a node. The actual amount allocated will depend on the node: our compute nodes have a minimum of 128GB of RAM, and at least 8GB per core.
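A small job that asks for less than the default could be sketched like this. The 2G figure is an arbitrary example. SLURM_MEM_PER_NODE is a standard SLURM variable mirroring --mem; to the best of my knowledge it is reported in megabytes, but treat that unit as an assumption.

```shell
#!/bin/bash
#SBATCH --mem=2G              # request 2GB instead of the 8GB default

# Inside a job, SLURM_MEM_PER_NODE mirrors the --mem request (assumed to be
# in megabytes). Outside SLURM it is unset, so report "unknown".
mem_mb="${SLURM_MEM_PER_NODE:-unknown}"

echo "Memory requested for this job (MB): ${mem_mb}"
```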

-n, --ntasks

This option is useful in very specific circumstances, and is not typically what users of the BRC cluster want. Avoid it unless you are sure it is what you want!

This option specifies the maximum number of tasks you will be running. This can spread your allocation across multiple nodes: for example, at one core per task, -n 40 cannot fit on a single 32-core node in the standard partition, so it must use more than one node.

It behaves a little differently between sbatch and srun.

For sbatch, it will allocate the number of cores specified, possibly across multiple nodes, and run your script on the first of those nodes. It is then up to that script to make use of the other allocated cores by using srun from within itself. When you use srun from within the script, it will start one copy of the srun task on each of the allocated cores, i.e. each srun command runs multiple copies of the same task.

For srun, it will cause one copy of your task to be run on each allocated core, i.e. the one srun command runs multiple copies of the same task.
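The sbatch behaviour described above can be sketched as a batch script that calls srun itself. SLURM_PROCID and SLURM_NTASKS are standard SLURM variables; the task count of 4 and the fallback branch for running without SLURM are assumptions for illustration.

```shell
#!/bin/bash
#SBATCH -n 4                  # four tasks, one core each, possibly on several nodes

# The batch script itself runs once, on the first allocated node.
echo "Batch script running on $(uname -n)"

# srun starts one copy of this command per allocated task. SLURM_PROCID
# (0 to ntasks-1) lets each copy pick its own share of the work.
per_task='echo "task ${SLURM_PROCID:-0} of ${SLURM_NTASKS:-1} on $(uname -n)"'

if command -v srun >/dev/null 2>&1; then
    srun bash -c "$per_task"
else
    # Outside SLURM: run a single copy so the script can still be tried by hand.
    bash -c "$per_task"
fi
```

Without something like SLURM_PROCID to tell the copies apart, all four tasks would do exactly the same work.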


-N, --nodes

This option has the same caveats as “-n” (see above): it behaves differently between sbatch and srun, and may well not be what you want.

This option specifies how many nodes you want your tasks to run on. It does not allocate whole nodes to you. Using -N 2 with none of the other relevant options would give you 1 core on each of two different nodes.

For sbatch, your script runs on only one of those nodes, so it is up to the script to make sure the cores on the other nodes are actually used. For srun, your task will be run as many times as the number of nodes you specified: it will be up to you to make sure your tasks don't all do exactly the same thing.


sbatch
srun
salloc

When using sbatch, if you want to see what output your job is writing you can either open or tail the output file, or you can (sometimes) use sattach. You can only sattach to a “job step”, which is created when you use srun from within a script executed by sbatch.
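Checking on a job's output could be sketched as below. The file name follows sbatch's default slurm-<jobid>.out pattern; the job ID 12345 is a placeholder, and the step number 0 in the sattach comment assumes the first srun in the script.

```shell
#!/bin/bash
# Job ID as printed by sbatch ("Submitted batch job 12345").
# 12345 is a placeholder; pass your own ID as the first argument.
job_id="${1:-12345}"
output_file="slurm-${job_id}.out"   # sbatch's default output file name

if [ -f "$output_file" ]; then
    # Show the last 20 lines; use "tail -f" instead to keep following the file.
    tail -n 20 "$output_file"
else
    echo "No ${output_file} in $(pwd)"
fi

# To attach to a running job step instead (step 0 is the first srun):
#   sattach "${job_id}.0"
```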

sattach

running_jobs.txt · Last modified: 2021/10/22 12:46 by root