This command submits a job to run “in the background”. Output is written to a file named slurm-NNNNN.out by default, where NNNNN is the job number SLURM assigns to your job.
sbatch [-p partition] [-c ncores] [--mem=NNNG] [--exclusive] scriptname
“--exclusive” requests all cores on a node; use it only when you need to. “-c” specifies how many cores your job will use. (Use only one of “-c” and “--exclusive”.) “--mem=0” requests all memory on a node. “scriptname” must be the name of a shell script (but see the “--wrap” option).
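As a sketch, a minimal batch script might look like the following (the script, program, and input file names are hypothetical; “standard” is used as the partition name because it appears elsewhere on this page):

#!/bin/bash
#SBATCH -p standard        # partition to run in
#SBATCH -c 4               # number of cores
#SBATCH --mem=16G          # memory for the whole job
./my_program input.dat     # hypothetical program and input

Submit it with “sbatch myscript.sh”; the #SBATCH lines supply the same options you could otherwise give on the sbatch command line.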
For submitting many jobs look at using the “--array” option.
sbatch --array=0-99%5 scriptname
This will run 100 copies of scriptname in total, but only allow 5 to be running at any one time. The script itself must work out how to do something different in each instance, using the SLURM_ARRAY_TASK_ID environment variable.
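For example, a sketch of an array script in which each task processes its own input file (the program and file names are hypothetical):

#!/bin/bash
#SBATCH --array=0-99%5
# Each task sees its own index (0-99 here) in SLURM_ARRAY_TASK_ID
./my_program input_${SLURM_ARRAY_TASK_ID}.dat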
On the new cluster you may need to use the “--mem” or “--mem-per-cpu” options to make sure that sufficient memory is allocated to your task. The default is 8GB per cpu/core.
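For example, to request 4 cores with 16GB per core (the numbers are illustrative):

sbatch -c 4 --mem-per-cpu=16G scriptname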
Check the job queue (all jobs, just your own, or the jobs on a particular node):

squeue
squeue -u USERNAME
squeue -w NODENAME
Check the state of partitions and nodes:

sinfo
sinfo -p standard -N -O "partition,nodelist,cpus,memory,cpusload"
use_by_user
“use_by_user” is a script that runs “scontrol” to get the information it reports.
scancel JOBID
scancel -u USERNAME
Note: not “skill”, which does exist but isn't part of SLURM. The second of these commands would kill all of your SLURM jobs.
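If you use job arrays, individual tasks can be cancelled by suffixing the task index; for example, to cancel task 5 of an array job:

scancel JOBID_5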
To show details of a job (this works for running, or very recently completed, jobs):
scontrol show job JOBID
This command will show how much (elapsed) time and memory a job used. This information is kept in the SLURM accounting database so the command can be used long after the job has completed.
sacct -o elapsed,maxrss -j NNNN.batch
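sacct can report many other fields; “sacct -e” lists them all. For example, to also see the job's state and the memory it requested:

sacct -o elapsed,maxrss,reqmem,state -j NNNN.batch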
Check on resources used by all of your jobs since a specific date:
sacct --user=chris -S 2020-01-01 -o elapsed,maxrss
To get information about a node including how many cores and how much memory it has:
scontrol show node node62
srun [-p partition] [-c ncores] [--exclusive] program
srun --pty bash -i
The second command above will get you a command line on a node. You can use the “-w” option to target a specific node. (Note that you will only get the command line if there is a free core on the node in question.) You can use this to check on your job's status, e.g. how much memory and how many cores it is using. This can also be done more programmatically in your scripts, or with sstat (for memory use), but this command-line technique can be useful sometimes.
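For example, to check the memory use of a running job with sstat (the .batch suffix selects the batch step of the job; sstat only works while the job is running):

sstat -j JOBID.batch -o maxrss,avecpu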
As a matter of etiquette, please don't start a shell on a node and then leave it running when you aren't using it. This reduces the number of nodes available for exclusive use by users who need one.
Check how much space is left on your home volume:
chris@node0:~$ cd
chris@node0:~$ pwd
/home3/chris
chris@node0:~$ df -H /home3
Filesystem                     Size  Used Avail Use% Mounted on
fs2:/srv/storage_2/node-home3  105T   91T   14T  88% /home3
chris@node0:~$
You should check this before you add a lot more data to your home directory, or run jobs that generate a lot of output. If you need more space than is available on your home volume please talk to the system administrators: we may be able to give you space on a different volume. Consider using /scratch for data that can easily be replaced (e.g. data downloaded from NCBI).
See space remaining on all home volumes:
chris@node0:~$ df -H /home*
Filesystem                     Size  Used Avail Use% Mounted on
fs1:/srv/storage_1/node-home    40T   33T  7.5T  82% /home
fs2:/srv/storage_1/node-home   105T   96T  8.6T  92% /home2
fs2:/srv/storage_2/node-home3  105T   91T   14T  88% /home3
fs3:/srv/storage_1/node-home    81T   51T   30T  63% /home4
fs4:/srv/storage_0/node-home5  118T   39T   79T  33% /home5
fs4:/srv/storage_1/node-home6   98T     0   98T   0% /home6
To check how much disk space a directory is using:
chris@node0:~$ du -sh torch
3.6G    torch
(This can take a long time if there are many files in the directory.)
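To see which subdirectories are taking up the space, one approach (using GNU du and sort) is:

du -h --max-depth=1 torch | sort -h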