====== running_jobs ======

Last modified: 2021/10/22 12:46 by root
  
<code>
You should submit all long-running, or computationally intensive jobs to the compute nodes.
You can write and test your code, and make trial runs of programs, format data, etc.,
on the head node, but please don't bog it down.
</code>
  
You can also submit an "array job". This submits the same job multiple times. You can use a SLURM-defined environment variable to distinguish between copies of the job from within the job script.
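As a sketch (the input-file naming and the 1-10 range here are hypothetical, not from the cluster documentation), an array job script might look like:

<code bash>
#!/bin/bash
#SBATCH --array=1-10

# SLURM sets SLURM_ARRAY_TASK_ID to a different value (1..10) in each
# copy of the job, so each copy can select its own input file.
INPUT="data_${SLURM_ARRAY_TASK_ID}.txt"
echo "Task ${SLURM_ARRAY_TASK_ID} processing ${INPUT}"
</code>

Submit it once with **sbatch** and SLURM queues the ten copies for you.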
  
You can run programs using **srun** but you need to take action to prevent your program stopping on "hang-up" (loss of network connection).
  
By default when you submit a job it will be allocated one core and 8GB of RAM on a compute node. This core and memory is then unavailable to other cluster users. The cluster currently has about 1000 cores in total. Compute nodes have at least 8GB of memory per core.
  
**Files:** ex1.bash
  
To alter your allocation you can use the **-c**, **-n**, **-N**, and **--exclusive** options (the long versions of these options are: --cpus-per-task, --ntasks, --nodes). But please see the notes about the **-n** and **-N** options below - they are probably not what you want.
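For example (a sketch only; the 4-core/4GB figures are arbitrary), options can be given on the **sbatch** command line or as #SBATCH lines in the script itself:

<code bash>
#!/bin/bash
# Equivalent to submitting with: sbatch -c 4 --mem=4G ex1.bash
#SBATCH --cpus-per-task=4   # long form of -c
#SBATCH --mem=4G            # 4GB rather than the default 8GB

# Inside the job, SLURM exports the allocation size:
echo "Allocated ${SLURM_CPUS_PER_TASK:-1} core(s)"
</code>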
  
----
**cpus==cores** in our configuration.
  
This option sets how many cores each of your tasks will use. (By default you are running one task.) Since the nodes in the **standard** partition have a maximum of 32 cores you can't get an allocation of more than 32 cores per task (in that partition). The job would be rejected.
  
To get an entire node to yourself, regardless of the number of cores on the node, you can use the **--exclusive** option.
  
You should use the **-c** option if you are running a job that allows you to specify how many threads to run (or cores to use). So, if you have an option on the program you are running to say "use 8 cores" you should also tell SLURM that your program is using 8 cores. If you don't do this, your job will only be allowed to use 1 core (and the threads started by your program will be time-sliced on that one core, probably making it run much slower than you expected).
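One way to keep the two numbers in step (a sketch; **my_program** and its --threads flag are hypothetical stand-ins for your own program's thread option) is to read SLURM's variable rather than hard-coding the count twice:

<code bash>
#!/bin/bash
#SBATCH -c 8    # tell SLURM this task will use 8 cores

# SLURM_CPUS_PER_TASK matches the -c value, so changing the #SBATCH
# line above automatically changes the thread count passed below.
my_program --threads "${SLURM_CPUS_PER_TASK:-1}" input.dat
</code>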
  
----

=== --exclusive ===

Request all the cores on a node. This will set the allocation of cores on that node to however many cores there are in total on the node. Since you likely want all the memory on the node as well, you should also specify "--mem=0" along with "--exclusive".

----

=== --mem ===

Use the --mem option to request memory for your job, e.g. --mem=25G will request 25GB of RAM. You can, and should, request less than your default allocation of 8GB if you don't need 8GB. This frees up that memory for other users.

"--mem=0" requests all memory on a node. The actual amount allocated will depend on the node: our compute nodes have a minimum of 128GB of RAM, and at least 8GB per core.
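Putting those two options together on the command line (a sketch; ex1.bash is the example script from above):

<code bash>
# A whole node, including all of its memory:
sbatch --exclusive --mem=0 ex1.bash

# A small job that doesn't need the default 8GB:
sbatch --mem=2G ex1.bash
</code>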
  
=== -n, --ntasks ===
  
**This option is useful in very specific circumstances, and is not typically what users of the BRC cluster want. Avoid it unless you are sure it is what you want!**
  
This option specifies the number of tasks you will be running (maximum). This can spread your allocation across multiple nodes: **-n 40** must use more than one node at one core per task (in the standard partition).
  
It behaves a little differently between sbatch and srun.

For sbatch it will allocate the number of cores specified, possibly across multiple nodes, and run your script on the first of those nodes. It is then up to that script to make use of the other allocated cores by using srun from within itself. When you use srun from within the script, it will start one copy of the srun task on each of the allocated cores, i.e. it will run multiple copies of the same task for each srun command.

For srun, it will cause one copy of your task to be run on each allocated core, i.e. it will run multiple copies of the same task for the one srun command.
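A sketch of the sbatch case (the 4-task figure is arbitrary):

<code bash>
#!/bin/bash
#SBATCH -n 4    # 4 tasks, one core each, possibly spread over nodes

# This script itself runs once, on the first allocated node.
# Each srun below starts 4 copies of its command, one per core:
srun hostname   # prints one hostname per allocated core
</code>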
  
----
=== -N, --nodes ===
  
**This option has the same caveats as "-n" (see above): it behaves differently between sbatch and srun, and may well not be what you want.**
  
This option specifies how many nodes you want your tasks to run on. It does not allocate whole nodes to you. Using **-N 2** with none of the other relevant options would give you 1 core on each of two different nodes.
  
But, for sbatch, it will be up to your task to make sure that core is actually used. For srun, your task will be run as many times as nodes you specified: it will be up to you to make sure your tasks don't all do exactly the same thing.
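With srun you can use a SLURM-set variable such as SLURM_NODEID to stop the copies doing identical work (a sketch; the chunk-file naming is hypothetical):

<code bash>
#!/bin/bash
#SBATCH -N 2    # one core on each of two nodes

# SLURM_NODEID is 0 on the first node and 1 on the second, so each
# copy started by srun can pick a different piece of the work:
srun bash -c 'echo "node ${SLURM_NODEID}: processing chunk_${SLURM_NODEID}.dat"'
</code>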
  
----