Differences

This shows you the differences between two versions of the page.

--- submitting_many_jobs [2020/03/11 17:11]
root
+++ submitting_many_jobs [2021/02/20 09:51] (current)
root
@@ Line 1: / Line 1: @@
 ====== Submitting Many Jobs ======
+You may need to run many (possibly similar) tasks on the cluster. You can do this by submitting each job separately to the cluster; techniques for doing that are discussed below. You could also do it by submitting a small number of jobs (possibly even just one) and having those jobs execute the many tasks you need to run. You should try not to submit thousands of separate jobs into the queue.
 To submit many jobs to the cluster you can:
@@ Line 18: / Line 20: @@
   * Use the **-n** option on **sbatch** and **srun** within the script to start multiple copies of a program.
-    * You can't exceed the number of cores available with this method (the job will be rejected). (But see the **overcommit** option.)
+    * You can't exceed the number of cores available with this method (the job will be rejected).
     * You can get each of your tasks to do something a little different (e.g. processing a different file) by using the SLURM_PROCID environment variable.
     * The "-l" option on srun will label output lines with the task number.
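As a minimal sketch of the SLURM_PROCID technique mentioned in the list above, the following script picks a different input file for each task. The function name pick_input and the file naming scheme input_N.txt are hypothetical, chosen only for illustration; the fallback to 0 is an assumption so the script can be tried outside of slurm.

```bash
#!/bin/bash
# Sketch: choose a different input file for each task.
# SLURM_PROCID is set by slurm for each task started by srun (0, 1, 2, ...).
# The file naming scheme input_<N>.txt is an assumption for illustration.

pick_input() {
    # Fall back to 0 so the script can be tried outside of slurm.
    local task_id="${SLURM_PROCID:-0}"
    echo "input_${task_id}.txt"
}

echo "Task ${SLURM_PROCID:-0} would process $(pick_input)"
```

If this were saved as, say, pick_input.sh (a hypothetical name) and started with "srun -n 4 -l ./pick_input.sh" inside an sbatch script, each of the four tasks should report a different file, with the "-l" option labelling each output line with its task number.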
@@ Line 37: / Line 39: @@
 <code>
-sbatch -n 2 run_multi_job_o
+sbatch -n 2 run_multi_job_%t
 </code>
-==== Limiting the Number of Nodes Your Jobs Use ====
+===== Limiting the Number of Nodes/Cores Your Jobs Use =====
 If you submit many jobs to the cluster (perhaps using individual **sbatch** commands) they will use any of the nodes/cores available on the cluster (in the partition to which you submitted your job). If you submit, say, 1000 jobs you may well fill all available cores on the cluster and leave no resources for other users. So it is a good idea to limit the number of jobs that are running at any one time. An **array job** is an ideal way of doing this: it makes only a single entry in the job queue (rather than 1000 in our example) and it lets you specify the maximum number that should be run at any one time.
@@ Line 47: / Line 49: @@
 Array jobs are described in more detail on the [[sbatch]] page.
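As a rough sketch of the array job approach, the script below asks for 1000 sub-tasks with at most 20 running at any one time (the "%20" suffix on the **--array** option). The job name, the helper function input_for_task, and the file naming scheme are hypothetical; the default task ID of 1 is an assumption so the script can be tried outside of slurm.

```bash
#!/bin/bash
#SBATCH --job-name=array_sketch
#SBATCH --array=1-1000%20   # 1000 sub-tasks, at most 20 running at once

# Map a sub-task number to an input file (naming scheme is hypothetical).
input_for_task() {
    echo "file_${1}.fasta"
}

# SLURM_ARRAY_TASK_ID is set by slurm for each sub-task of an array job.
# Default to 1 so the script can be tried outside of slurm.
task_id="${SLURM_ARRAY_TASK_ID:-1}"

echo "Array task ${task_id} would process $(input_for_task "${task_id}")"
```

Submitting this once with **sbatch** makes a single entry in the queue, and each sub-task sees its own value of SLURM_ARRAY_TASK_ID.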
-There are some other options that may be useful described below.
+Some other options for limiting the number of cores your jobs use are described below.
-=== Sending a Job to a Specific Node ===
+==== Singleton Jobs ====
+You can use the **--dependency** option of sbatch to make slurm run just one of your jobs at a time. If you submit 10 jobs with the same job name, the **--dependency=singleton** option makes slurm run them one at a time: each job starts only after any earlier jobs with the same name (and the same user) have finished.
+<code>
+for i in $(seq 1 10); do
+sbatch --job-name oneatatime --dependency=singleton my_script.bash file_${i}.fasta
+done
+</code>
+==== Sending a Job to a Specific Node ====
 You can use the **-w** option to select a specific node. Actually it asks for "at least" the nodes in the node list you specify. So a command like:
@@ Line 67: / Line 79: @@
 would allocate one core on each of nodes 2, 3, and 4 and run my_script on node2. You would then use srun from within your script to run job steps within this allocation of cores. Don't use this to limit the nodes you want your jobs to run on.
-=== Excluding Some Nodes ===
+==== Excluding Some Nodes ====
 You can use the **-x** option to avoid specific nodes. A list of node names looks like this:
@@ Line 81: / Line 93: @@
 You can use the **--exclusive** option to ask for exclusive access to all the nodes your job is allocated. This is especially useful if you have a program which attempts to use all the cores it finds. Please only use it if you need it.
-You can also use an array job to limit the number of sub-tasks running at any one time. This is a very good choice, and is highly recommended,  if you are submitting a lot of jobs. An array job occupies just one slot in the "pending jobs" section of the output of **squeue**.
+==== Using Job Steps ====
-=== Using Job Steps ===
+Files: test-multi.sh, pause.sh
 If you use **srun** within an **sbatch** script, the cores to be used by the srun job steps can be sub-allocated from the cores allotted to the sbatch script. For example, the following script, test-multi.sh, which is to be submitted using **sbatch**, specifies 3 tasks (each defaulting to one core).
@@ Line 108: / Line 120: @@
 </code>
-The "&" at the end of the srun line is important - else each srun will block, causing the srun steps to be executed consecutively.
+The "--ntasks 1" on each srun command is important because without it srun will start the pause.sh script on each of the cores allocated to the sbatch command (3 in this case).
-Similarly the "wait" at the end of the script is important - it stops the script from exiting before all the (now background) srun jobs have finished. Without the wait some of the jobs would likely still be running at the end of the script and would be terminated when their parent script ends.
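+The "--exclusive" on each srun command is important, as discussed above: it gives each job step cores of its own, so a step that cannot get a free core waits rather than sharing one with a step that is already running.
+The "&" at the end of the srun line is important: otherwise each srun will block, causing the srun steps to be executed consecutively.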
+Similarly the "wait" at the end of the script is important: it stops the script from exiting before all the (now background) srun jobs have finished. Without the wait some of the jobs would likely still be running at the end of the script and would be terminated when their parent script ends.
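The effect of the "&" and the "wait" can be tried without slurm. The sketch below shows the shell mechanics only: the function task is a hypothetical stand-in for an "srun --ntasks 1 --exclusive ..." line, and the one second sleep is an arbitrary choice.

```bash
#!/bin/bash
# Shell mechanics only: "task" stands in for an srun job step.
task() {
    sleep 1
    echo "task $1 done"
}

start=$SECONDS
for i in 1 2 3; do
    task "$i" &      # "&" backgrounds the task so the loop does not block
done
wait                 # do not exit until every background task has finished
elapsed=$(( SECONDS - start ))
echo "all tasks finished after about ${elapsed}s"
```

Because the three tasks run concurrently, the total time should be close to one second rather than three. Removing the "&" makes the tasks run one after another, and removing the "wait" lets the script reach its end while the tasks are still running.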
 Some example code for pause.sh could be as follows.
@@ Line 159: / Line 175: @@
 So the "--ntasks 3" sbatch option, and the "--ntasks 1 --exclusive" on the srun command limited the number of processes running at any one time to 3.
-This technique also works "across nodes", i.e. if I specify "--ntasks 50" as an sbatch option I will get job steps run on multiple nodes (because the nodes do have fewer than 50 cores each). In this case you will see messages from slurm saying:
+This technique also works "across nodes", i.e. if I specify "--ntasks 50" as an sbatch option I will get job steps run on multiple nodes (because the nodes have fewer than 50 cores each). In this case you will see messages from slurm saying:
 <code>
 srun: Warning: can't run 1 processes on 3 nodes, setting nnodes to 1
 </code>

BRC Cluster Workshop
