====== Submitting Many Jobs ======

You may need to run many (possibly similar) tasks on the cluster. You can do this by submitting each job separately to the cluster; techniques for doing that are discussed below. You could also do it by submitting a small number of jobs (possibly even just one) and having those jobs execute the many tasks you need to run. You should try not to submit thousands of separate jobs into the queue.

To submit many jobs to the cluster you can:

  * Use the **-n** option on **sbatch** and **srun** within the script to start multiple copies of a program.
    * You can't exceed the number of cores available with this method (the job will be rejected).
  * You can get each of your tasks to do something a little different (e.g. processing a different file) by using the SLURM_PROCID environment variable, as in the sketch below.
  * The "%t" pattern in the output file name is replaced by the task number, so each task can write to its own output file.
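
For instance, a rough sketch of such a batch script (the file names and the worker command here are invented for illustration):

<code>
#!/bin/bash
#SBATCH -n 4

# srun starts one copy of the command per task, and gives each
# copy its own SLURM_PROCID (0, 1, 2, ...). The single quotes
# stop the batch script expanding the variable itself, so each
# task picks its own input file.
srun bash -c 'echo "task $SLURM_PROCID processing input_${SLURM_PROCID}.dat"'
</code>

A script like this is submitted with **sbatch**, for example:
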
<code>
sbatch -n 2 run_multi_job_%t
</code>

===== Limiting the Number of Nodes/Cores Your Jobs Use =====

If you submit many jobs to the cluster (perhaps using individual **sbatch** commands) they will use any of the nodes/cores available on the cluster (in the partition to which you submitted your job). If you submit, say, 1000 jobs you may well fill all available cores on the cluster and leave no resources for other users. So it is a good idea to limit the number of jobs that are running at any one time. An **array job** is an ideal way of doing this: it makes only a single entry in the job queue (rather than 1000 in our example) and it lets you specify the maximum number that should be run at any one time.

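For example, the following (with a made-up script name) submits a 1000-task array job but allows at most 20 of the tasks to run at the same time:

<code>
sbatch --array=1-1000%20 my_array_job.sh
</code>

The "%20" after the index range is what limits the number of simultaneously running array tasks.
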
Array jobs are described in more detail on the [[sbatch]] page.

Some other options that can help are described below.

==== Singleton Jobs ====

You can use the **--dependency** option of sbatch to make slurm run just one of your jobs at a time. If you submit 10 jobs with the same job name and the **--dependency=singleton** option, slurm will run them one at a time:

<code>
for i in $(seq 1 10); do
    sbatch --job-name oneatatime --dependency=singleton my_script.bash file_${i}.fasta
done
</code>

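Each waiting job sits in the PENDING state until the running job with the same name finishes; **squeue** shows "Dependency" as the reason.
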
==== Sending a Job to a Specific Node ====

You can use the **-w** option to select a specific node. Actually it asks for "at least" the nodes in the node list you specify. So a command like:

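<code>
sbatch -w node[2-4] my_script
</code>
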
would allocate one core on each of nodes 2, 3 and 4 and run my_script on node2. You would then use srun from within your script to run job steps within this allocation of cores. Don't use this to limit the nodes you want your jobs to run on.

==== Excluding Some Nodes ====

You can use the **-x** option to avoid specific nodes. A list of node names looks like this:

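<code>
node2,node[4-6]
</code>

Here "node[4-6]" is shorthand for node4, node5 and node6, so **sbatch -x node2,node[4-6] my_script** would keep the job off those four nodes.
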
You can use the **--exclusive** option to ask for exclusive access to all the nodes your job is allocated. This is especially useful if you have a program which attempts to use all the cores it finds. Please only use it if you need it.

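For instance (with a made-up script name):

<code>
sbatch --exclusive my_greedy_program.sh
</code>
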
==== Using Job Steps ====

Files: test-multi.sh, pause.sh

If you use **srun** within an **sbatch** script, the cores to be used for the jobs being srun can be sub-allocated from the cores allotted to the sbatch script. For example, a test-multi.sh script along the following lines runs seven short job steps (each running pause.sh) inside an allocation of three tasks:

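<code>
#!/bin/bash
#SBATCH --ntasks=3    # allocation size chosen for illustration

# Run 7 job steps, each as a single task sub-allocated from the
# job's 3 tasks; the "&" puts each srun into the background.
for i in $(seq 1 7); do
    echo $i
    srun --ntasks 1 ./pause.sh &
done

# Wait for all the background job steps to finish before exiting.
wait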
</code>

The "--ntasks 1" option on each srun makes the job step use just one task (one core) of the job's allocation.

The "srun" commands start the job steps, with their cores sub-allocated from the cores allotted to the sbatch script; if no cores are currently free within the allocation, a step waits until one becomes available.

The "&" at the end of each srun line puts the job step into the background, so the loop can carry on and start the next step without waiting for the previous one to finish.

Similarly, the "wait" at the end of the script makes the batch job wait for all the background job steps to complete before it exits; without it the job would end and kill any steps still running.

Some example code for pause.sh could be as follows:

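<code>
#!/bin/bash
# A stand-in body for pause.sh, invented for illustration:
# report where the step is running, pause, then report again.
echo "step starting on $(hostname)"
sleep 30
echo "step finished"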
</code>

Going through the output of this job, we see the 7 iterations of the loop in test-multi.sh being executed (the "echo $i" in the loop prints the numbers 1-7). These happen in quick succession because of the "&": the loop does not wait for each job step to finish before starting the next one.

So the "pause.sh" job steps run three at a time: each step uses one of the three tasks allocated to the job, and as each step finishes the next one starts on the freed core.

This technique also works "across nodes": if the cores allocated to your job are spread over several nodes, the job steps are started wherever cores are free, although you may see warnings like this:

<code>
srun: Warning: can't run 1 processes on 3 nodes, setting nnodes to 1
</code>