===== Enforcing Core Counts =====
  
Jobs running on the cluster have access to exactly as many cores as requested in the job submission. If you submit a job requesting 1 core and start a program that uses 10 threads, all 10 threads will be time-sliced on that 1 core. Your job might run something like 10 times slower than you expected it to!
  
So, you should be careful to specify how many cores your job needs. See the "-c" or "--cpus-per-task" option to **sbatch**.
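
For example, a minimal job script for a program that runs eight threads might look something like the sketch below. Here "my_threaded_program" and its "--threads" option are placeholders, not software installed on the cluster:

<code>
#!/bin/bash
#SBATCH --job-name=threads-demo
#SBATCH --cpus-per-task=8       # allocate 8 cores for this task
#SBATCH --time=01:00:00
#SBATCH --mem=4G

# SLURM exports SLURM_CPUS_PER_TASK when --cpus-per-task is specified,
# so the program starts exactly as many threads as cores were allocated.
./my_threaded_program --threads=$SLURM_CPUS_PER_TASK
</code>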
  
Before the 2021 cluster update, the cluster did not enforce this core count restriction. If you suspect that a job you previously ran on the old cluster used more cores than were allocated to it, you might be able to get some idea of whether that is true by using **sacct** to compare the elapsed time the job took to run with the total CPU time it used, for example:
  
<code>
# Replace <jobid> with the ID of the job you want to examine.
sacct -j <jobid> --format=JobID,AllocCPUS,Elapsed,CPUTime,TotalCPU
</code>
  
If the TotalCPU time is much greater than the CPUTime, it is likely that your job was using more cores than were allocated to it. The number of cores it was actually using can be estimated by converting the two times into minutes or seconds and dividing TotalCPU by CPUTime. The CPUTime is equal to the number of allocated cores multiplied by the elapsed time; the TotalCPU time is the amount of CPU time (user plus system) the job used as measured by the OS.
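
As an illustration with made-up numbers, suppose a job was allocated 1 core and ran for 2 hours, but **sacct** reports a TotalCPU of about 16 hours:

<code>
CPUTime  = allocated cores x elapsed time = 1 x 2 h = 2 h
TotalCPU = 16 h
Estimated cores actually in use = TotalCPU / CPUTime = 16 / 2 = 8
</code>

In that case the job was probably keeping roughly 8 cores busy even though only 1 was allocated.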

Some programs "decide for themselves" how many threads/cores to use (e.g. Ropen with the MKL does this by default). So you may see your code unexpectedly running slower on the new cluster if you weren't paying attention to this.
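
If a program takes its thread count from the usual OpenMP or MKL environment variables, one way to keep it within its allocation is to set those variables from the SLURM environment in your job script. This is a generic sketch, assuming the program honours these variables:

<code>
# Cap common threading libraries at the number of allocated cores.
# SLURM_CPUS_PER_TASK is set by SLURM when --cpus-per-task is specified.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export MKL_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
</code>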
  