
Enforcing Core Counts

On the old cluster, SLURM was configured to track the cores allocated on each node, but it did not prevent a job from using more cores than it requested in the submission. For example, a job could be submitted with the default 1-core allocation but run a program that started 10 threads and actually used 10 cores. This occasionally led to nodes being bogged down.

On the new cluster, jobs have access to exactly as many cores as requested in the job submission. If you submit a job requesting 1 core and start a program that uses 10 threads, all 10 threads will be time-sliced on that 1 core, and your job may run roughly 10 times slower than you expected.

So you should take more care in specifying how many cores your job needs. See the "-c" or "--cpus-per-task" option to sbatch.
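As a sketch (the script and program names here are just placeholders), a submission script for a program that runs 10 threads might look like this, with the thread count tied to the allocation via the SLURM_CPUS_PER_TASK environment variable:

  #!/bin/bash
  #SBATCH --job-name=threaded_job
  #SBATCH --cpus-per-task=10

  # Ask the program to use exactly the allocated cores (OpenMP-style example)
  export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
  ./my_threaded_program

Keeping the program's thread count equal to the requested core count avoids both time-slicing (too many threads) and wasted allocation (too few).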

If you suspect that a job you previously ran on the old cluster used more cores than were allocated to it, you may be able to tell whether that is true by using sacct to compare the elapsed time the job took to run against the total CPU time it used.

sacct --user=chris -S 2020-01-01 --format=JobID,JobName,AllocCPUS,elapsed,CPUTime,TotalCPU

The CPUTime field is the number of allocated cores multiplied by the elapsed time, while TotalCPU is the amount of CPU time actually consumed as measured by the OS (user plus system CPU time). If TotalCPU is much greater than CPUTime, your job was likely using more cores than allocated. You can estimate how many cores it was actually using by converting both times to the same units (minutes or seconds) and dividing TotalCPU by CPUTime.
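For example, with made-up numbers, suppose sacct reported the following for a job that was allocated 1 core:

  JobID   JobName  AllocCPUS  Elapsed   CPUTime   TotalCPU
  1234    mysim    1          02:00:00  02:00:00  19:30:00

Here TotalCPU divided by CPUTime is about 19.5 / 2 ≈ 10, so the job was probably running around 10 threads on its single allocated core and should be resubmitted with "--cpus-per-task=10" (or the program configured to use fewer threads).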
