====== Memory Use Monitoring ======

==== From the Command Line (on the head node) ====

From the head node you can use the **sstat** command to check on memory use (and other details) of your jobs.

<code>
sstat --format=JobID,MaxRSS 1234.batch
</code>
 +The ".batch" part of the command above is literal (i.e. it is not a placeholder for some other piece of information, actually type ".batch"). This selects the memory usage of the submitted slurm batch job. If your job uses job steps you can replace the word "batch" with the job-step number, or use the "--allsteps" option to get details for all job steps.

MaxRSS is the maximum amount of memory your job used (it is reported per node if your job is running across multiple nodes, which is not common with the software generally in use on our cluster).
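
sstat reports MaxRSS as a number with a unit suffix (commonly K for kibibytes). If you would rather see the figure in GiB, a small awk one-liner can convert it; the value below is a made-up sample, not real sstat output:

```shell
# Convert an sstat-style MaxRSS value (K/M/G suffix) to GiB.
# 1234567K is an illustrative sample value, not real output.
echo "1234567K" | awk '
  /K$/ { printf "%.2f GiB\n", $1 / (1024 * 1024) }
  /M$/ { printf "%.2f GiB\n", $1 / 1024 }
  /G$/ { printf "%.2f GiB\n", $1 + 0 }
'
# → 1.18 GiB
```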

The **sstat** command also works on the nodes themselves, so to record the maximum amount of memory your job used you could put the **sstat** command above as the last line of the script that you submit. The job id it needs is available in the SLURM_JOB_ID environment variable.

<code>
#!/bin/bash

... Commands needed to do your processing here ...

sstat --format=JobID,MaxRSS ${SLURM_JOB_ID}.batch
</code>

The output of the **sstat** command will then be written into the **slurm-NNNN.out** output file of your job.

Without the format option:

<code>
sstat 1234.batch
</code>

will give you a lot of detail about your job, including memory and CPU usage.

==== For Completed Jobs ====

You can look up details of jobs that you ran in the past using the **sacct** command. For instance:

<code>
sacct -j 1031100 -o "JobID,JobName,MaxRSS,Elapsed"
</code>

would report on job number 1031100, giving the job name, maximum memory used, and elapsed time the job took to run.

You will likely have a record of your job numbers in the names of the **slurm-NNNNNN.out** files that record the output of your jobs.
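
Since the job number is embedded in the file name, you can recover it with plain shell parameter expansion (the file name below is just an example):

```shell
# Extract the job id from a slurm output file name.
f="slurm-1031100.out"
jobid="${f#slurm-}"      # strip the "slurm-" prefix
jobid="${jobid%.out}"    # strip the ".out" suffix
echo "$jobid"            # → 1031100
# The id can then be fed to sacct, e.g.:
# sacct -j "$jobid" -o "JobID,JobName,MaxRSS,Elapsed"
```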

==== From the Node ====

=== Whole Node Monitoring ===

Suppose you have a program to run and you expect it to use a lot of memory, but are unsure of exactly how much.

  * Run it on a node reserved exclusively for you.
    * Use the **--exclusive** option to get a whole node.
  * Use **sstat** as described above.
  * Run a background program to monitor memory use on that node.

<code>
srun memmon
</code>

leakmem allocates 1 MB of RAM every second (for up to 10000 seconds).

We use the SLURM_JOB_ID environment variable to get a unique name for the log file.
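
The memmon script itself lives on the cluster and is not reproduced here. As a rough, hypothetical sketch of what such a whole-node monitor can look like (with an arbitrary sample count and interval), the following appends timestamped MemAvailable readings from /proc/meminfo to a log named after the job id:

```shell
#!/bin/bash
# Hypothetical memmon-style sketch, NOT the cluster's actual memmon script.
# Falls back to "manual" when run outside a Slurm job.
LOG="memmon-${SLURM_JOB_ID:-manual}.log"

for i in 1 2 3; do    # a real monitor would loop until the job ends
    printf '%s %s\n' "$(date +%s)" \
        "$(awk '/^MemAvailable/ { print $2, $3 }' /proc/meminfo)" >> "$LOG"
    sleep 1
done
```

Because the log name includes the job id, several jobs can monitor the same node without clobbering each other's output.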

=== Single Program Monitoring ===

/proc/PID/smaps provides a detailed view of the memory being used by a specific process. You can parse it yourself (or use some previously written Perl code, such as the Linux::Smaps module).
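
For instance, the resident set size can be totalled across all of a process's mappings with a single awk pass over smaps (the values in the file are in kB; this sketch uses the current shell's own PID, so substitute the PID you actually care about):

```shell
# Sum the Rss lines of /proc/PID/smaps for one process.
pid=$$   # the current shell, as an example; replace with the PID of interest
awk '/^Rss:/ { sum += $2 } END { printf "%d kB\n", sum }' "/proc/$pid/smaps"
```

Note that you can normally only read smaps for your own processes.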

Files: memmon2, memlog2, leakmem.

memory_use_monitoring.1398280793.txt.gz · Last modified: 2014/04/23 15:19 by root