====== Memory Use Monitoring ======

==== From the Command Line (on the head node) ====

From the head node you can use the **sstat** command to check on memory use (and other details) of your jobs.
+ | |||
+ | < | ||
+ | sstat --format=JobID, | ||
+ | </ | ||
+ | |||
+ | The " | ||
+ | |||
+ | MaxRSS is the maximum amount of memory your job used (it is given for a specific node if your job is running across multiple nodes - not common with the software generally in use on our cluster). | ||
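
If you do not have the job ID at hand, the standard SLURM **squeue** command will list your running jobs with their IDs:

<code>
# List your own jobs; the job ID is in the first column
squeue -u $USER
</code>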
+ | |||
+ | The **sstat** command also works on the nodes themselves. So, to record the maximum amount of memory your job used, you could put the **sstat** command above as the last line of the script that you submit. You can get the job id required using the SLURM_JOB_ID environment variable. | ||
+ | |||
+ | < | ||
+ | #!/bin/bash | ||
+ | |||
+ | ... Commands needed to do your processing here ... | ||
+ | |||
+ | sstat --format=JobID, | ||
+ | </ | ||
+ | |||
+ | The output of the sstat command will then be written into the **slurm-NNNN.out** output file of your job. | ||
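
For example, assuming your script is called myjob.sh (the name is just for illustration):

<code>
sbatch myjob.sh      # prints something like: Submitted batch job 1234
# ... after the job finishes ...
cat slurm-1234.out   # the sstat output appears at the end
</code>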
+ | |||
+ | Without the format option: | ||
+ | |||
+ | < | ||
+ | sstat 1234.batch | ||
+ | </ | ||
+ | |||
+ | will give you a lot of details about your job including memory and CPU usage. | ||
+ | |||
+ | ==== For Completed Jobs ==== | ||
+ | |||
+ | You can look up details of jobs that you ran in the past using the **sacct** command. For instance: | ||
+ | |||
+ | < | ||
+ | sacct -j 1031100 -o " | ||
+ | </ | ||
+ | |||
+ | would report on job number 103110, giving the jobname, maximum memory used, and elapsed time that the job took to run. | ||
+ | |||
+ | You will likely have a record of your job numbers in the names of the slurm-NNNNNN.out files that record the output of your jobs. | ||
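
If you no longer have the job number, **sacct** can also list all of your jobs since a given date (the date below is just an example):

<code>
sacct -S 2021-02-01 -o "JobID,JobName,MaxRSS,Elapsed"
</code>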
+ | |||
+ | ==== From the Node ==== | ||
+ | |||
+ | === Whole Node Monitoring === | ||

Suppose you have a program to run and you expect it to use a lot of memory, but are unsure of exactly how much.
  * Run it on a node reserved exclusively for you.
  * Use the **--exclusive** option to get a whole node.
  * Use **sstat** as described above.
  * Run a background program to monitor memory use on that node (a sketch follows below).
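
The full monitoring script is not reproduced here, but a minimal sketch of such a background monitor could look like the following (the log file name and the 10-second interval are illustrative choices):

<code>
#!/bin/bash
# Record overall node memory use every 10 seconds.
# SLURM_JOB_ID gives the log file a unique name per job.
LOG=memlog-${SLURM_JOB_ID}.txt
while true; do
    date >> $LOG
    free -m >> $LOG
    sleep 10
done
</code>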

For testing, the **leakmem** program allocates 1MB of RAM every second (for up to 10000 seconds).
+ | |||
+ | We use the SLURM_JOB_ID environment variable to get a unique name for the log file. | ||
+ | |||
+ | === Single Program Monitoring === | ||
+ | |||
+ | / | ||
+ | |||
+ | Files: memmon2, memlog2, leakmem. | ||
+ | |||