memory_use_monitoring [2014/11/04 17:40] root
memory_use_monitoring [2021/02/26 12:48] (current) root
====== Memory Use Monitoring ======
==== From the Command Line (on the head node) ====

From the head node you can use the **sstat** command to check on memory use (and other details) of your jobs.
<code>
sstat --format=JobID,MaxRSS -j NNNN.batch
</code>

The ".batch" suffix on the job ID selects the batch step of the job (the job script itself); sstat reports the statistics for that step.

MaxRSS is the maximum amount of memory your job used (it is given for a specific node if your job is running across multiple nodes - not common with the software generally in use on our cluster).
+ | |||
+ | The **sstat** command also works on the nodes themselves. So, to record the maximum amount of memory your job used, you could put the **sstat** command above as the last line of the script that you submit. You can get the job id required using the SLURM_JOB_ID environment variable. | ||

<code>
#!/bin/bash

... Commands needed to do your processing here ...

sstat --format=JobID,MaxRSS -j ${SLURM_JOB_ID}.batch
</code>

The output of the sstat command will then be written into the **slurm-NNNN.out** output file of your job.

Without the format option:

<code>
sstat 1234.batch
</code>

will give you a lot of details about your job including memory and CPU usage.

==== For Completed Jobs ====

You can look up details of jobs that you ran in the past using the **sacct** command. For instance:

<code>
sacct -j 1031100 -o "JobName,MaxRSS,Elapsed"
</code>

would report on job number 1031100, giving the jobname, maximum memory used, and elapsed time that the job took to run.

You will likely have a record of your job numbers in the names of the slurm-NNNNNN.out files that record the output of your jobs.
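
The **sacct** command can also select jobs by time window rather than by job number. A sketch (the dates and field list here are illustrative; **-S** and **-E** set the start and end of the window):

<code>
sacct -S 2021-02-01 -E 2021-02-26 -o "JobID,JobName,MaxRSS,Elapsed"
</code>

This lists every job you ran in that period, which is handy when you no longer have the slurm-NNNNNN.out file to hand.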

==== From the Node ====

=== Whole Node Monitoring ===
Suppose you have a program to run and you expect it to use a lot of memory, but are unsure of exactly how much.
  * Use the **--exclusive** option to get a whole node.
  * Use **sstat** as described above.
  * Run a background program to monitor memory use on that node.
We use the SLURM_JOB_ID environment variable to get a unique name for the log file.
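
Putting the points above together, a submit script along these lines could be used; the 30-second interval, the log file name, and **my_program** are placeholders to adapt, not a fixed recipe:

<code>
#!/bin/bash
#SBATCH --exclusive

# Sample whole-node memory use every 30 seconds in the background;
# SLURM_JOB_ID gives the log file a unique name per job.
LOG="memlog-${SLURM_JOB_ID}.txt"
while true; do
    date    >> "$LOG"
    free -m >> "$LOG"
    sleep 30
done &
MONITOR_PID=$!

./my_program    # the commands needed to do your processing go here

kill "$MONITOR_PID"
</code>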
=== Single Program Monitoring ===
/ | / |