Displaying Computing Resources
As stated above, computing resources are nodes, CPUs, memory, and generic resources like GPUs. The resources of each compute node can be seen by running the scontrol show node command. The characteristics of each partition can be seen by running the scontrol show partition command. Finally, a load summary report for each partition can be seen by running sinfo.
To show a summary of cluster resources on a per partition basis:
[user@gl-login1 ~]$ sinfo
PARTITION  AVAIL  TIMELIMIT    NODES  STATE  NODELIST
standard   up     14-00:00:00  24     idle   gl31[60-83]
gpu        up     14-00:00:00  2      idle   gl10[18-19]
largemem   up     14-00:00:00  3      idle   gl000[0-3]
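To show a summary of cluster resources on a per-node basis:
[user@gl-login1 ~]$ sstate
--------------------------------------------------------------------------------------------------
Node    AllocCPU  TotalCPU  PercentUsedCPU  CPULoad  AllocMem  TotalMem  PercentUsedMem  NodeState
--------------------------------------------------------------------------------------------------
gl3160  0         36        0.00            0.03     0         192000    0.00            IDLE
gl3160  0         36        0.00            0.04     0         192000    0.00            IDLE
...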
To show associations for the current user:
[user@gl-login1 ~]$ sacctmgr show assoc user=$USER
Cluster     Account   User  Partition  ...
-------------------------------------------
greatlakes  support   user             1
greatlakes  hpcstaff  user             1
In this example, the user “user” can submit workloads to the accounts support and hpcstaff on the Great Lakes cluster.
Job Statistics and Accounting
The sreport command provides aggregated usage reports by user and account over a specified period. Examples:
By user: sreport -T billing cluster AccountUtilizationByUser Start=2017-01-01 End=2017-12-31
By account: sreport -T billing cluster UserUtilizationByAccount Start=2017-01-01 End=2017-12-31
For all of the sreport options see the sreport man page.
Time Remaining in an Application
If a running application overruns its wall clock limit, the job is terminated and any unsaved work could be lost. To prevent such an outcome, an application has two means of discovering how much time remains in its allocation.
The first means is to use the sbatch --signal=<sig_num>[@<sig_time>] option to request that a signal (such as USR1 or USR2) be sent sig_time seconds before the allocation expires. The application must register a signal handler for the requested signal in order to receive it. The handler then takes the necessary steps to write a checkpoint file and terminate gracefully.
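The signal-handler approach can be sketched as follows. This is a minimal Python illustration under stated assumptions, not a definitive implementation: it assumes the job was submitted with something like sbatch --signal=USR1@300 and that the signal is actually delivered to the Python process (for instance because it runs as a job step). The names run, write_checkpoint, and checkpoint.json are hypothetical placeholders for the application's own logic.

```python
import json
import os
import signal

# Set by the handler; the main loop checks it between units of work.
stop_requested = False


def handle_usr1(signum, frame):
    """Record that the scheduler warned us the allocation is about to expire."""
    global stop_requested
    stop_requested = True


def write_checkpoint(path, state):
    """Hypothetical checkpoint writer: write to a temp file, then rename it."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp_path, path)  # rename so a half-written file is never left behind


def run(total_steps, checkpoint_path="checkpoint.json"):
    """Run up to total_steps units of work, stopping early once USR1 has arrived."""
    signal.signal(signal.SIGUSR1, handle_usr1)  # register the handler first
    state = {"step": 0}
    while state["step"] < total_steps and not stop_requested:
        state["step"] += 1  # placeholder for one unit of real work
    write_checkpoint(checkpoint_path, state)
    return state
```

In this sketch the handler only sets a flag, and the checkpoint is written from the main loop. That keeps the handler short and avoids doing file I/O in the middle of an interrupted computation.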
The second means is for the application to issue a library call to retrieve its remaining time periodically. When the library call returns a remaining time below a certain threshold, the application can take the necessary steps to write a checkpoint file and terminate gracefully.
Slurm offers the slurm_get_rem_time() library call that returns the time remaining. On some systems, the yogrt library (man yogrt) is also available to provide the time remaining.
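The polling approach can be sketched in the same spirit. The Python example below is only an illustration: get_remaining_seconds is a hypothetical callable standing in for whatever the application uses to query its remaining time (for example, a wrapper around slurm_get_rem_time()), and run_with_time_checks, process, write_checkpoint, threshold_seconds, and check_every are likewise names invented for this sketch.

```python
def run_with_time_checks(get_remaining_seconds, work_items, process,
                         write_checkpoint, threshold_seconds=300,
                         check_every=10):
    """Process work_items, checkpointing and stopping when time runs short.

    Returns (completed, finished): the number of items processed and whether
    the whole list was finished before the threshold was reached.
    """
    completed = 0
    for index, item in enumerate(work_items):
        # Query the remaining time only every check_every items,
        # so that the check itself adds little overhead.
        if index % check_every == 0:
            if get_remaining_seconds() < threshold_seconds:
                write_checkpoint(completed)  # save progress, then stop gracefully
                return completed, False
        process(item)
        completed += 1
    return completed, True
```

The threshold should be comfortably larger than the time needed to write a checkpoint plus the time taken by check_every units of work. Otherwise the allocation could expire between two checks.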