Advanced Topics - Slurm - Compute Resources and Accounting

Displaying Computing Resources

As stated above, computing resources are nodes, CPUs, memory, and generic resources like GPUs.  The resources of each compute node can be seen by running the scontrol show node command.  The characteristics of each partition can be seen by running the scontrol show partition command.  Finally, a load summary report for each partition can be seen by running sinfo.

To show a summary of cluster resources on a per-partition basis, followed by per-node CPU and memory usage:

[user@gl-login1 ~]$ sinfo
PARTITION     AVAIL    TIMELIMIT    NODES STATE   NODELIST
standard      up       14-00:00:00  24    idle    gl31[60-83]
gpu           up       14-00:00:00  2     idle    gl10[18-19]
largemem      up       14-00:00:00  3     idle    gl000[0-3]
[user@gl-login1 ~]$ sstate
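---------------------------------------------------------------------------------------------
Node    AllocCPU TotalCPU PercentUsedCPU  CPULoad AllocMem TotalMem PercentUsedMem NodeState
---------------------------------------------------------------------------------------------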
gl3160  0        36       0.00            0.03    0        192000   0.00           IDLE
gl3160  0        36       0.00            0.04    0        192000   0.00           IDLE
...
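The sinfo summary can also be post-processed in a script. The following is a minimal sketch, assuming the default sinfo column order shown above (PARTITION, AVAIL, TIMELIMIT, NODES, STATE, NODELIST); the function name nodes_by_partition is a hypothetical helper, not part of Slurm.

```shell
#!/bin/sh
# Sum node counts per partition and state.
# Input (stdin): lines of the form
#   PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
# Output: one "partition state count" line per combination, sorted.
nodes_by_partition() {
    awk '{
        sub(/\*$/, "", $1)          # sinfo marks the default partition with a trailing "*"
        count[$1 " " $5] += $4      # key on partition + state, add the NODES column
    }
    END { for (k in count) print k, count[k] }' | sort
}
```

On a login node this could be used as `sinfo -h | nodes_by_partition`, where -h suppresses the header line.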

To show associations for the current user, run sacctmgr show assoc. In this example, the user “user” can submit workloads under the accounts support and hpcstaff on the Great Lakes cluster:

[user@gl-login1 ~]$ sacctmgr show assoc user=$USER

Cluster      Account  User  Partition  ...
---------------------------------------
greatlakes   support  user  1    
greatlakes   hpcstaff user  1
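For scripting, sacctmgr can emit pipe-delimited output without a header (the -n and -P options) restricted to chosen fields (format=). The sketch below assumes that output form with the fields cluster, account, user in that order; accounts_on_cluster is a hypothetical helper name.

```shell
#!/bin/sh
# List the distinct accounts a user is associated with on one cluster.
# Input (stdin): lines of the form  cluster|account|user
# Argument $1:   cluster name to filter on
accounts_on_cluster() {
    awk -F'|' -v cluster="$1" '$1 == cluster { print $2 }' | sort -u
}
```

A possible invocation is `sacctmgr -nP show assoc user=$USER format=cluster,account,user | accounts_on_cluster greatlakes`.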

Job Statistics and Accounting

The sreport command provides aggregated usage reports by user and account over a specified period. Examples:

By user: sreport -T billing cluster AccountUtilizationByUser Start=2017-01-01 End=2017-12-31

By account: sreport -T billing cluster UserUtilizationByAccount Start=2017-01-01 End=2017-12-31

For the full list of sreport options, see the sreport man page (man sreport).

Time Remaining in an Application

If a running application overruns its wall clock limit, all its work could be lost. To prevent such an outcome, an application has two means of discovering how much time remains in its allocation.

The first means is to use the sbatch --signal=<sig_num>[@<sig_time>] option to request a signal (such as USR1 or USR2) sig_time seconds before the allocation expires. The application must register a signal handler for the requested signal in order to receive it. The handler takes the necessary steps to write a checkpoint file and terminate gracefully.
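A minimal sketch of the signal approach in a batch script follows. It makes several assumptions: the B: prefix is used so that the signal goes to the batch shell itself rather than to the job steps, 120 seconds is an arbitrary lead time, and run_step and write_checkpoint are hypothetical placeholders for real application logic.

```shell
#!/bin/sh
#SBATCH --signal=B:USR1@120   # assumption: B: targets the batch shell; 120 s is illustrative

checkpoint_requested=0

# Signal handler: only record that the warning signal arrived.
on_usr1() {
    checkpoint_requested=1
}
trap on_usr1 USR1

# Hypothetical placeholders for real application logic.
run_step() { :; }
write_checkpoint() { echo "checkpoint written before step $1"; }

# Run up to $1 steps, stopping to checkpoint once the signal has been seen.
run_until_signalled() {
    step=1
    while [ "$step" -le "$1" ]; do
        if [ "$checkpoint_requested" -eq 1 ]; then
            write_checkpoint "$step"
            return 0
        fi
        run_step "$step"
        step=$((step + 1))
    done
}
```

The handler only sets a flag, and the loop checks it between steps. Note that a shell runs a trap only after the current foreground command finishes, so a long-running step is usually started in the background and waited on with wait, which returns as soon as the signal arrives.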

The second means is for the application to issue a library call to retrieve its remaining time periodically. When the library call returns a remaining time below a certain threshold, the application can take the necessary steps to write a checkpoint file and terminate gracefully.

Slurm offers the slurm_get_rem_time() library call that returns the time remaining. On some systems, the yogrt library (man yogrt) is also available to provide the time remaining.
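A shell script cannot call slurm_get_rem_time() directly, so the sketch below swaps in a different technique for the same polling idea: it asks squeue for the job's time left (the %L format field, printed as days-hours:minutes:seconds) and converts it to seconds. The helper names slurm_time_to_seconds and job_seconds_left are hypothetical, and the code assumes squeue is on the PATH and SLURM_JOB_ID is set, as it is inside a running job.

```shell
#!/bin/sh
# Convert a Slurm time string ([D-][HH:]MM:SS) to seconds.
# Prints nothing and returns non-zero for other values such as UNLIMITED.
slurm_time_to_seconds() {
    printf '%s\n' "$1" | awk '
    {
        if ($0 !~ /^([0-9]+-)?([0-9]+:)?[0-9]+:[0-9]+$/) exit 1
        days = 0; rest = $0
        if (index(rest, "-") > 0) {      # optional leading "D-"
            split(rest, d, "-"); days = d[1]; rest = d[2]
        }
        n = split(rest, t, ":")
        if (n == 3) secs = t[1] * 3600 + t[2] * 60 + t[3]   # HH:MM:SS
        else        secs = t[1] * 60 + t[2]                 # MM:SS
        print days * 86400 + secs
    }'
}

# Seconds left in the current job, as reported by squeue.
job_seconds_left() {
    slurm_time_to_seconds "$(squeue -h -j "$SLURM_JOB_ID" -o %L)"
}
```

A work loop could call job_seconds_left between steps and write a checkpoint once the value drops below a chosen threshold. Each call queries the scheduler, so it is better to poll every few minutes than on every iteration.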