Slurm Power Saving Guide

Slurm provides an integrated power saving mechanism for powering down idle nodes. Nodes that remain idle for a configurable period of time can be placed in a power saving mode, which can reduce power consumption or fully power down the node. The nodes will be restored to normal operation once work is assigned to them. For example, power saving can be accomplished using a cpufreq governor that can change CPU frequency and voltage (note that the cpufreq driver must be enabled in the Linux kernel configuration). Of particular note, Slurm can power nodes up or down at a configurable rate to prevent rapid changes in power demand. For example, starting a 1000-node job on an idle cluster could cause an instantaneous surge in power demand of multiple megawatts if Slurm did not increase power demand gradually.
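To make the effect of rate limiting concrete, the sketch below estimates how many minutes it takes to bring a group of nodes back at a given per-minute rate (the ResumeRate parameter described under Configuration). It is plain arithmetic under the assumption that nodes are resumed in batches of that size each minute, not Slurm's own scheduling code, and minutes_to_resume is a hypothetical helper name.

```shell
#!/bin/bash
# Rough estimate only: assumes nodes are resumed in batches of RATE per minute.
# minutes_to_resume is a hypothetical helper, not a Slurm command.
minutes_to_resume() {
    local nodes=$1 rate=$2
    if [ "$rate" -eq 0 ]; then
        echo 0            # a rate of zero means no limit is imposed
    else
        echo $(( (nodes + rate - 1) / rate ))   # round up to whole minutes
    fi
}

minutes_to_resume 1000 300   # prints 4
```

With the default ResumeRate of 300, the 1000-node job mentioned above would therefore be spread over roughly four minutes instead of arriving as a single surge.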

Configuration

A great deal of flexibility is offered in terms of when and how idle nodes are put into or removed from power save mode. Note that the Slurm control daemon, slurmctld, must be restarted to initially enable power saving mode. Changes in the configuration parameters (e.g. SuspendTime) will take effect after modifying the slurm.conf configuration file and executing "scontrol reconfig". The following configuration parameters are available:

SuspendTime: Nodes become eligible for power saving mode after being idle or down for this number of seconds. For efficient system utilization, it is recommended that the value of SuspendTime be at least as large as the sum of SuspendTimeout plus ResumeTimeout. A negative number disables power saving mode. The default value is -1 (disabled).
SuspendRate: Maximum number of nodes to be placed into power saving mode per minute. A value of zero results in no limits being imposed. The default value is 60. Use this to prevent rapid drops in power consumption.
ResumeRate: Maximum number of nodes to be removed from power saving mode per minute. A value of zero results in no limits being imposed. The default value is 300. Use this to prevent rapid increases in power consumption.
SuspendProgram: Program to be executed to place nodes into power saving mode. The program executes as SlurmUser (as configured in slurm.conf). The argument to the program will be the names of nodes to be placed into power saving mode (using Slurm's hostlist expression format).
ResumeProgram: Program to be executed to remove nodes from power saving mode. The program executes as SlurmUser (as configured in slurm.conf). The argument to the program will be the names of nodes to be removed from power saving mode (using Slurm's hostlist expression format). This program may use the scontrol show node command to ensure that a node has booted and the slurmd daemon has started. If the slurmd daemon fails to respond within the configured SlurmdTimeout value with an updated BootTime, the node will be placed in a DOWN state and the job requesting the node will be requeued. If the node isn't actually rebooted (e.g. when multiple-slurmd is configured), starting slurmd with the "-b" option might be useful. For reasons of reliability, ResumeProgram may execute more than once for a node when the slurmctld daemon crashes and is restarted.
SuspendTimeout: Maximum time permitted (in seconds) between when a node suspend request is issued and when the node shutdown is complete. At that time the node must be ready for a resume request to be issued as needed for new workload. The default value is 30 seconds.
ResumeTimeout: Maximum time permitted (in seconds) between when a node resume request is issued and when the node is actually available for use. Nodes which fail to respond in this time frame will be marked DOWN and the jobs scheduled on the node requeued. Nodes which reboot after this time frame will be marked DOWN with a reason of "Node unexpectedly rebooted." The default value is 60 seconds.
SuspendExcNodes: List of nodes to never place in power saving mode. Use Slurm's hostlist expression format. By default, no nodes are excluded.
SuspendExcParts: List of partitions with nodes to never place in power saving mode. Multiple partitions may be specified using a comma separator. By default, no nodes are excluded.
BatchStartTimeout: Specifies how long to wait after a batch job start request is issued before we expect the batch job to be running on the compute node. Depending upon how nodes are returned to service, this value may need to be increased above its default value of 10 seconds.
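
As an illustration of how these parameters fit together, the sketch below holds a slurm.conf-style excerpt in a shell variable and checks the recommendation that SuspendTime be at least SuspendTimeout plus ResumeTimeout. The values, program paths and node names are assumptions chosen for the example (the timeouts are the defaults listed above), and get_param is a hypothetical helper.

```shell
#!/bin/bash
# Sketch: verify SuspendTime >= SuspendTimeout + ResumeTimeout.
# The excerpt, paths and node names below are illustrative assumptions.
conf_excerpt='
SuspendTime=600
SuspendRate=60
ResumeRate=300
SuspendTimeout=30
ResumeTimeout=60
SuspendProgram=/usr/local/sbin/slurm_suspend.sh
ResumeProgram=/usr/local/sbin/slurm_resume.sh
SuspendExcNodes=login[1-2]
'

# get_param NAME: print the value of NAME from the excerpt (hypothetical helper).
get_param() {
    printf '%s\n' "$conf_excerpt" | sed -n "s/^$1=//p" | head -n 1
}

suspend_time=$(get_param SuspendTime)
minimum=$(( $(get_param SuspendTimeout) + $(get_param ResumeTimeout) ))

if [ "$suspend_time" -ge "$minimum" ]; then
    echo "SuspendTime=$suspend_time is at least $minimum: OK"
else
    echo "SuspendTime=$suspend_time is below the recommended minimum of $minimum"
fi
```

With these values the check reports that 600 seconds comfortably exceeds the 90 second minimum, so a node is not suspended again before a suspend and resume cycle could complete.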

Note that SuspendProgram and ResumeProgram execute as SlurmUser on the node where the slurmctld daemon runs (primary and backup server nodes). Use of sudo may be required for SlurmUser to power down and restart nodes. If you need to convert Slurm's hostlist expression into individual node names, the scontrol show hostnames command may prove useful. The commands used to boot or shut down nodes will depend upon your cluster management tools.

Note that SuspendProgram and ResumeProgram are not subject to any time limits. They should perform the required action, ideally verify that it succeeded (e.g. that the node has booted and the slurmd daemon has started, so the node once again responds to slurmctld), and then terminate. Long running programs will be logged by slurmctld, but not aborted.
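
Since no time limit is enforced, a ResumeProgram that verifies its work should bound its own waiting. The following is a minimal sketch of such a bounded wait. The names wait_for_node and node_is_up are hypothetical, and the ping-based probe is only a placeholder assumption: it shows that the host is reachable, not that slurmd is running, so a site would normally substitute its own check.

```shell
#!/bin/bash
# Sketch: wait for a node to come back before ResumeProgram exits.
# node_is_up is a placeholder probe (assumption): a ping only shows that the
# host answers on the network, not that slurmd has started. Replace it with a
# check suited to your site.
node_is_up() {
    ping -c 1 -W 1 "$1" >/dev/null 2>&1
}

# wait_for_node HOST TIMEOUT_SECONDS [INTERVAL_SECONDS]
# Returns 0 once node_is_up succeeds, 1 if TIMEOUT_SECONDS pass first.
wait_for_node() {
    local host=$1 timeout=$2 interval=${3:-5}
    local deadline=$(( $(date +%s) + timeout ))
    while ! node_is_up "$host"; do
        if [ "$(date +%s)" -ge "$deadline" ]; then
            echo "$(date) $host not up after ${timeout}s" >&2
            return 1
        fi
        sleep "$interval"
    done
    return 0
}
```

A ResumeProgram could call, for example, wait_for_node "$host" 60 after issuing its boot command, choosing a timeout no longer than the configured ResumeTimeout so the script terminates on its own.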

Also note that the standard output and standard error of the suspend and resume programs are not logged. If logging is desired, it should be added to the scripts.

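#!/bin/bash
# Example SuspendProgram
# $1 is the hostlist expression naming the nodes to power down.
echo "$(date) Suspend invoked $0 $*" >>/var/log/power_save.log
# Expand the hostlist expression into one node name per line.
hosts=$(scontrol show hostnames "$1")
for host in $hosts
do
   sudo node_shutdown "$host"
done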

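#!/bin/bash
# Example ResumeProgram
# $1 is the hostlist expression naming the nodes to power up.
echo "$(date) Resume invoked $0 $*" >>/var/log/power_save.log
# Expand the hostlist expression into one node name per line.
hosts=$(scontrol show hostnames "$1")
for host in $hosts
do
   sudo node_startup "$host"
done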
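
The example programs record only the fact that they were invoked. To also capture whatever the shutdown or startup commands print, one option is to redirect the script's own output at the start. This is a minimal sketch: start_logging is a hypothetical helper, and the default path is simply the log file used in the examples.

```shell
#!/bin/bash
# Sketch: capture stdout and stderr of a SuspendProgram or ResumeProgram.
# start_logging is a hypothetical helper; the default path matches the log
# file used by the example scripts.
start_logging() {
    local log_file=${1:-/var/log/power_save.log}
    # From here on, everything the calling shell prints is appended to the log.
    exec >>"$log_file" 2>&1
}
```

Calling start_logging as the first statement of a script sends all later output, including error messages from commands such as sudo, to the log file. SlurmUser needs write permission on that file.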

Subject to the various rates, limits and exclusions, the power save code follows this logic:

Identify nodes which have been idle for at least SuspendTime.
Execute SuspendProgram with an argument of the idle node names.
Identify the nodes which are in power save mode (a flag in the node's state field), but have been allocated to jobs.
Execute ResumeProgram with an argument of the allocated node names.
Once the slurmd responds, initiate the job and/or job steps allocated to it.
If the slurmd fails to respond within the value configured for SlurmdTimeout, the node will be marked DOWN and the job requeued if possible.
Repeat indefinitely.
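
The first step and the SuspendRate cap can be pictured with a small filter. This is only an illustration of the selection logic described above, not Slurm's implementation (slurmctld does this internally). The helper name select_idle_nodes and the "name idle_seconds" input format are assumptions made for the example.

```shell
#!/bin/bash
# Illustration only: pick nodes idle for at least SUSPEND_TIME seconds,
# stopping after MAX_NODES names (0 means no limit).
# Input on stdin: one "node_name idle_seconds" pair per line (assumed format).
select_idle_nodes() {
    awk -v min_idle="$1" -v max_nodes="$2" '
        $2 >= min_idle && (max_nodes == 0 || picked < max_nodes) {
            print $1
            picked++
        }
    '
}

printf 'n1 700\nn2 100\nn3 900\nn4 650\n' | select_idle_nodes 600 2
```

With a threshold of 600 seconds and a cap of two nodes, the example prints n1 and n3. Node n4 also qualifies but is left for a later pass, which is how a rate limit spreads suspensions over time.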

The slurmctld daemon will periodically (every 10 minutes) log how many nodes are in power save mode using messages of this sort:

[May 02 15:31:25] Power save mode 0 nodes
...
[May 02 15:41:26] Power save mode 10 nodes
...
[May 02 15:51:28] Power save mode 22 nodes

Using these logs you can easily see the effect of Slurm's power saving support. You can also configure Slurm with programs that perform no action as SuspendProgram and ResumeProgram to assess the potential impact of power saving mode before enabling it.
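
To follow the trend over time, the periodic messages can be reduced to a timestamp and a node count. The sketch below assumes log lines in exactly the form shown above. The actual slurmctld log format may differ between versions and configurations, and power_save_counts is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: extract "timestamp count" pairs from power save log messages.
# Assumes lines of the form "[May 02 15:31:25] Power save mode 0 nodes".
power_save_counts() {
    sed -n 's/^\[\(.*\)] Power save mode \([0-9][0-9]*\) nodes.*$/\1 \2/p'
}

printf '%s\n' \
    '[May 02 15:31:25] Power save mode 0 nodes' \
    '...' \
    '[May 02 15:41:26] Power save mode 10 nodes' | power_save_counts
```

In practice the function would read the slurmctld log file on standard input. Lines that do not match the pattern are ignored.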

Use of Allocations

A resource allocation request will be granted as soon as resources are selected for use, possibly before the nodes are all available for use. The launching of job steps will be delayed until the required nodes have been restored to service: the launch prints a warning about waiting for nodes to become available and periodically retries until they are available.

In the case of an sbatch command, the batch program will start when node zero of the allocation is ready for use and pre-processing can be performed as needed before using srun to launch job steps. The sbatch --wait-all-nodes=<value> command can be used to override this behavior on a per-job basis and a system-wide default can be set with the SchedulerParameters=sbatch_wait_nodes option.

In the case of the salloc command, once the allocation is made a new shell will be created on the login node. The salloc --wait-all-nodes=<value> command can be used to override this behavior on a per-job basis and a system-wide default can be set with the SchedulerParameters=salloc_wait_nodes option.
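
As a sketch of the per-job override, the batch script below asks Slurm not to start it until every allocated node is ready, by setting --wait-all-nodes to 1 as an #SBATCH directive. The node count and the hostname step are placeholders, and run_steps is a hypothetical wrapper used only to keep the example self-contained.

```shell
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --wait-all-nodes=1
# With --wait-all-nodes=1 the script does not begin until all nodes are ready;
# with 0 it begins as soon as the allocation is made.

# run_steps is a hypothetical wrapper around the job's srun commands.
run_steps() {
    srun hostname
}

# Only launch steps when running inside a Slurm job.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    run_steps
fi
```

The same option can be given on the command line instead, for example sbatch --wait-all-nodes=0 followed by the script name, to start pre-processing while the remaining nodes are still booting.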

Fault Tolerance

If the slurmctld daemon is terminated gracefully, it will wait up to SuspendTimeout or ResumeTimeout (whichever is larger) for any spawned SuspendProgram or ResumeProgram to terminate before the daemon terminates. If the spawned program does not terminate within that time period, the event will be logged and slurmctld will exit in order to permit another slurmctld daemon to be initiated. Synchronization problems could also occur when the slurmctld daemon crashes (a rare event) and is restarted.

In either event, the newly initiated slurmctld daemon (or the backup server) will recover saved node state information that may not accurately describe the actual node state. In the case of a failed SuspendProgram, the negative impact is limited to increased power consumption, so no special action is currently taken to execute SuspendProgram multiple times in order to ensure the node is in a reduced power mode. The case of a failed ResumeProgram is more serious in that the node could be placed into a DOWN state and/or jobs could fail. In order to minimize this risk, when the slurmctld daemon is started and a node which should be allocated to a job fails to respond, the ResumeProgram will be executed (possibly for a second time).
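
Because ResumeProgram may run a second time for the same node, it helps if the program is harmless when the node is already up. The sketch below shows one way to arrange that. The probe node_is_up is a hypothetical, ping-based placeholder (reachability only, not proof that slurmd is running), and node_startup is the placeholder boot command from the example scripts earlier in this guide.

```shell
#!/bin/bash
# Sketch: a ResumeProgram body that is safe to run twice for the same node.
# node_is_up is a placeholder probe (assumption); node_startup is the
# placeholder boot command used by the example scripts in this guide.
node_is_up() {
    ping -c 1 -W 1 "$1" >/dev/null 2>&1
}

resume_node() {
    local host=$1
    if node_is_up "$host"; then
        echo "$(date) $host already up, nothing to do"
        return 0
    fi
    echo "$(date) starting $host"
    sudo node_startup "$host"
}

# Slurm passes the hostlist expression as the first argument.
if [ $# -ge 1 ]; then
    for host in $(scontrol show hostnames "$1"); do
        resume_node "$host"
    done
fi
```

Skipping nodes that already respond avoids rebooting a node that came up just before slurmctld restarted.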

Booting Different Images

Slurm's PrologSlurmctld configuration parameter can identify a program to boot different operating system images for each job based upon its constraint field (or possibly comment). If you want ResumeProgram to boot various images according to job specifications, it will need to be a fairly sophisticated program and perform the following actions:

Determine which jobs are associated with the nodes to be booted
Determine which image is required for each job and
Boot the appropriate image for each node
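
The three actions could be sketched as follows. This is a rough outline under several assumptions rather than a working provisioning tool: image_for_features and node_boot_image are hypothetical names, the feature-to-image mapping is invented for illustration, and the job's constraint is read with squeue's %f format field for the first job reported on each node.

```shell
#!/bin/bash
# Sketch only. image_for_features and node_boot_image are hypothetical names,
# and the feature-to-image mapping is invented for illustration.

# Map a job's constraint string to an image name.
image_for_features() {
    case "$1" in
        *rhel*)   echo "rhel-image" ;;
        *ubuntu*) echo "ubuntu-image" ;;
        *)        echo "default-image" ;;
    esac
}

# resume_with_images HOSTLIST: boot each node with the image its job asks for.
resume_with_images() {
    local host features
    for host in $(scontrol show hostnames "$1"); do
        # Features (constraints) requested by the first job found on this node.
        features=$(squeue -h -w "$host" -o '%f' | head -n 1)
        node_boot_image "$host" "$(image_for_features "$features")"
    done
}

# Slurm passes the hostlist expression as the first argument.
if [ $# -ge 1 ]; then
    resume_with_images "$1"
fi
```

A real program would also need to handle nodes shared by several jobs and constraints that name more than one feature, which is part of why the text calls for a fairly sophisticated program.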

Last modified 11 November 2019