Slurm Power Management Guide

Slurm provides an integrated power management system for power capping. The mode of operation is to take the configured power cap for the system and distribute it across the compute nodes under Slurm's control. Initially that power is distributed evenly across all compute nodes. Slurm then monitors actual power consumption and redistributes power as appropriate. Specifically, Slurm lowers the power caps on nodes using less than their cap and redistributes that power across the other nodes. The thresholds at which a node's power cap is raised or lowered are configurable, as is the rate at which the power cap changes. In addition, starting a job on a node immediately resets the node's power cap to a higher level. Note that this functionality is distinct from Slurm's ability to power down idle nodes.

Configuration

The following configuration parameters are available:

  • DebugFlags=power: Enable plugin-specific logging messages.
  • PowerParameters: Defines power management behavior. Changes to this value take effect when the Slurm daemons are reconfigured. Currently valid options are:
    • balance_interval=# - Specifies the time interval, in seconds, between attempts to balance power caps across the nodes. This also controls the frequency at which Slurm attempts to collect current power consumption data (old data may be used until new data is available from the underlying infrastructure and values below 10 seconds are not recommended for Cray systems). The default value is 30 seconds. Supported by the power/cray_aries plugin.
    • capmc_path=/... - Specifies the absolute path of the capmc command. The default value is "/opt/cray/capmc/default/bin/capmc". Supported by the power/cray_aries plugin.
    • cap_watts=#[KW|MW] - Specifies the total power limit to be established across all compute nodes managed by Slurm. A value of 0 sets every compute node to have an unlimited cap. The default value is 0. Supported by the power/cray_aries plugin.
    • decrease_rate=# - Specifies the maximum rate of change in the power cap for a node where the actual power usage is below the power cap by an amount greater than lower_threshold (see below). Value represents a percentage of the difference between a node's minimum and maximum power consumption. The default value is 50 percent. Supported by the power/cray_aries plugin.
    • increase_rate=# - Specifies the maximum rate of change in the power cap for a node where the actual power usage is within upper_threshold (see below) of the power cap. Value represents a percentage of the difference between a node's minimum and maximum power consumption. The default value is 20 percent. Supported by the power/cray_aries plugin.
    • job_level - All compute nodes associated with every job will be assigned the same power cap. Nodes shared by multiple jobs will have a power cap different from other nodes allocated to the individual jobs. By default, this is configurable by the user for each job.
    • job_no_level - Power caps are established independently for each compute node. This disables the "--power=level" option available in the job submission commands. By default, this is configurable by the user for each job.
    • lower_threshold=# - Specify a lower power consumption threshold. If a node's current power consumption is below this percentage of its current cap, then its power cap will be reduced. The default value is 90 percent. Supported by the power/cray_aries plugin.
    • recent_job=# - If a job has started or resumed execution (from suspend) on a compute node within this number of seconds from the current time, the node's power cap will be increased to the maximum. The default value is 300 seconds. Supported by the power/cray_aries plugin.
    • set_watts=# - Specifies the power limit to be set on every compute node managed by Slurm. Every node gets this same power cap and there is no variation through time based upon actual power usage on the node. Supported by the power/cray_aries plugin.
    • upper_threshold=# - Specify an upper power consumption threshold. If a node's current power consumption is above this percentage of its current cap, then its power cap will be increased to the extent possible. A node's power cap will also be increased if a job is newly started on it. The default value is 95 percent. Supported by the power/cray_aries plugin.
  • PowerPlugin: Identifies the plugin used to manage system power consumption. Changes to this value require restarting Slurm daemons to take effect. By default, no power plugin is loaded. Currently valid options are:
    • power/cray_aries - Used for Cray XC systems with power monitoring and management functionality included as part of System Management Workstation (SMW) 7.0.UP03.
    • power/none - No power management support.
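
As an illustration, a slurm.conf fragment combining the parameters described above might look like the following. The values are examples only (including a hypothetical 2 megawatt system-wide cap) and should be tuned for the target system:

DebugFlags=power
PowerPlugin=power/cray_aries
PowerParameters=balance_interval=60,cap_watts=2MW,decrease_rate=30,increase_rate=10,lower_threshold=90,upper_threshold=95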

Note for Cray systems: The JSON-C library must be installed in order to build Slurm's power/cray_aries plugin, which must parse JSON format data. See Slurm's JSON installation information for details.

Note for Cray systems: Power management is provided for native Slurm configurations (i.e. without the ALPS resource manager).

Note for Cray systems: Use of the capmc command requires either specifying its absolute path ("/opt/cray/capmc/default/bin/capmc" by default) or loading the capmc module:

$ module load capmc

User and System Administrator Commands

Equal power caps for all nodes allocated to a job can be requested at job submission time by using the "--power=level" option with the salloc, sbatch or srun command. The system administrator can override the user option with the PowerParameters configuration parameter and the job_level or job_no_level option.
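
For example, to request a level power cap across all of a job's nodes:

$ sbatch --power=level ...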

A specific minimum and maximum CPU frequency, as well as a CPU governor, may be requested at job submission time using the "--cpu-freq" option with the salloc, sbatch or srun command. The frequency requested may be "low", "medium", "highm1" (second highest available frequency), "high" or a specific frequency (expressed as a kHz value). The governor specification may be "conservative", "ondemand", "performance" or "powersave". These values are user requests subject to system constraints. Some examples follow.

$ sbatch --cpu-freq=2400000-3000000 ...
$ salloc --cpu-freq=powersave ...
$ srun --cpu-freq=highm1 ...

The power consumption and power cap data are available for all compute nodes using either the "scontrol show node" or sview commands. Information available includes "CurrentWatts" and "CapWatts".
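
For example, to display this information for a single node (nid00001 is a placeholder node name):

$ scontrol show node nid00001

The CurrentWatts and CapWatts fields appear in the resulting output.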

Example

Initial State

In our example, assume the following configuration: a 10-node compute cluster, where each node has a minimum power consumption of 100 watts and a maximum power consumption of 200 watts, and the following values for PowerParameters: balance_interval=60, cap_watts=1800, decrease_rate=30, increase_rate=10, lower_threshold=90, upper_threshold=98. The initial state is simply based upon cap_watts divided by the number of compute nodes: 1800 watts / 10 nodes = 180 watts per node.

State in 60 Seconds

The power consumption is then examined balance_interval (60) seconds later. Assume that one of those nodes is consuming 110 watts and the others are using 180 watts. First we identify which nodes are consuming less than lower_threshold of their power cap: 90% x 180 watts = 162 watts. One node falls in this category with 110 watts of power consumption. Its power cap is reduced by half of the difference between its current power cap and its power consumption ((180 watts - 110 watts) / 2 = 35 watts), limited by decrease_rate, which is a percentage of the difference between the node's maximum and minimum power consumption ((200 watts - 100 watts) x 30% = 30 watts). So that node's power cap is reduced from 180 watts to 150 watts. Ignoring the upper_threshold parameter for now, we now have 1650 watts available to distribute to the remaining 9 compute nodes, or 183 watts per node (1650 watts / 9 nodes = 183 watts per node).

State in 120 Seconds

The power consumption is then examined balance_interval (60) seconds later. Assume that one of those nodes is still consuming 110 watts, a second node is consuming 115 watts and the other eight are using 183 watts. First we identify which nodes are consuming less than lower_threshold of their power cap. The node using 110 watts has its cap reduced by half of the difference between its current power cap and its power consumption ((150 watts - 110 watts) / 2 = 20 watts); so that node's power cap is reduced from 150 watts to 130 watts. The node consuming 115 watts has its power cap reduced by 30 watts based on decrease_rate; so that node's power cap is reduced from 183 watts to 153 watts. That leaves 1517 watts (1800 watts - 130 watts - 153 watts = 1517 watts) to be distributed over 8 nodes, or 189 watts per node.

State in 180 Seconds

The power consumption is then examined balance_interval (60) seconds later. Assume the node previously consuming 110 watts is now consuming 128 watts. Since that is over upper_threshold of its power cap (98% x 130 watts = 127 watts), its power cap is increased by increase_rate ((200 watts - 100 watts) x 10% = 10 watts), so its power cap goes from 130 watts to 140 watts. Assume the node previously consuming 115 watts has been allocated a new job. This triggers the node to be allocated the same power cap as nodes previously running at their power cap. Therefore we have 1660 watts available (1800 watts - 140 watts = 1660 watts) to be distributed over 9 nodes or 184 watts per node.
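
The balancing step used in this example can be summarized with the short Python sketch below. This is a simplified illustration, not Slurm's implementation: it uses the watt values assumed above, covers only the cap-lowering path (lower_threshold and decrease_rate), and ignores increase_rate, upper_threshold and recent_job handling.

MIN_WATTS = 100          # assumed minimum node power consumption
MAX_WATTS = 200          # assumed maximum node power consumption
CAP_WATTS = 1800         # cap_watts: total power available to all nodes
LOWER_THRESHOLD = 0.90   # lower_threshold (90%)
DECREASE_RATE = 0.30     # decrease_rate (30%)

def balance(caps, usage):
    """Return new per-node power caps after one balance_interval."""
    new_caps = {}
    reserved = 0
    # Lower the cap on any node consuming less than lower_threshold of its cap.
    for node, cap in caps.items():
        if usage[node] < LOWER_THRESHOLD * cap:
            reduction = min((cap - usage[node]) / 2,
                            DECREASE_RATE * (MAX_WATTS - MIN_WATTS))
            new_caps[node] = cap - reduction
            reserved += new_caps[node]
    # Distribute the remaining power evenly across the other nodes.
    others = [n for n in caps if n not in new_caps]
    if others:
        share = (CAP_WATTS - reserved) / len(others)
        for node in others:
            new_caps[node] = share
    return new_caps

# "State in 60 Seconds": one node uses 110 watts, the other nine use 180 watts.
caps = {node: 180 for node in range(10)}
usage = dict(caps)
usage[0] = 110
print(balance(caps, usage))   # node 0 capped at 150 watts, the rest at ~183 watts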

Notes

  • Slurm's power management plugin can be used in conjunction with the power save mode, where idle nodes are powered down and then powered back up as needed. On a Cray system, set each node's power cap to the minimum value before powering it down. Also set the default power cap of each node to the minimum value as that will be used at power up time.
  • Cray permits independent power capping for accelerators (GPUs or MICs), which is not currently used by Slurm.
  • Current default values for configuration parameters should probably be changed once we have a better understanding of the algorithm's behavior.
  • No integration of this logic with gang scheduling currently exists. It is not clear that such a configuration is practical to support, as gang scheduling time slices will typically be smaller than the power management balance_interval and synchronizing the changes may be difficult.
  • There can be situations where the capmc program gets stuck for some reason and the node remains in the IDLE*+POWER state until ResumeTimeout is reached, even though it has been rebooted or manually cleaned. In this situation the node can be brought back into service by issuing 'scontrol update nodename=xxx state=power_down', which will cancel the previous power_up request (see the example below). The capmc program must then be diagnosed and fixed.
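
For example, for a stuck node named nid00042 (a placeholder name):

$ scontrol update nodename=nid00042 state=power_down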

Last modified 7 Mar 2018