Cloud Scheduling Guide
Overview
Slurm can support a cluster that grows and shrinks on demand, typically relying upon a service such as Amazon Elastic Compute Cloud (Amazon EC2), Google Cloud Platform or Microsoft Azure for resources. These resources can be combined with an existing cluster to process excess workload (cloud bursting), or they can operate as an independent, self-contained cluster. Good responsiveness and throughput can be achieved while paying only for the resources needed.
The rest of this document describes details about Slurm's infrastructure that can be used to support cloud scheduling.
Slurm's cloud scheduling logic relies heavily upon the existing power save logic; reviewing Slurm's Power Saving Guide is strongly recommended. This logic initiates one program when nodes are required for use and another when those nodes are no longer required. For cloud scheduling, these programs need to provision resources from the cloud, notify Slurm of each node's name and network address, and later relinquish the nodes back to the cloud. Most of the Slurm changes made to support cloud scheduling deal with node addresses that can change over time.
Slurm Configuration
There are many ways to configure Slurm's use of resources. See the slurm.conf man page for more details about these options. Some general Slurm configuration parameters that are of interest include:
- CommunicationParameters=NoAddrCache
- By default, Slurm will cache a node's network address after successfully establishing a connection to the node. This option disables the cache, so Slurm will look up the node's network address each time a connection is made. This is useful, for example, in a cloud environment where node addresses come and go out of DNS.
- ResumeFailProgram
- The program that will be executed when nodes fail to resume by ResumeTimeout. The argument to the program will be the names of the failed nodes (using Slurm's hostlist expression format).
- ResumeProgram
- The program executed when a node has been allocated and should be made available for use. If the slurmd daemon fails to respond within the configured SlurmdTimeout value with an updated BootTime, the node will be placed in a DOWN state and the job requesting the node will be requeued. If the node isn't actually rebooted (i.e. when multiple-slurmd is configured), starting slurmd with the "-b" option might be useful.
- SlurmctldParameters=cloud_dns
- By default, Slurm expects that the network addresses for cloud nodes won't be known until the nodes are created and that Slurm will be notified of each node's address (e.g. scontrol update nodename=<name> nodeaddr=<addr>). Since Slurm communications rely on the node configuration found in slurm.conf, Slurm will, after waiting for all nodes to boot, tell the client command each node's IP address. However, in environments where the nodes are in DNS, this step can be avoided by configuring this option.
- SlurmctldParameters=idle_on_node_suspend
- Mark nodes as idle, regardless of current state, when suspending nodes with SuspendProgram so that nodes will be eligible to be resumed at a later time.
- SuspendExcNodes
- Nodes not subject to suspend/resume logic. This may be used to avoid suspending and resuming nodes which are not in the cloud. Alternatively, the suspend/resume programs can treat local nodes differently from nodes being provisioned from the cloud.
- SuspendProgram
- The program executed when a node is no longer required and can be relinquished to the cloud.
- SuspendTime
- The time interval that a node will be left idle or down before a request is made to relinquish it. Units are in seconds.
- TCPTimeout
- Time permitted for TCP connections to be established. This value may need to be increased from the default (2 seconds) to account for latency between your local site and machines in the cloud.
- TreeWidth
- Since the slurmd daemons are not aware of the network addresses of other nodes in the cloud, messages should be sent directly to each slurmd daemon rather than being forwarded from one compute node to another. To do so, configure TreeWidth to a value at least as large as the maximum node count. The value may not exceed 65533. A short example combining several of these options appears after this list.
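Several of the options above would appear in slurm.conf roughly as follows. This is only an illustrative fragment: the values and the ResumeFailProgram path are placeholders, and cloud_dns would be added to SlurmctldParameters only if the cloud nodes are registered in DNS.
# Illustrative values only; combine with the full sample configuration shown later.
CommunicationParameters=NoAddrCache
SlurmctldParameters=idle_on_node_suspend
ResumeFailProgram=/usr/sbin/slurm_resume_fail
TCPTimeout=10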
Some node parameters that are of interest include:
- Feature
- A node feature can be associated with resources acquired from the cloud and user jobs can specify their preference for resource use with the "--constraint" option.
- NodeName
- This is the name by which Slurm refers to the node. A name containing a numeric suffix is recommended for convenience. The NodeAddr and NodeHostname should not be set, but will be configured later using scripts.
- State
- Nodes which are to be added on demand should have a state of "CLOUD". These nodes will not actually appear in Slurm commands until after they are configured for use.
- Weight
- Each node can be configured with a weight indicating the desirability of using that resource. Nodes with lower weights are used before those with higher weights.
Nodes to be acquired on demand can be placed into their own Slurm partition. This mode of operation can be used to make those nodes available only when explicitly requested by the user. Note that jobs can be submitted to multiple partitions and will use resources from whichever partition permits faster initiation. A sample configuration is shown below in which nodes are added from the cloud when the workload exceeds available resources. Users can explicitly request local resources or resources from the cloud by using the "--constraint" option, as illustrated after the configuration excerpt.
# Excerpt of slurm.conf
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

SuspendProgram=/usr/sbin/slurm_suspend
ResumeProgram=/usr/sbin/slurm_resume
SuspendTime=600
SuspendExcNodes=tux[0-127]
TreeWidth=128

NodeName=DEFAULT Sockets=1 CoresPerSocket=4 ThreadsPerCore=2
NodeName=tux[0-127] Weight=1 Feature=local State=UNKNOWN
NodeName=ec[0-127] Weight=8 Feature=cloud State=CLOUD

PartitionName=debug MaxTime=1:00:00 Nodes=tux[0-32] Default=yes
PartitionName=batch MaxTime=8:00:00 Nodes=tux[0-127],ec[0-127] Default=no
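With a configuration along these lines, a user could steer a job toward one class of node or the other; the job script name below is only a placeholder:
sbatch --constraint=local -N4 my_job.sh   # run only on the local tux nodes
sbatch --constraint=cloud -N4 my_job.sh   # force the job onto the cloud (ec) nodes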
Operational Details
When the slurmctld daemon starts, all nodes with a state of CLOUD will be included in its internal tables, but these node records will not be seen with user commands or used by applications until allocated to some job. Once allocated, the ResumeProgram is executed and should do the following:
- Boot the node
- Configure and start Munge (depends upon configuration)
- Install the Slurm configuration file, slurm.conf, on the node. Note that the configuration file will generally be identical on all nodes and not include NodeAddr or NodeHostname configuration parameters for any nodes in the cloud. Slurm commands executed on this node only need to communicate with the slurmctld daemon on the SlurmctldHost.
- Notify the slurmctld daemon of the node's hostname and network address:
scontrol update nodename=ec0 nodeaddr=123.45.67.89 nodehostname=whatever
Note that the node address and hostname information set by the scontrol command are preserved when the slurmctld daemon is restarted unless the "-c" (cold-start) option is used.
- Start the slurmd daemon on the node
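Putting these steps together, a ResumeProgram might be structured along the following lines. This is only a sketch: provision_cloud_node stands in for whatever site-specific tooling actually creates the instance (with Munge, slurm.conf and slurmd installed) and reports its address, and a real script would need error handling and logging.
#!/bin/bash
# Sketch of a ResumeProgram. Slurm passes the nodes to resume as a hostlist
# expression (e.g. "ec[0-3]") in the first argument.
for node in $(scontrol show hostnames "$1"); do
    # provision_cloud_node is a hypothetical site-specific helper that boots
    # the instance and prints its IP address on stdout.
    addr=$(provision_cloud_node "$node") || exit 1
    # Tell slurmctld how to reach the new node.
    scontrol update nodename="$node" nodeaddr="$addr" nodehostname="$node"
done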
The SuspendProgram only needs to relinquish the node back to the cloud.
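A corresponding SuspendProgram can be equally simple; terminate_cloud_node below is again a hypothetical site-specific helper:
#!/bin/bash
# Sketch of a SuspendProgram: release each node back to the cloud.
for node in $(scontrol show hostnames "$1"); do
    terminate_cloud_node "$node"
done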
An environment variable SLURM_NODE_ALIASES contains sets of node name, communication address and hostname. The variable is set by salloc, sbatch, and srun, and is then used by srun to determine the destination for job launch communication messages. This environment variable is only set for nodes allocated from the cloud. If a job is allocated some resources from the local cluster and others from the cloud, only the nodes from the cloud will appear in SLURM_NODE_ALIASES. The sets are comma separated and the elements within each set are separated by colons. For example:
SLURM_NODE_ALIASES=ec0:123.45.67.8:foo,ec2:123.45.67.9:bar
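A job step could split the variable into its per-node fields with a few lines of shell; this is a minimal sketch assuming the name:address:hostname format shown above:
IFS=',' read -ra aliases <<< "$SLURM_NODE_ALIASES"
for entry in "${aliases[@]}"; do
    IFS=':' read -r name addr host <<< "$entry"
    echo "node=$name address=$addr hostname=$host"   # e.g. node=ec0 address=123.45.67.8 hostname=foo
done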
Remaining Work
- We need scripts to provision resources from EC2.
- The SLURM_NODE_ALIASES environment variable needs to change if a job expands (adds resources).
- Some MPI implementations will not work due to the node naming.
- Some tests in Slurm's test suite fail.
Last modified 9 June 2021