Advanced Resource Reservation Guide
Slurm has the ability to reserve resources for jobs being executed by select users and/or select bank accounts. A resource reservation identifies the resources in that reservation and a time period during which the reservation is available. The resources which can be reserved include cores, nodes, licenses and/or burst buffers. A reservation that contains nodes or cores is associated with one partition, and can't span resources over multiple partitions. The only exception from this is when the reservation is created with explicitly requested nodes. Note that resource reservations are not compatible with Slurm's gang scheduler plugin since the termination time of running jobs cannot be accurately predicted.
Note that reserved burst buffers and licenses are treated somewhat differently than reserved cores or nodes. When cores or nodes are reserved, then jobs using that reservation can use only those resources (this behavior can be change using FLEX flag) and no other jobs can use those resources. Reserved burst buffers and licenses can only be used by jobs associated with that reservation, but licenses not explicitly reserved are available to any job. This eliminates the need to explicitly put licenses into every advanced reservation created.
Reservations can be created, updated, or destroyed only by user root or the configured SlurmUser using the scontrol command. The scontrol and sview commands can be used to view reservations. The man pages for the various commands contain details.
Reservation Creation
One common mode of operation for a reservation would be to reserve an entire computer at a particular time for a system down time. The example below shows the creation of a full-system reservation at 16:00 hours on 6 February and lasting for 120 minutes. The "maint" flag is used to identify the reservation for accounting purposes as system maintenance. The "ignore_jobs" flag is used to indicate that we can ignore currently running jobs when creating this reservation. By default, only resources which are not expected to have a running job at the start time can be reserved (the time limit of all running jobs will have been reached). In this case we can manually cancel the running jobs as needed to perform system maintenance. As the reservation time approaches, only jobs that can complete by the reservation time will be initiated.
$ scontrol create reservation starttime=2009-02-06T16:00:00 \ duration=120 user=root flags=maint,ignore_jobs nodes=ALL Reservation created: root_3 $ scontrol show reservation ReservationName=root_3 StartTime=2009-02-06T16:00:00 EndTime=2009-02-06T18:00:00 Duration=120 Nodes=ALL NodeCnt=20 Features=(null) PartitionName=(null) Flags=MAINT,SPEC_NODES,IGNORE_JOBS Licenses=(null) BurstBuffers=(null) Users=root Accounts=(null)
A variation of this would be to configure license to represent system resources, such as a global file system. The system resource may not require an actual license for use, but Slurm licenses can be used to prevent jobs needed the resource from being started when that resource is unavailable. One could create a reservation for all of those licenses in order to perform maintenance on that resource. In the example below, we create a reservation for 1000 licenses with the name of "lustre". If there are a total of 1000 lustre licenses configured in this cluster, this reservation will prevent any job specifying the need for needed a lustre license from being scheduled on this cluster during this reservation.
$ scontrol create reservation starttime=2009-04-06T16:00:00 \ duration=120 user=root flags=license_only \ licenses=lustre:1000 Reservation created: root_4 $ scontrol show reservation ReservationName=root_4 StartTime=2009-04-06T16:00:00 EndTime=2009-04-06T18:00:00 Duration=120 Nodes= NodeCnt=0 Features=(null) PartitionName=(null) Flags=LICENSE_ONLY Licenses=lustre*1000 BurstBuffers=(null) Users=root Accounts=(null)
Another mode of operation would be to reserve specific nodes for an indefinite period in order to study problems on those nodes. This could also be accomplished using a Slurm partition specifically for this purpose, but that would fail to capture the maintenance nature of their use.
$ scontrol create reservation user=root starttime=now \ duration=infinite flags=maint nodes=sun000 Reservation created: root_5 $ scontrol show res ReservationName=root_5 StartTime=2009-02-04T16:22:57 EndTime=2009-02-04T16:21:57 Duration=4294967295 Nodes=sun000 NodeCnt=1 Features=(null) PartitionName=(null) Flags=MAINT,SPEC_NODES Licenses=(null) BurstBuffers=(null) Users=root Accounts=(null)
Our next example is to reserve ten nodes in the default Slurm partition starting at noon and with a duration of 60 minutes occurring daily. The reservation will be available only to users "alan" and "brenda".
$ scontrol create reservation user=alan,brenda \ starttime=noon duration=60 flags=daily nodecnt=10 Reservation created: alan_6 $ scontrol show res ReservationName=alan_6 StartTime=2009-02-05T12:00:00 EndTime=2009-02-05T13:00:00 Duration=60 Nodes=sun[000-003,007,010-013,017] NodeCnt=10 Features=(null) PartitionName=pdebug Flags=DAILY Licenses=(null) BurstBuffers=(null) Users=alan,brenda Accounts=(null)
Our next example is to reserve 100GB of burst buffer space starting at noon today and with a duration of 60 minutes. The reservation will be available only to users "alan" and "brenda".
$ scontrol create reservation user=alan,brenda \ starttime=noon duration=60 flags=any_nodes burstbuffer=100GB Reservation created: alan_7 $ scontrol show res ReservationName=alan_7 StartTime=2009-02-05T12:00:00 EndTime=2009-02-05T13:00:00 Duration=60 Nodes= NodeCnt=0 Features=(null) PartitionName=(null) Flags=ANY_NODES Licenses=(null) BurstBuffer=100GB Users=alan,brenda Accounts=(null)
Reservations can be optimized with respect to system topology if the reservation request includes information about the sizes of jobs to be created. This is especially important for BlueGene systems due to restrictive rules about the topology of created blocks (due to hardware constraints and/or Slurm's configuration). To take advantage of this optimization, specify the sizes of jobs of to be concurrently executed. The example below creates a reservation containing 4096 c-nodes on a BlueGene system so that two 2048 c-node jobs can execute simultaneously.
$ scontrol create reservation user=alan,brenda \ starttime=noon duration=60 nodecnt=2k,2k Reservation created: alan_8 $ scontrol show res ReservationName=alan_8 StartTime=2011-12-05T12:00:00 EndTime=2011-12-05T13:00:00 Duration=60 Nodes=bgp[000x011,210x311] NodeCnt=4096 Features=(null) PartitionName=pdebug Flags= Licenses=(null) BurstBuffers=(null) Users=alan,brenda Accounts=(null)
Note that specific nodes to be associated with the reservation are identified immediately after creation of the reservation. This permits users to stage files to the nodes in preparation for use during the reservation. Note that the reservation creation request can also identify the partition from which to select the nodes or _one_ feature that every selected node must contain.
On a smaller system, one might want to reserve cores rather than
whole nodes.
This capability permits the administrator to identify the core count to be
reserved on each node as shown in the examples below.
NOTE: Core reservations are not available when the system is configured
to use the select/linear plugin.
# Create a two core reservation for user alan $ scontrol create reservation StartTime=now Duration=60 \ NodeCnt=1 CoreCnt=2 User=alan # Create a reservation for user brenda with two cores on # node tux8 and 4 cores on node tux9 $ scontrol create reservation StartTime=now Duration=60 \ Nodes=tux8,tux9 CoreCnt=2,4 User=brenda
Reservations can not only be created for the use of specific accounts and users, but specific accounts and/or users can be prevented from using them. In the following example, a reservation is created for account "foo", but user "alan" is prevented from using the reservation even when using the account "foo".
$ scontrol create reservation account=foo \ user=-alan partition=pdebug \ starttime=noon duration=60 nodecnt=2k,2k Reservation created: alan_9 $ scontrol show res ReservationName=alan_9 StartTime=2011-12-05T13:00:00 EndTime=2011-12-05T14:00:00 Duration=60 Nodes=bgp[000x011,210x311] NodeCnt=4096 Features=(null) PartitionName=pdebug Flags= Licenses=(null) BurstBuffers=(null) Users=-alan Accounts=foo
When creating a reservation, you can request that Slurm include all the nodes in a partition by specifying the PartitionName option. If you only want a certain number of nodes or CPUs from that partition you can combine PartitionName with the CoreCnt, NodeCnt or TRES options to specify how many of a resource you want. In the following example, a reservation is created in the 'gpu' partition that uses the TRES option to limit the reservation to 24 processors, divided among 4 nodes.
$ scontrol create reservationname=test start=now duration=1 \ user=user1 partition=gpu tres=cpu=24,node=4 Reservation created: test $ scontrol show res ReservationName=test StartTime=2020-08-28T11:07:09 EndTime=2020-08-28T11:08:09 Duration=00:01:00 Nodes=node[01-04] NodeCnt=4 CoreCnt=24 Features=(null) PartitionName=gpu NodeName=node01 CoreIDs=0-5 NodeName=node02 CoreIDs=0-5 NodeName=node03 CoreIDs=0-5 NodeName=node04 CoreIDs=0-5 TRES=cpu=24 Users=user1 Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a MaxStartDelay=(null)
Reservation Use
The reservation create response includes the reservation's name. This name is automatically generated by Slurm based upon the first user or account name and a numeric suffix. In order to use the reservation, the job submit request must explicitly specify that reservation name. The job must be contained completely within the named reservation. The job will be canceled after the reservation reaches its EndTime. If letting the job continue execution after the reservation EndTime, a configuration option ResvOverRun in slurm.conf can be set to control how long the job can continue execution.
$ sbatch --reservation=alan_6 -N4 my.script sbatch: Submitted batch job 65540
Note that use of a reservation does not alter a job's priority, but it does act as an enhancement to the job's priority. Any job with a reservation is considered for scheduling to resources before any other job in the same Slurm partition (queue) not associated with a reservation.
Reservation Modification
Reservations can be modified by user root as desired. For example their duration could be altered or the users granted access changed as shown below:
$ scontrol update ReservationName=root_3 \ duration=150 users=admin Reservation updated. bash-3.00$ scontrol show ReservationName=root_3 ReservationName=root_3 StartTime=2009-02-06T16:00:00 EndTime=2009-02-06T18:30:00 Duration=150 Nodes=ALL NodeCnt=20 Features=(null) PartitionName=(null) Flags=MAINT,SPEC_NODES Licenses=(null) BurstBuffers=(null) Users=admin Accounts=(null)
Reservation Deletion
Reservations are automatically purged after their end time. They may also be manually deleted as shown below. Note that a reservation can not be deleted while there are jobs running in it.
$ scontrol delete ReservationName=alan_6
NOTE: By default, when a reservation ends the reservation request will be removed from any pending jobs submitted to the reservation and will be put into a held state. Use the NO_HOLD_JOBS_AFTER_END reservation flag to let jobs run outside of the reservation after the reservation is gone.
Overlapping Reservations
By default, reservations must not overlap. They must either include different nodes or operate at different times. If specific nodes are not specified when a reservation is created, Slurm will automatically select nodes to avoid overlap and ensure that the selected nodes are available when the reservation begins.
There is very limited support for overlapping reservations with two specific modes of operation available. For ease of system maintenance, you can create a reservation with the "maint" flag that overlaps existing reservations. This permits an administrator to easily create a maintenance reservation for an entire cluster without needing to remove or reschedule pre-existing reservations. Users requesting access to one of these pre-existing reservations will be prevented from using resources that are also in this maintenance reservation. For example, users alan and brenda might have a reservation for some nodes daily from noon until 1PM. If there is a maintenance reservation for all nodes starting at 12:30PM, the only jobs they may start in their reservation would have to be completed by 12:30PM, when the maintenance reservation begins.
The second exception operates in the same manner as a maintenance reservation except that is it not logged in the accounting system as nodes reserved for maintenance. It requires the use of the "overlap" flag when creating the second reservation. This might be used to ensure availability of resources for a specific user within a group having a reservation. Using the previous example of alan and brenda having a 10 node reservation for 60 minutes, we might want to reserve 4 nodes of that for brenda during the first 30 minutes of the time period. In this case, the creation of one overlapping reservation (for a total of two reservations) may be simpler than creating three separate reservations, partly since the use of any reservation requires the job specification of the reservation name.
- A six node reservation for both alan and brenda that lasts the full 60 minutes
- A four node reservation for brenda for the first 30 minutes
- A four node reservation for both alan and brenda that lasts for the final 30 minutes
If the "maint" or "overlap" flag is used when creating reservations, one could create a reservation within a reservation within a third reservation. Note a reservation having a "maint" or "overlap" flag will not have resources removed from it by a subsequent reservation also having a "maint" or "overlap" flag, so nesting of reservations only works to a depth of two.
Reservations Floating Through Time
Slurm can be used to create an advanced reservation with a start time that remains a fixed period of time in the future. These reservation are not intended to run jobs, but to prevent long running jobs from being initiated on specific nodes. That node might be placed in a DRAINING state to prevent any new jobs from being started there. Alternately, an advanced reservation might be placed on the node to prevent jobs exceeding some specific time limit from being started. Attempts by users to make use of a reservation with a floating start time will be rejected. When ready to perform the maintenance, place the node in DRAINING state and delete the previously created advanced reservation.
Create the reservation by using the flag value of TIME_FLOAT and a start time that is relative to the current time (use the keyword now). The reservation duration should generally be a value which is large relative to typical job run times in order to not adversely impact backfill scheduling decisions. Alternately the reservation can have a specific end time, in which case the reservation's start time will increase through time until the reservation's end time is reached. When the current time passes the reservation end time then the reservation will be purged. In the example below, node tux8 is prevented from starting any jobs exceeding a 60 minute time limit. The duration of this reservation is 100 (minutes).
$ scontrol create reservation user=operator nodes=tux8 \ starttime=now+60minutes duration=100 flags=time_float
Reservations that Replace Allocated Resources
Slurm can create an advanced reservation for which nodes which are allocated to jobs are automatically replaced with new idle nodes. The effect of this is to always maintain a constant size pool of resources. This is accomplished by using the REPLACE flag as shown in the example below. This option is not supported for reservations specifying cores which span more than one node, rather than full nodes. (E.g. a 1 core reservation on node "tux1" will be moved if node "tux1" goes down, but a reservation containing 2 cores on node "tux1" and 3 cores on "tux2" will not be moved if "tux1" goes down.)
$ scontrol create reservation starttime=now duration=60 \ users=foo nodecnt=2 flags=replace Reservation created: foo_82 $ scontrol show res ReservationName=foo_82 StartTime=2014-11-20T16:21:11 EndTime=2014-11-20T17:21:11 Duration=01:00:00 Nodes=tux[0-1] NodeCnt=2 CoreCnt=12 Features=(null) PartitionName=debug Flags=REPLACE Users=jette Accounts=(null) Licenses=(null) State=ACTIVE $ sbatch -n4 --reservation=foo_82 tmp Submitted batch job 97 $ scontrol show res ReservationName=foo_82 StartTime=2014-11-20T16:21:11 EndTime=2014-11-20T17:21:11 Duration=01:00:00 Nodes=tux[1-2] NodeCnt=2 CoreCnt=12 Features=(null) PartitionName=debug Flags=REPLACE Users=jette Accounts=(null) Licenses=(null) State=ACTIVE $ sbatch -n4 --reservation=foo_82 tmp Submitted batch job 98 $ scontrol show res ReservationName=foo_82 StartTime=2014-11-20T16:21:11 EndTime=2014-11-20T17:21:11 Duration=01:00:00 Nodes=tux[2-3] NodeCnt=2 CoreCnt=12 Features=(null) PartitionName=debug Flags=REPLACE Users=jette Accounts=(null) Licenses=(null) State=ACTIVE $ squeue JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 97 debug tmp foo R 0:09 1 tux0 98 debug tmp foo R 0:07 1 tux1
NOTE: If you want the reservation to maintain the size you specify, but not grow as the reservation is used you can use the REPLACE_DOWN flag. Nodes that are DOWN or DRAINED will be replaced, but not nodes that have jobs running on them.
FLEX Reservations
By default, jobs that run in reservations must fit within the time and size constraints of the reserved resources. With the FLEX flag jobs are able to start before the reservation begins or continue after it ends. They are also able to use the reserved node(s) along with additional nodes if required and available.
The default behavior for jobs that request a reservation is that they must be able to run within the confines (time and space) of that reservation. The following example shows that the FLEX flag allows the job to run before the reservation starts, after it ends, and on a node outside of the reservation.
$ scontrol create reservation user=user1 nodes=node01 starttime=now+10minutes duration=10 flags=flex Reservation created: user1_831 $ sbatch -wnode0[1-2] -t30:00 --reservation=user1_831 test.job Submitted batch job 57996 $ squeue JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 57996 debug sleepjob user1 R 0:08 2 node[01-02]
Magnetic Reservations
The default behavior for reservations is that jobs must request a reservation in order to run in it. The MAGNETIC flag allows you to create a reservation that will allow jobs to run in it without requiring that they specify the name of the reservation. The reservation will only "attract" jobs that meet the access control requirements.
The following example shows a reservation created on node05. The user specified as being able to access the reservation then submits a job and the job starts on the reserved node.
$ scontrol create reservation user=user1 nodes=node05 starttime=now duration=10 flags=magnetic Reservation created: user1_850 $ scontrol show res ReservationName=user1_850 StartTime=2020-07-29T13:44:13 EndTime=2020-07-29T13:54:13 Duration=00:10:00 Nodes=node05 NodeCnt=1 CoreCnt=12 Features=(null) PartitionName=(null) Flags=SPEC_NODES,MAGNETIC TRES=cpu=12 Users=user1 Accounts=(null) Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a MaxStartDelay=(null) $ sbatch -N1 -t5:00 test.job Submitted batch job 62297 $ squeue JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 62297 debug sleepjob user1 R 0:04 1 node05
Reservation Purging After Last Job
A reservation may be automatically purged after the last associated job completes. This is accomplished by using a "purge_comp" flag. Once the reservation has been created, it must be populated within 5 minutes of its start time or it will be purged before any jobs have been run.
Reservation Accounting
Jobs executed within a reservation are accounted for using the appropriate user and bank account. If resources within a reservation are not used, those resources will be accounted for as being used by all users or bank accounts associated with the reservation on an equal basis (e.g. if two users are eligible to use a reservation and neither does, each user will be reported to have used half of the reserved resources).
Prolog and Epilog
Slurm supports both a reservation prolog and epilog. They may be configured using the ResvProlog and ResvEpilog configuration parameters in the slurm.conf file. These scripts can be used to cancel jobs, modify partition configuration, etc.
Future Work
Reservations made within a partition having gang scheduling assumes the highest level rather than the actual level of time-slicing when considering the initiation of jobs. This will prevent the initiation of some jobs which would complete execution before a reservation given fewer jobs to time-slice with.
Last modified 28 August 2020