Generic Resource (GRES) Scheduling
Contents
OverviewConfiguration
Running Jobs
GPU Management
MPS Management
MIC Management
Overview
Slurm supports the ability to define and schedule arbitrary Generic RESources (GRES). Additional built-in features are enabled for specific GRES types, including Graphics Processing Units (GPUs) and CUDA Multi-Process Service (MPS) devices, through an extensible plugin mechanism.
Configuration
By default, no GRES are enabled in the cluster's configuration. You must explicitly specify which GRES are to be managed in the slurm.conf configuration file. The configuration parameters of interest are GresTypes and Gres.
For more details, see GresTypes and Gres in the slurm.conf man page.
Note that the GRES specification for each node works in the same fashion as the other resources managed. Nodes which are found to have fewer resources than configured will be placed in a DRAIN state.
Snippet from an example slurm.conf file:
# Configure four GPUs (with MPS), plus bandwidth GresTypes=gpu,mps,bandwidth NodeName=tux[0-7] Gres=gpu:tesla:2,gpu:kepler:2,mps:400,bandwidth:lustre:no_consume:4G
Each compute node with generic resources typically contain a gres.conf file describing which resources are available on the node, their count, associated device files and cores which should be used with those resources.
There are cases where you may want to define a Generic Resource on a node without specifying a quantity of that GRES. For example, the filesystem type of a node doesn't decrease in value as jobs run on that node. You can use the no_consume flag to allow users to request a GRES without having a defined count that gets used as it is requested.
If AutoDetect=nvml is set in gres.conf, and the NVIDIA Management Library (NVML) is installed on the node and was found during Slurm configuration, configuration details will automatically be filled in for any system-detected NVIDIA GPU. This removes the need to explicitly configure GPUs in gres.conf, though the Gres= line in slurm.conf is still required in order to tell slurmctld how many GRES to expect.
By default, all system-detected devices are added to the node. However, if Type and File in gres.conf match a GPU on the system, any other properties explicitly specified (e.g. Cores or Links) can be double-checked against it. If the system-detected GPU differs from its matching GPU configuration, then the GPU is omitted from the node with an error. This allows gres.conf to serve as an optional sanity check and notifies administrators of any unexpected changes in GPU properties.
To view available gres.conf configuration parameters, see the gres.conf man page.
Example gres.conf file:
# Configure four GPUs (with MPS), plus bandwidth AutoDetect=nvml Name=gpu Type=gp100 File=/dev/nvidia0 Cores=0,1 Name=gpu Type=gp100 File=/dev/nvidia1 Cores=0,1 Name=gpu Type=p6000 File=/dev/nvidia2 Cores=2,3 Name=gpu Type=p6000 File=/dev/nvidia3 Cores=2,3 Name=mps Count=200 File=/dev/nvidia0 Name=mps Count=200 File=/dev/nvidia1 Name=mps Count=100 File=/dev/nvidia2 Name=mps Count=100 File=/dev/nvidia3 Name=bandwidth Type=lustre Count=4G Flags=CountOnly
In this example, since AutoDetect=nvml is specified, Cores for each GPU will be checked against a corresponding GPU found on the system matching the Type and File specified. Since Links is not specified, it will be automatically filled in according to what is found on the system. If a matching system GPU is not found, no validation takes place and the GPU is assumed to be as the configuration says.
For Type to match a system-detected device, it must either exactly match or be a substring of the GPU name reported by slurmd via the AutoDetect mechanism. This GPU name will have all spaces replaced with underscores. To see the GPU name, set SlurmdDebug=debug2 in your slurm.conf and either restart or reconfigure slurmd and check the slurmd log. For example, with AutoDetect=nvml:
debug: gpu/nvml: init: init: GPU NVML plugin loaded debug2: gpu/nvml: _nvml_init: Successfully initialized NVML debug: gpu/nvml: _get_system_gpu_list_nvml: Systems Graphics Driver Version: 450.36.06 debug: gpu/nvml: _get_system_gpu_list_nvml: NVML Library Version: 11.450.36.06 debug2: gpu/nvml: _get_system_gpu_list_nvml: Total CPU count: 6 debug2: gpu/nvml: _get_system_gpu_list_nvml: Device count: 1 debug2: gpu/nvml: _get_system_gpu_list_nvml: GPU index 0: debug2: gpu/nvml: _get_system_gpu_list_nvml: Name: geforce_rtx_2060 debug2: gpu/nvml: _get_system_gpu_list_nvml: UUID: GPU-g44ef22a-d954-c552-b5c4-7371354534b2 debug2: gpu/nvml: _get_system_gpu_list_nvml: PCI Domain/Bus/Device: 0:1:0 debug2: gpu/nvml: _get_system_gpu_list_nvml: PCI Bus ID: 00000000:01:00.0 debug2: gpu/nvml: _get_system_gpu_list_nvml: NVLinks: -1 debug2: gpu/nvml: _get_system_gpu_list_nvml: Device File (minor number): /dev/nvidia0 debug2: gpu/nvml: _get_system_gpu_list_nvml: CPU Affinity Range - Machine: 0-5 debug2: gpu/nvml: _get_system_gpu_list_nvml: Core Affinity Range - Abstract: 0-5
In this example, the GPU's name is reported as
geforce_rtx_2060
. So in your slurm.conf and
gres.conf, the GPU Type can be set to
geforce
, rtx
,
2060
, geforce_rtx_2060
, or any other
substring, and slurmd should be able to match it to the system-detected
device geforce_rtx_2060
.
Running Jobs
Jobs will not be allocated any generic resources unless specifically requested at job submit time using the options:
- --gres
- Generic resources required per node
- --gpus
- GPUs required per job
- --gpus-per-node
- GPUs required per node. Equivalent to the --gres option for GPUs.
- --gpus-per-socket
- GPUs required per socket. Requires the job to specify a task socket.
- --gpus-per-task
- GPUs required per task. Requires the job to specify a task count.
All of these options are supported by the salloc, sbatch and
srun commands.
Note that all of the --gpu* options are only supported by Slurm's
select/cons_tres plugin.
Jobs requesting these options when the select/cons_tres plugin is not
configured will be rejected.
The --gres option requires an argument specifying which generic resources
are required and how many resources using the form name[:type:count]
while all of the --gpu* options require an argument of the form
[type]:count.
The name is the same name as
specified by the GresTypes and Gres configuration parameters.
type identifies a specific type of that generic resource (e.g. a
specific model of GPU).
count specifies how many resources are required and has a default
value of 1. For example:
sbatch --gres=gpu:kepler:2 ....
Several addition resource requirement specifications are available specifically for GPUs and detailed descriptions about these options are available in the man pages for the job submission commands. As for the --gpu* option, these options are only supported by Slurm's select/cons_tres plugin.
- --cpus-per-gpu
- Count of CPUs allocated per GPU.
- --gpu-bind
- Define how tasks are bound to GPUs.
- --gpu-freq
- Specify GPU frequency and/or GPU memory frequency.
- --mem-per-gpu
- Memory allocated per GPU.
Jobs will be allocated specific generic resources as needed to satisfy the request. If the job is suspended, those resources do not become available for use by other jobs.
Job steps can be allocated generic resources from those allocated to the job using the --gres option with the srun command as described above. By default, a job step will be allocated all of the generic resources allocated to the job. If desired, the job step may explicitly specify a different generic resource count than the job. This design choice was based upon a scenario where each job executes many job steps. If job steps were granted access to all generic resources by default, some job steps would need to explicitly specify zero generic resource counts, which we considered more confusing. The job step can be allocated specific generic resources and those resources will not be available to other job steps. A simple example is shown below.
#!/bin/bash # # gres_test.bash # Submit as follows: # sbatch --gres=gpu:4 -n4 -N1-1 gres_test.bash # srun --gres=gpu:2 -n2 --exclusive show_device.sh & srun --gres=gpu:1 -n1 --exclusive show_device.sh & srun --gres=gpu:1 -n1 --exclusive show_device.sh & wait
GPU Management
In the case of Slurm's GRES plugin for GPUs, the environment variable
CUDA_VISIBLE_DEVICES
is set for each job step to determine which GPUs are
available for its use on each node. This environment variable is only set
when tasks are launched on a specific compute node (no global environment
variable is set for the salloc command and the environment variable set
for the sbatch command only reflects the GPUs allocated to that job
on that node, node zero of the allocation).
CUDA version 3.1 (or higher) uses this environment
variable in order to run multiple jobs or job steps on a node with GPUs
and ensure that the resources assigned to each are unique. In the example
above, the allocated node may have four or more graphics devices. In that
case, CUDA_VISIBLE_DEVICES
will reference unique devices for each file and
the output might resemble this:
JobStep=1234.0 CUDA_VISIBLE_DEVICES=0,1 JobStep=1234.1 CUDA_VISIBLE_DEVICES=2 JobStep=1234.2 CUDA_VISIBLE_DEVICES=3
NOTE: Be sure to specify the File parameters in the gres.conf file and ensure they are in the increasing numeric order.
The CUDA_VISIBLE_DEVICES
environment variable will also be set in the job's Prolog and Epilog programs.
Note that the environment variable set for the job may differ from that set for
the Prolog and Epilog if Slurm is configured to constrain the device files
visible to a job using Linux cgroup.
This is because the Prolog and Epilog programs run outside of any Linux
cgroup while the job runs inside of the cgroup and may thus have a
different set of visible devices.
For example, if a job is allocated the device "/dev/nvidia1", then
CUDA_VISIBLE_DEVICES
will be set to a value of
"1" in the Prolog and Epilog while the job's value of
CUDA_VISIBLE_DEVICES
will be set to a
value of "0" (i.e. the first GPU device visible to the job).
For more information see the
Prolog and Epilog Guide.
When possible, Slurm automatically determines the GPUs on the system using
NVML. NVML (which powers the
nvidia-smi
tool) numbers GPUs in order by their
PCI bus IDs. For this numbering to match the numbering reported by CUDA, the
CUDA_DEVICE_ORDER
environmental variable must
be set to CUDA_DEVICE_ORDER=PCI_BUS_ID
.
GPU device files (e.g. /dev/nvidia1) are based on the Linux minor number assignment, while NVML's device numbers are assigned via PCI bus ID, from lowest to highest. Mapping between these two is nondeterministic and system dependent, and could vary between boots after hardware or OS changes. For the most part, this assignment seems fairly stable. However, an after-bootup check is required to guarantee that a GPU device is assigned to a specific device file.
Please consult the
NVIDIA CUDA documentation for more information about the
CUDA_VISIBLE_DEVICES
and
CUDA_DEVICE_ORDER
environmental variables.
MPS Management
CUDA Multi-Process Service (MPS) provides a mechanism where GPUs can be shared by multiple jobs, where each job is allocated some percentage of the GPU's resources. The total count of MPS resources available on a node should be configured in the slurm.conf file (e.g. "NodeName=tux[1-16] Gres=gpu:2,mps:200"). Several options are available for configuring MPS in the gres.conf file as listed below with examples following that:
- No MPS configuration: The count of gres/mps elements defined in the slurm.conf will be evenly distributed across all GPUs configured on the node. For the example, "NodeName=tux[1-16] Gres=gpu:2,mps:200" will configure a count of 100 gres/mps resources on each of the two GPUs.
- MPS configuration includes only the Name and Count parameters: The count of gres/mps elements will be evenly distributed across all GPUs configured on the node. This is similar to case 1, but places duplicate configuration in the gres.conf file.
- MPS configuration includes the Name, File and Count parameters: Each File parameter should identify the device file path of a GPU and the Count should identify the number of gres/mps resources available for that specific GPU device. This may be useful in a heterogeneous environment. For example, some GPUs on a node may be more powerful than others and thus be associated with a higher gres/mps count. Another use case would be to prevent some GPUs from being used for MPS (i.e. they would have an MPS count of zero).
Note that Type and Cores parameters for gres/mps are ignored. That information is copied from the gres/gpu configuration.
Note the Count parameter is translated to a percentage, so the value would typically be a multiple of 100.
Note that if NVIDIA's NVML library is installed, the GPU configuration (i.e. Type, File, Cores and Links data) will be automatically gathered from the library and need not be recorded in the gres.conf file.
Note the same GPU can be allocated either as a GPU type of GRES or as an MPS type of GRES, but not both. In other words, once a GPU has been allocated as a gres/gpu resource it will not be available as a gres/mps. Likewise, once a GPU has been allocated as a gres/mps resource it will not be available as a gres/gpu. However the same GPU can be allocated as MPS generic resources to multiple jobs belonging to multiple users, so long as the total count of MPS allocated to jobs does not exceed the configured count. Some example configurations for Slurm's gres.conf file are shown below.
# Example 1 of gres.conf # Configure four GPUs (with MPS) AutoDetect=nvml Name=gpu Type=gp100 File=/dev/nvidia0 Cores=0,1 Name=gpu Type=gp100 File=/dev/nvidia1 Cores=0,1 Name=gpu Type=p6000 File=/dev/nvidia2 Cores=2,3 Name=gpu Type=p6000 File=/dev/nvidia3 Cores=2,3 # Set gres/mps Count value to 100 on each of the 4 available GPUs Name=mps Count=400
# Example 2 of gres.conf # Configure four different GPU types (with MPS) AutoDetect=nvml Name=gpu Type=gtx1080 File=/dev/nvidia0 Cores=0,1 Name=gpu Type=gtx1070 File=/dev/nvidia1 Cores=0,1 Name=gpu Type=gtx1060 File=/dev/nvidia2 Cores=2,3 Name=gpu Type=gtx1050 File=/dev/nvidia3 Cores=2,3 Name=mps Count=1300 File=/dev/nvidia0 Name=mps Count=1200 File=/dev/nvidia1 Name=mps Count=1100 File=/dev/nvidia2 Name=mps Count=1000 File=/dev/nvidia3
NOTE: gres/mps requires the use of the select/cons_tres plugin.
Job requests for MPS will be processed the same as any other GRES except that the request must be satisfied using only one GPU per node and only one GPU per node may be configured for use with MPS. For example, a job request for "--gres=mps:50" will not be satisfied by using 20 percent of one GPU and 30 percent of a second GPU on a single node. Multiple jobs from different users can use MPS on a node at the same time. Note that GRES types of GPU and MPS can not be requested within a single job. Also jobs requesting MPS resources can not specify a GPU frequency.
A prolog program should be used to start and stop MPS servers as needed.
A sample prolog script to do this is included with the Slurm distribution in
the location etc/prolog.example.
Its mode of operation is if a job is allocated gres/mps resources then the
Prolog will have the CUDA_VISIBLE_DEVICES
,
CUDA_MPS_ACTIVE_THREAD_PERCENTAGE
, and
SLURM_JOB_UID
environment variables set.
The Prolog should then make sure that an MPS server is started for that GPU
and user (UID == User ID).
It also records the GPU device ID in a local file.
If a job is allocated gres/gpu resources then the Prolog will have the
CUDA_VISIBLE_DEVICES
and
SLURM_JOB_UID
environment variables set
(no CUDA_MPS_ACTIVE_THREAD_PERCENTAGE
).
The Prolog should then terminate any MPS server associated with that GPU.
It may be necessary to modify this script as needed for the local environment.
For more information see the
Prolog and Epilog Guide.
Jobs requesting MPS resources will have the
CUDA_VISIBLE_DEVICES
and CUDA_DEVICE_ORDER
environment variables set.
The device ID is relative to those resources under MPS server control and will
always have a value of zero in the current implementation (only one GPU will be
usable in MPS mode per node).
The job will also have the
CUDA_MPS_ACTIVE_THREAD_PERCENTAGE
environment variable set to that job's percentage of MPS resources available on
the assigned GPU.
The percentage will be calculated based upon the portion of the configured
Count on the Gres is allocated to a job of step.
For example, a job requesting "--gres=gpu:200" and using
configuration example 2 above would be
allocated
15% of the gtx1080 (File=/dev/nvidia0, 200 x 100 / 1300 = 15), or
16% of the gtx1070 (File=/dev/nvidia0, 200 x 100 / 1200 = 16), or
18% of the gtx1060 (File=/dev/nvidia0, 200 x 100 / 1100 = 18), or
20% of the gtx1050 (File=/dev/nvidia0, 200 x 100 / 1000 = 20).
An alternate mode of operation would be to permit jobs to be allocated whole GPUs then trigger the starting of an MPS server based upon comments in the job. For example, if a job is allocated whole GPUs then search for a comment of "mps-per-gpu" or "mps-per-node" in the job (using the "scontrol show job" command) and use that as a basis for starting one MPS daemon per GPU or across all GPUs respectively.
Please consult the NVIDIA Multi-Process Service documentation for more information about MPS.
Note that a vulnerability exists in previous versions of the NVIDIA driver that may affect users when sharing GPUs. More information can be found in CVE-2018-6260 and in the Security Bulletin: NVIDIA GPU Display Driver - February 2019.
MIC Management
Slurm can be used to provide resource management for systems with the
Intel® Many Integrated Core (MIC) processor.
Slurm sets an OFFLOAD_DEVICES
environment variable, which controls the
selection of MICs available to a job step.
The OFFLOAD_DEVICES
environment variable is used by both Intel
LEO (Language Extensions for Offload) and the MKL (Math Kernel Library)
automatic offload.
(This is very similar to how the
CUDA_VISIBLE_DEVICES
environment variable is
used to control which GPUs can be used by CUDA™ software.)
If no MICs are reserved via GRES, the
OFFLOAD_DEVICES
variable is set to
-1. This causes the code to ignore the offload directives and run MKL
routines on the CPU. The code will still run but only on the CPU. This
also gives a somewhat cryptic warning:
offload warning: OFFLOAD_DEVICES device number -1 does not correspond to a physical device
The offloading is automatically scaled to all the devices, (e.g. if --gres=mic:2 is defined) then all offloads use two MICs unless explicitly defined in the offload pragmas.
Last modified 17 June 2021