GPU Sharing

GPU sharing modes allow physical GPUs to be shared by multiple containers to optimize GPU utilization. The following GPU sharing strategies are supported:

  • Multi-instance GPU

  • GPU time-sharing

  • NVIDIA MPS

The three strategies compare as follows.

General

  • Multi-instance GPU: The GPU is divided into isolated partitions that are shared among multiple containers.

  • GPU time-sharing: Each container uses the GPU in a time slice.

  • NVIDIA MPS: Containers use the GPU in parallel.

Isolation

  • Multi-instance GPU: A GPU can be divided into up to seven instances, each with its own dedicated compute, memory, and bandwidth. Each partition is fully isolated from the others.

  • GPU time-sharing: Each container accesses the full capacity of the underlying physical GPU through context switching between the processes running on it. However, time-sharing does not enforce memory limits between shared jobs, and the rapid context switching needed for shared access can introduce overhead.

  • NVIDIA MPS: MPS provides only limited resource isolation, but gains flexibility in other dimensions, such as the supported GPU types and the maximum number of shared units, which simplifies resource allocation.

Suitable for these workloads

  • Multi-instance GPU: Recommended for workloads that run in parallel and require resiliency and QoS. For example, when running AI inference workloads, multi-instance GPU allows multiple inference queries to run simultaneously for quick responses without slowing each other down.

  • GPU time-sharing: Recommended for bursty and interactive workloads with idle periods, which are not cost-effective on a fully dedicated GPU. With time-sharing, workloads get quick access to the GPU during active phases. GPU time-sharing is optimal when you want to avoid idling costly GPUs and full isolation or continuous GPU access is not necessary, for example when multiple users test or prototype workloads. Workloads that use time-sharing must tolerate some performance and latency trade-offs.

  • NVIDIA MPS: Recommended for batch processing of small jobs, because MPS maximizes throughput and concurrent GPU utilization. MPS lets small to medium-sized batch jobs process efficiently in parallel and is optimal for cooperative processes acting as a single application, for example MPI jobs with inter-rank parallelism. In such jobs, each small CUDA process (typically an MPI rank) runs concurrently on the GPU to fully saturate it. Workloads that use CUDA MPS must tolerate its memory protection and error containment limitations.

Multi-Instance GPU (MIG)

Multi-Instance GPU is a feature that allows a GPU to be divided into up to seven separate partitions, called MIG instances, which are completely isolated from each other in terms of compute power, bandwidth, and memory.

FPT supports the following MIG profiles:

No.   GPU H100 SXM5 profile   Strategy   Number of instances   Instance resource
1     all-1g.10gb             single     7                     1g.10gb
2     all-1g.20gb             single     4                     1g.20gb
3     all-2g.20gb             single     3                     2g.20gb
4     all-3g.40gb             single     2                     3g.40gb
5     all-4g.40gb             single     1                     4g.40gb
6     all-7g.80gb             single     1                     7g.80gb
7     all-balanced            mixed      2 / 1 / 1             1g.10gb / 2g.20gb / 3g.40gb
8     none (no label)         none       0                     0 (entire GPU)

No.   GPU H200 SXM5 profile   Strategy   Number of instances   Instance resource
1     all-1g.18gb             single     7                     1g.18gb
2     all-1g.35gb             single     4                     1g.35gb
3     all-2g.35gb             single     3                     2g.35gb
4     all-3g.71gb             single     2                     3g.71gb
5     all-4g.71gb             single     1                     4g.71gb
6     all-7g.141gb            single     1                     7g.141gb
7     all-balanced            mixed      2 / 1 / 1             1g.18gb / 2g.35gb / 3g.71gb
8     none (no label)         none       0                     0 (entire GPU)

No.   GPU A100 profile        Strategy   Number of instances   Instance resource
1     all-1g.10gb             single     7                     1g.10gb
2     all-1g.20gb             single     4                     1g.20gb
3     all-2g.20gb             single     3                     2g.20gb
4     all-3g.40gb             single     2                     3g.40gb
5     all-4g.40gb             single     1                     4g.40gb
6     all-balanced            mixed      2 / 1 / 1             1g.10gb / 2g.20gb / 3g.40gb
7     none with operator      none       0                     0 (entire GPU)
8     none                    none       0                     0

No.   GPU A30 profile         Strategy   Number of instances   Instance resource
1     all-1g.6gb              single     4                     1g.6gb
2     all-2g.12gb             single     2                     2g.12gb
3     all-4g.24gb             single     1                     4g.24gb
4     all-balanced            mixed      2 / 1                 1g.6gb / 2g.12gb

Example: if you select the single-strategy profile all-1g.6gb, each A30 GPU card on the worker is divided into 4 MIG devices, each with compute resources equal to ¼ of the physical GPU and 6 GB of GPU memory.
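Running nvidia-smi -L on such a node would then show roughly the following (a sketch with UUIDs elided; the exact formatting depends on the driver version):

  GPU 0: NVIDIA A30 (UUID: GPU-...)
    MIG 1g.6gb Device 0: (UUID: MIG-...)
    MIG 1g.6gb Device 1: (UUID: MIG-...)
    MIG 1g.6gb Device 2: (UUID: MIG-...)
    MIG 1g.6gb Device 3: (UUID: MIG-...)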

Notes

  • The MIG configuration applies to all cards installed on the worker.

  • The MIG strategy on worker groups within the same cluster must be the same type (single/mixed/none).

  • For the "none with Operator" strategy, the pod can use 1 GPU device containing the resources of the entire GPU.

  • For the "none" strategy, the GPU is already connected to the machine, and users can deploy the GPU Operator or GPU device plugin according to their desired configuration. Users are advised to have a solid understanding of GPU-Sharing basics before implementing this strategy!

MIG configuration

When creating a GPU worker group, you can select MIG sharing mode profiles on the interface, and our GPU Kubernetes service will configure them for you:

Notes

  • If you select profiles of the "MIG single" type, your subsequent worker groups can only choose sharing modes belonging to profiles of the "MIG single" type. The same applies to the "MIG mixed", "None", and "None with Operator" profiles.

  • The "None" sharing mode corresponds to us leaving full control of the Kubernetes GPU cluster to you. You can manually install the GPU Operator or Nvidia device plugin to run sharing modes as needed.

  • The "None with operator" sharing mode corresponds to us managing the GPU Operator for you. However, one GPU can only be assigned to a maximum of one container at a time.

  • Verify MIG: after our portal system reports that the cluster was created successfully, you can check the GPU resources of a GPU node using the command:

kubectl describe nodes

Output:
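For a node whose GPUs use the A30 all-1g.6gb profile from the example above, the relevant part of the node description would look roughly like this (an illustrative sketch, not verbatim output; the count depends on your GPU model and the selected profile):

  Capacity:
    nvidia.com/gpu:  4
  Allocatable:
    nvidia.com/gpu:  4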

At this point, you can request up to 4 nvidia.com/gpu resources for your pod, with each nvidia.com/gpu resource corresponding to ¼ of the original physical GPU's computing power and memory.

If your node uses 2 GPUs, 8 nvidia.com/gpu resources will be displayed.
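For example, a pod could request one of these MIG-backed devices as follows (a minimal sketch; the pod name and image are placeholders):

  apiVersion: v1
  kind: Pod
  metadata:
    name: mig-demo                 # placeholder name
  spec:
    restartPolicy: Never
    containers:
      - name: cuda-container
        image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # placeholder image
        command: ["nvidia-smi", "-L"]
        resources:
          limits:
            nvidia.com/gpu: 1      # one MIG instance (e.g. ¼ of the GPU in the all-1g.6gb example)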

Additionally, you can combine MIG with other GPU sharing strategies such as time slicing (already supported) and MPS (not yet supported) to maximize GPU utilization.

Multi Process Service (MPS)

  • MPS is a feature in NVIDIA GPUs that allows multiple containers to share the same physical GPU.

  • MPS has an advantage over MIG in terms of GPU resource allocation, with up to 48 containers able to use the GPU simultaneously.

  • MPS is based on NVIDIA's Multi-Process Service feature of CUDA, allowing multiple CUDA applications to run simultaneously on a single GPU.

  • With MPS, users can predefine the number of replicas for a GPU. This value indicates the maximum number of containers that can access and use a GPU.

  • Additionally, you can limit GPU resources for each container by setting the following environment variables in the container:

CUDA_MPS_ACTIVE_THREAD_PERCENTAGE
CUDA_MPS_PINNED_DEVICE_MEM_LIMIT
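For example, the following pod sketch caps its MPS client to roughly a quarter of the GPU's SMs and 2 GB of device memory (the values, pod name, and image are illustrative placeholders; note the hostIPC setting that MPS requires, described in the notes further below):

  apiVersion: v1
  kind: Pod
  metadata:
    name: mps-demo                 # placeholder name
  spec:
    hostIPC: true                  # required for MPS sharing (see the notes below)
    containers:
      - name: cuda-container
        image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # placeholder image
        env:
          - name: CUDA_MPS_ACTIVE_THREAD_PERCENTAGE
            value: "25"            # use at most ~25% of the GPU's SMs
          - name: CUDA_MPS_PINNED_DEVICE_MEM_LIMIT
            value: "0=2G"          # cap pinned device memory on GPU 0 to 2 GB
        resources:
          limits:
            nvidia.com/gpu: 1      # must be exactly 1 under MPS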

MPS configuration

You can configure your GPU worker group to use MPS during worker group initialization, as illustrated below:

With this configuration, the GPU will be "split" into 48 parts, each with 1/48th of the original physical GPU's computing power and memory.

  • Verify MPS: You can check the MPS configuration on your GPU node using the command:

kubectl describe nodes $NODE_NAME

Output:
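On a node with a single physical GPU and 48 MPS replicas, the relevant part of the node description would look roughly like this (an illustrative sketch, not verbatim output):

  Capacity:
    nvidia.com/gpu:  48
  Allocatable:
    nvidia.com/gpu:  48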

At this time, you can request up to 48 nvidia.com/gpu resources for your pods, with each nvidia.com/gpu resource corresponding to 1/48th of the compute and memory capacity of the original physical GPU.

  • If your node uses 2 GPUs, 96 nvidia.com/gpu resources will be displayed.

Notes

  • The nvidia.com/gpu resource requested by a container must be 1.

  • The maximum number of clients is 48 and the minimum is 2; the physical GPU's resources are divided evenly among the configured number of clients.

  • Each container should run a single process so that MPS sharing does not generate errors.

  • The workload deployment manifest must set "hostIPC: true" (as in the example pod above).

  • MPS has limitations regarding error containment and workload isolation; please research and consider these before using it.

Time Slicing

  • Time slicing is a basic GPU sharing mechanism in which each process/container uses the GPU for an equal slice of time.

  • Time slicing implements GPU sharing through context switching on the GPU: each process/container saves its context when the GPU is handed over to another process.

  • Unlike MPS, time slicing does not let containers use the GPU in parallel.

Time slicing configuration

Time slicing is a native GPU sharing feature that can be enabled across all MIG sharing modes (except MIG-mixed profiles) and the "None with Operator" mode.

When creating a GPU worker group, you can choose to combine time slicing with MIG, or to use time slicing on a GPU without MIG enabled. We will configure this for you:

  • Verify Time Slicing: You can check the timeslicing configuration on your GPU node using the command:

kubectl describe nodes $NODE_NAME

Output:
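For a single physical GPU without MIG and 48 time slicing replicas, the relevant part of the node description would look roughly like this (an illustrative sketch, not verbatim output):

  Capacity:
    nvidia.com/gpu:  48
  Allocatable:
    nvidia.com/gpu:  48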

At this time, you can request up to 48 nvidia.com/gpu resources for your pods. However, unlike MPS, each pod is not limited in the amount of resources it can consume, which can lead to GPU out-of-memory errors.

If you use MIG mode, the number of nvidia.com/gpu resources equals the number of MIG instances * the maximum number of Time Slicing clients you define. For example: if you use MIG mode 2x2g.12gb and the number of timeslicing clients is 48, 96 nvidia.com/gpu resources will be displayed.

Notes

  1. The nvidia.com/gpu resource for a container request can be equal to or greater than 1. However, requesting more than 1 nvidia.com/gpu resource does not grant your container access to more resources.

  2. When you use timeslicing, containers are not limited in their use of compute and memory resources.

  3. The maximum number of clients is 48, and the minimum is 2.

  4. A container runs one process.

  5. Clearly define the amount of GPU memory your containers need, to avoid OOM errors interrupting GPU operation (see the sketch after these notes).
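As an illustration of notes 1 and 5, a pod requesting a time-sliced GPU looks like a normal GPU request; the difference is that the slice is not isolated, so the application must bound its own GPU memory usage (the pod name and image are placeholders):

  apiVersion: v1
  kind: Pod
  metadata:
    name: timeslicing-demo         # placeholder name
  spec:
    restartPolicy: Never
    containers:
      - name: cuda-container
        image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # placeholder image
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1      # one time slice; requesting more grants no extra resources
        # Time slicing does not isolate GPU memory: configure your application's own
        # GPU memory limits to avoid OOM errors that interrupt other containers sharing the GPU.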
