Install GPU drivers

You can install your preferred GPU driver on an FPT Kubernetes Engine cluster with integrated GPU support.

Step 1: Create a GPU Cluster with Driver Installation set to User-Install

When creating the GPU cluster, set the Driver Installation option to User-Install.

Step 2: Install the software required to use the GPU (Driver, Toolkit, Device Plugin, etc.)

Refer to the following NVIDIA resources to choose a GPU driver version:

Release Notes: https://docs.nvidia.com/datacenter/tesla/index.html and https://docs.nvidia.com/datacenter/tesla/drivers/releases.json
Documentation: https://docs.nvidia.com/datacenter/tesla/drivers/index.html
Installer: https://download.nvidia.com/XFree86/Linux-x86_64/

The following DaemonSet can be used as a reference for installing the driver:

# Copyright 2023 FPT Cloud - PaaS
# Scheduled only on nodes labeled worker.fptcloud/type=gpu (see nodeAffinity below)

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fptcloud-gpu-driver-installer
  namespace: kube-system
  labels:
    k8s-app: gpu-driver
spec:
  selector:
    matchLabels:
      k8s-app: gpu-driver
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-driver-installer
        k8s-app: gpu-driver
    spec:
      priorityClassName: system-node-critical
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: worker.fptcloud/type
                operator: In
                values: ["gpu"]
      tolerations:
      - operator: "Exists"
      containers:
        - image: docker.io/alpine:3.13
          name: nvidia-driver-installer
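          command:
            # Enter the mount, UTS, IPC, and network namespaces of host PID 1 (hostPID: true),
            # then run the installer script directly on the node.
            - 'nsenter'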
            - '-t'
            - '1'
            - '-m'
            - '-u'
            - '-i'
            - '-n'
            - '--'
            - 'bash'
            - '-l'
            - '-c'
            - 'curl -Ls https://raw.githubusercontent.com/fci-xplat/fke-config/main/fptcloud-gpu-driver-installer.sh | bash -s -- -p admin'
          resources:
            requests:
              cpu: 150m
          env:
          - name: NVIDIA_DRIVER_VERSION
            value: "535.54.03"
          - name: NVIDIA_TOOLKIT_INSTALL
            value: "true"
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
            allowPrivilegeEscalation: true
      hostPID: true
      hostNetwork: true
      hostIPC: true

The DaemonSet reads the following environment variables:

NVIDIA_DRIVER_VERSION: The driver version to install.
NVIDIA_TOOLKIT_INSTALL: "true" or "false" (default "true"). Whether to install the toolkit automatically.
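
As an illustration of how these variables could be changed on a DaemonSet that has already been applied, the sketch below wraps kubectl set env in a small helper. The function name set_driver_version is hypothetical and not part of the FPT tooling. How the installer script behaves when the version changes on a node that already has a driver is not covered here, so treat that as an assumption and try it on a test worker first.

```shell
# Hypothetical helper: update NVIDIA_DRIVER_VERSION on the installer DaemonSet.
set_driver_version() {
  version="$1"
  # Reject empty values and anything other than digits and dots (e.g. 535.54.03).
  case "$version" in
    ''|*[!0-9.]*)
      echo "invalid driver version: $version" >&2
      return 1
      ;;
  esac
  # Changing the pod template triggers a rolling update of the DaemonSet pods.
  kubectl set env daemonset/fptcloud-gpu-driver-installer \
    -n kube-system "NVIDIA_DRIVER_VERSION=$version"
}
```

For example, set_driver_version 535.54.03 keeps the version used in the manifest above.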

To apply the DaemonSet to the Kubernetes cluster, run the following command:

kubectl apply -f https://raw.githubusercontent.com/fci-xplat/fke-config/main/fptcloud-gpu-driver-installer.yaml

Check the status of the DaemonSet's Pods:

kubectl get pod -n kube-system | grep "gpu-driver"

NAME                                                 READY   STATUS    RESTARTS        AGE
fptcloud-gpu-driver-installer-7tj55                  1/1     Running   0               2d17h

The fptcloud-gpu-driver-installer DaemonSet schedules one pod on every worker in the Worker Group that carries the label worker.fptcloud/type: gpu, and each pod installs the Driver/Toolkit on its node.
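
Because the installer pods carry the label k8s-app: gpu-driver, a label selector can be used instead of grep. The sketch below, with hypothetical function names, lists the installer pods together with the node each one runs on, and lists the GPU workers the DaemonSet targets, so the two outputs can be compared.

```shell
# Hypothetical helpers built on the labels used in the manifest above.

# List installer pods by label; -o wide adds the NODE column.
list_installer_pods() {
  kubectl get pods -n kube-system -l k8s-app=gpu-driver -o wide
}

# List the worker nodes matched by the DaemonSet's node affinity.
list_gpu_nodes() {
  kubectl get nodes -l worker.fptcloud/type=gpu
}
```

Each node returned by list_gpu_nodes should have one installer pod in the output of list_installer_pods.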

Check the logs of an installer pod (fptcloud-gpu-driver-installer-7tj55 in this example) to see whether the installation has finished.

kubectl logs fptcloud-gpu-driver-installer-7tj55 -n kube-system

The installation usually takes a few minutes. If it succeeds, the logs contain the following lines:

Verifying Nvidia installation... DONE. 
Clean Nvidia installation... DONE.
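
On clusters with several GPU workers, the same log check can be repeated for every installer pod. The following is a minimal sketch: the function names are hypothetical, and it assumes the "Verifying Nvidia installation... DONE" line shown above is a reliable marker of a finished installation.

```shell
# Hypothetical helpers: scan installer pod logs for the success marker.

# Succeeds if the log text on stdin contains the marker line.
install_finished() {
  grep -qF 'Verifying Nvidia installation... DONE'
}

# Print one status line per installer pod, selected by label.
check_all_installers() {
  for pod in $(kubectl get pods -n kube-system -l k8s-app=gpu-driver -o name); do
    if kubectl logs -n kube-system "$pod" | install_finished; then
      echo "$pod: finished"
    else
      echo "$pod: not finished yet"
    fi
  done
}
```

Running check_all_installers prints one line per pod, using the pod/NAME form returned by kubectl get -o name.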

