Slurm on Managed GPU Cluster

The Managed GPU Cluster is built on the open-source Kubernetes (K8s) platform, which automates the deployment, scaling, and management of containerized applications. It fully integrates components such as Container Orchestration, Storage, Networking, Security, and PaaS, providing customers with an optimal environment for developing and deploying applications on the Cloud.

FPT Cloud manages all control-plane components, while users deploy and manage the Worker Nodes. This allows users to focus on deploying applications without spending resources on managing the Kubernetes Cluster.

In addition to the fully integrated Container Orchestration, Storage, Networking, Security, and PaaS components, the Managed GPU Cluster also provides GPU resources to support complex computing tasks.

Things to note before using the Managed GPU Cluster

  • Cluster location: The geographic region can affect access speed during usage. You should select the Region closest to the traffic source to optimize performance.

  • Number of Nodes and their configurations: Every account is assigned quotas for resources such as RAM, GPU, CPU, Storage, and IPs. Customers should therefore determine the amount of resources needed and the maximum limits required so FPT Cloud can provide the best support.

Overview

Slurm is a powerful open-source platform used for cluster resource management and job scheduling. It is designed to optimize performance and efficiency for supercomputers and large computer clusters. Its core components work together to ensure high performance and flexibility. The diagram below illustrates how Slurm operates.

  • slurmctld: The controller daemon of Slurm. Considered the “brain” of the system, it monitors cluster resources, schedules jobs, and manages cluster states. For higher reliability, a secondary slurmctld can be configured to avoid service interruption if the primary controller fails, ensuring high availability.

  • slurmd: The node daemon of Slurm. Deployed on every compute node, it receives commands from slurmctld and manages job execution, including job launching, reporting job status, and preparing for upcoming jobs. slurmd communicates directly with compute resources and forms the basis of job scheduling.

  • slurmdbd: The database daemon of Slurm. Although optional, it is crucial for long-term management and auditing in large clusters, as it maintains a centralized database for job history and accounting information. It can aggregate data from multiple Slurm-managed clusters, simplifying and improving data management efficiency.

Slurm CLI: provides commands to manage jobs and monitor the system; a short example session follows the list:

  • scontrol: manage the cluster and control cluster configurations

  • squeue: query job status in the queue

  • srun: submit and manage jobs

  • sbatch: submit batch jobs for scheduled and resource-managed execution

  • sinfo: query overall cluster status, including node availability
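For instance, a short example session (the job script name and job ID are placeholders):

sinfo                                  # overall cluster and node availability
squeue -u $USER                        # your queued and running jobs
srun --nodes=1 --ntasks=1 hostname     # run a quick one-task test job
sbatch job.sh                          # submit a batch script for scheduled execution
scontrol show job <job_id>             # inspect a specific job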

Why Slurm on K8s?

Both Slurm and Kubernetes can serve as workload management systems for distributed model training and HPC (High-Performance Computing) in general.

Each system has its own strengths and weaknesses, with significant trade-offs. Slurm provides advanced scheduling, efficiency, fine-grained hardware control, and accounting capabilities, but lacks general-purpose flexibility. Conversely, Kubernetes can be used for many workloads beyond training (e.g., inferencing) and offers excellent auto-scaling and self-healing.

Unfortunately, there is currently no straightforward way to combine the benefits of both systems. And because many large tech companies use Kubernetes as the default infrastructure layer without support for specialized training systems, some ML engineers simply do not have a choice.

Using Slurm on Kubernetes allows us to leverage Kubernetes’ auto-scaling and self-healing capabilities within Slurm, while introducing unique features—all while retaining the familiar interaction model of the Slurm ecosystem.

Slurm on Managed GPU Cluster

The Slurm Operator uses the custom resource SlurmCluster (CR) to define the configuration files required for managing Slurm clusters and solving issues related to control-plane management. This helps simplify the deployment and maintenance of clusters managed by Slurm. The figure below illustrates the architecture of Slurm on the FPTCloud Managed GPU/K8s cluster. A cluster administrator can deploy and manage a Slurm cluster through SlurmCluster. The Slurm Operator will create the control components of Slurm inside the cluster based on SlurmCluster. A Slurm configuration file may be mounted into the control component through a shared volume or a ConfigMap.

In the Slurm-on-K8s deployment model, the components of a Slurm cluster, such as login nodes and worker nodes, are represented as pods on Kubernetes. This model also implements the concept of a shared-root volume: a shared filesystem that is equivalent to the filesystem of an OS. Every job, after being sent to a worker node, is executed inside this shared-root environment.

This ensures that every worker node always has identical configurations, packages, and state without manual management. In other words, when you install packages on one node, those packages automatically appear on the remaining nodes.

All you need to do is define your desired Slurm cluster in the SlurmCluster custom resource. The Slurm Operator will then deploy and manage the Slurm cluster according to the state you defined in the CR. A trimmed outline of the main values is shown below.
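The outline below is illustrative only, using empty placeholders instead of real values; the actual fields and sample values are described in the Parameters in the Slurm cluster section.

clusterName: "fpt-hpc"   # cluster name (do not change)
k8sNodeFilters: []       # GPU vs non-GPU node lists used to place components
volumeSources: []        # PVCs used by the cluster (jail, controller-spool)
slurmNodes:
  accounting: {}         # SlurmDBD + MariaDB
  controller: {}         # slurmctld
  worker: {}             # slurmd pods scheduled on GPU nodes
  login: {}              # SSH entry points
  exporter: {}           # metrics exporter for monitoring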

Deploy a Slurm Cluster on K8s

Requirements

  • The K8s cluster must support dynamic volume provisioning and still have available storage quota.

  • At least one StorageClass must be able to provision ReadWriteMany volumes (a quick check is shown below).
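A minimal sketch for checking this with kubectl (the class name "default" is the one used in the manifests below):

kubectl get storageclass                  # list StorageClasses and their provisioners
kubectl describe storageclass default     # inspect the provisioner backing the "default" class

Whether ReadWriteMany is supported depends on the provisioner (file- or NFS-based backends typically support it; most block-storage backends do not).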

Step-by-Step

Step 1: Install the Slurm Operator, GPU Operator, and Network Operator from the GPU software installation section and wait until all of them reach the ready state.

Step 2: In the K8s cluster, pre-create the PersistentVolumeClaims that store the shared root space and the controller node data.

Pay attention to the following volumes in the Slurm-on-K8s deployment model:

  • jail-pvc: mounted into worker nodes and login nodes, serves as a shared sandbox, the environment where jobs are executed, and also where users operate. The size of this volume must be at least 40Gi to store the filesystem of an OS.

  • controller-spool-pvc: Stores cluster configuration data and is mounted on the controller node.

jail-pvc.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jail-pvc
  namespace: fpt-hpc
spec:
  storageClassName: default
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi

controller-spool-pvc.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: controller-spool-pvc
  namespace: fpt-hpc
spec:
  storageClassName: default
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
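With both manifests saved, a typical way to apply them (assuming the fpt-hpc namespace has not been created yet):

kubectl create namespace fpt-hpc
kubectl apply -f jail-pvc.yaml -f controller-spool-pvc.yaml
kubectl get pvc -n fpt-hpc    # the PVCs should become Bound (or stay Pending until first mounted, depending on the binding mode)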

Notes:

  • These volumes are all mandatory. They must provide ReadWriteMany and the PVC names must remain exactly as above.

  • For convenience in this deployment, we used dynamic volume provisioning on the FPTCloud Managed K8s product.

  • For production environments, we recommend mounting the root volume from a static partition belonging to a file server, to facilitate migration and maintenance of the Slurm cluster.

Step 3: Download the SlurmCluster Helm chart and configure parameters

helm repo add xplat-fke https://registry.fke.fptcloud.com/chartrepo/xplat-fke
helm repo update
helm repo list
helm search repo slurm
helm pull xplat-fke/helm-slurm-cluster --version 1.14.10 --untar=true

Note: Adjust the slurm-cluster chart version to match the version of the Slurm Operator.

To learn more about the parameters of a Slurm cluster, we recommend reading the Parameters in the Slurm cluster section below.

cd helm-slurm-cluster/
vi values.yaml

In the values.yaml file of the downloaded chart, you need to adjust several important fields:

  • slurmNodes.worker.size: number of worker nodes

  • slurmNodes.worker.volumes.spool.volumeClaimTemplateSpec.storageClassName: StorageClass for worker-node state (spool) storage

  • slurmNodes.login.sshRootPublicKeys: list of root user public keys for the login nodes

  • slurmNodes.accounting.mariadbOperator.storage.volumeClaimTemplate.storageClassName: StorageClass for the SlurmDBD database storage
After configuring the cluster as needed, run:

helm install fpt-hpc ./ -n fpt-hpc

Step 4: Wait until all Slurm pods are in the running state

This process takes around 20 minutes when installing a Slurm cluster for the first time on a K8s cluster. It includes 2 phases: phase 1 runs setup jobs, and phase 2 installs Slurm components.
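You can watch the rollout with:

kubectl get pods -n fpt-hpc -w    # wait until every pod reports Running or Completed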

When all components are ready, find the login node’s public IP using:

kubectl get svc -n fpt-hpc | grep login

SSH into the Slurm cluster head node:

ssh root@<IP_login_svc>

If using nodeshell:

chroot /mnt/jail
sudo -i

Run tests:

srun --nodes=2 --gres=gpu:1 nvidia-smi -L
salloc --nodes=1 --ntasks=1 --mem=4G --time=00:20:00 --gres=gpu:1

Run a Sample Job on the Slurm Cluster

After logging in successfully to the Slurm cluster, you can verify its operation by training the minGPT model following the steps below:

Step 1: Clone the pytorch/examples repository

mkdir /shared
cd /shared
git clone https://github.com/pytorch/examples

Step 2: Navigate to the minGPT-ddp folder & install the necessary packages

cd examples/distributed/minGPT-ddp
pip3 install -r requirements.txt
pip3 install numpy

Due to the shared-root mechanism, we only need to run this once; these packages will automatically sync to all remaining worker nodes.

Note: In production environments, we recommend using a conda environment/container to create a training environment instead of installing packages directly into the global environment.
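As an illustration, a minimal conda-based setup might look like this (assuming conda is available inside the jail; the environment name is arbitrary):

conda create -n mingpt python=3.10 -y
conda activate mingpt              # may require running 'conda init' and re-opening the shell first
pip install -r requirements.txt    # run from examples/distributed/minGPT-ddp
pip install numpy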

Step 3: Edit the Slurm script

vi mingpt/slurm/sbatch_run.sh

Note: Adjust the path to the main.py file inside sbatch_run.sh to the actual path in the mingpt folder.
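For orientation, a multi-node DDP launch script generally has the shape sketched below. This is not the verbatim contents of sbatch_run.sh; the node count, GPU count, and the path to main.py are assumptions that you should adjust to your cluster:

#!/bin/bash
#SBATCH --job-name=mingpt-ddp
#SBATCH --nodes=2                  # number of worker nodes to use
#SBATCH --ntasks-per-node=1        # one launcher task per node
#SBATCH --gres=gpu:8               # GPUs requested on each node

# Use the first allocated node as the rendezvous host
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="$head_node:29500" \
  /shared/examples/distributed/minGPT-ddp/mingpt/main.py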

Step 4: Run the sample Slurm job

sbatch mingpt/slurm/sbatch_run.sh

Step 5: Check status

squeue
scontrol show job <<job_id>>
cat <<log.out>>

Parameters in the Slurm cluster

In the deployment guide above, we walked through adjusting the most important parameters. In this section, we go deeper into the parameters/attributes defined for a Slurm cluster; you can also read the comments in the values.yaml file of the SlurmCluster Helm chart downloaded earlier for additional information.

clusterName

Sample value: "fpt-hpc"

The cluster name (note: do not change).

k8sNodeFilters

Sample value: N/A

Divides the K8s cluster nodes into two lists: GPU nodes (to deploy Slurm workers) and non-GPU nodes (to deploy the other components). If the cluster only has GPU nodes, the two lists can be identical.

volumeSources

Sample value:

volumeSources:
  - name: controller-spool
    persistentVolumeClaim:
      claimName: "controller-spool-pvc"
      readOnly: false
  - name: jail
    persistentVolumeClaim:
      claimName: "jail-pvc"
      readOnly: false

Defines the PersistentVolumeClaims used by the containers representing the components (worker, login, controller nodes, …) of the Slurm cluster.

periodicChecks

Sample value: N/A

A periodic job that checks the status of each node; if a node contains a GPU with issues, that node is drained.

slurmNodes

Sample value: N/A

Defines the number and configuration of the component nodes in a Slurm cluster (login nodes, worker nodes, …).

slurmNodes.accounting

Sample value:

enabled: true
mariadbOperator:
  enabled: true
  resources:
    cpu: "1000m"
    memory: "1Gi"
    ephemeralStorage: "5Gi"
  replicas: 1
  replication: {}
  storage:
    ephemeral: false
    volumeClaimTemplate:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
      storageClassName: xplat-nfs

Configuration of the accounting node. Here the mariadb operator is used to create the database; you can also use an external database (read more in values.yaml).

slurmNodes.controller

Sample value:

size: 1
k8sNodeFilterName: "no-gpu"
slurmctld:
  port: 6817
  resources:
    cpu: "1000m"
    memory: "3Gi"
    ephemeralStorage: "20Gi"
munge:
  resources:
    cpu: "1000m"
    memory: "1Gi"
    ephemeralStorage: "5Gi"
volumes:
  spool:
    volumeSourceName: "controller-spool"
  jail:
    volumeSourceName: "jail"

Configuration of the controller node: one controller node, with the two volumes (spool and the jail shared root space) mounted into it.

slurmNodes.worker

Sample value:

size: 8
k8sNodeFilterName: "gpu"
cgroupVersion: v2
slurmd:
  port: 6818
  resources:
    cpu: "110000m"
    memory: "1220Gi"
    ephemeralStorage: "55Gi"
    gpu: 8
    rdma: 1
munge:
  resources:
    cpu: "2000m"
    memory: "4Gi"
    ephemeralStorage: "5Gi"
volumes:
  spool:
    volumeClaimTemplateSpec:
      storageClassName: "xplat-nfs"
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: "120Gi"
  jail:
    volumeSourceName: "jail"

Configuration of the worker nodes: eight worker nodes, each with 8 GPUs, with the jail (shared root space) volume mounted into each node.

slurmNodes.login

Sample value:

size: 2
k8sNodeFilterName: "no-gpu"
sshd:
  port: 22
  resources:
    cpu: "3000m"
    memory: "9Gi"
    ephemeralStorage: "30Gi"
sshRootPublicKeys:
  - "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIHke7B5+kGXx/Dwr76NI5KxfAAEkqcxbh6/8SV7tnpUP [email protected]"
sshdServiceLoadBalancerIP: ""
sshdServiceNodePort: 30022
munge:
  resources:
    cpu: "500m"
    memory: "500Mi"
    ephemeralStorage: "5Gi"
volumes:
  jail:
    volumeSourceName: "jail"

Configuration of the login nodes: two login nodes, exposing the sshd service through a LoadBalancer-type service in K8s, using the public keys defined in sshRootPublicKeys for the root user, and mounting the same jail volume as the controller and worker nodes.

slurmNodes.exporter

Sample value:

enabled: true
size: 1
k8sNodeFilterName: "no-gpu"
exporter:
  resources:
    cpu: "250m"
    memory: "256Mi"
    ephemeralStorage: "500Mi"
munge:
  resources:
    cpu: "1000m"
    memory: "1Gi"
    ephemeralStorage: "5Gi"
volumes:
  jail:
    volumeSourceName: "jail"

Installs the node exporter for monitoring.

For more detailed information, please read the comments in the values.yaml file defining the Slurm cluster configuration.

Common use cases

Add user/login

  • Add users: To add an SSH key for root, you simply need to edit the Slurm cluster CR:

kubectl edit SlurmCluster fpt-hpc -n fpt-hpc

  • In the login node configuration section, navigate to the sshRootPublicKeys attribute and add your desired public key.

  • To add a regular user, you do the same as adding a user on a Linux host:

sudo adduser <<user_name>>

  • Modify login settings

By default, we expose the login node through a public Load Balancer. This may not be suitable for every requirement, so you can change the LB type to private, use the port-forward mechanism to access the Slurm cluster, or customize the Load Balancer according to your needs in our portal.
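For example, if the Load Balancer is made private (or removed), you can still reach the login node through kubectl port-forward; the service name below is a placeholder, so check it first with kubectl get svc -n fpt-hpc:

kubectl port-forward -n fpt-hpc svc/<login_svc_name> 2222:22
ssh -p 2222 root@127.0.0.1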

Scale up/down worker nodes

To change the number of worker nodes (or of any other node type), we simply need to edit the Slurm cluster CR:

kubectl edit SlurmCluster fpt-hpc -n fpt-hpc

In the worker nodes configuration section, navigate to the “size” field and edit the number of worker nodes as desired.

Notes:

  • When scaling up the number of worker nodes, the new node will be automatically added to the list of worker nodes in the Slurm controller node and be ready to run jobs.

  • When scaling down nodes, you need to manually delete the node on the Slurm controller using the command:

scontrol delete NodeName=<node_name>

The node list in a cluster will always be: worker-[0, (size - 1)].
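After the change has been reconciled, you can verify the worker list from a login node:

sinfo -N -l           # one line per node, including its state
scontrol show nodes   # detailed information for every node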

Migrate Slurm cluster to another K8s cluster

Thanks to the flexibility of K8s and network file storage, we can easily move a Slurm cluster from one K8s cluster to another. All that needs to be done is to recreate and mount the jail-pvc on the new K8s cluster and then repeat the steps for creating the Slurm cluster.

Mounting external volumes into the Slurm cluster

To mount a volume into the Slurm cluster, you need to create that volume first, then deploy it as a PV & PVC on K8s. The following example uses dynamic provisioning to create this PV/PVC (in production environments, we recommend using static provisioning volumes to ensure data safety).

Step 1: Create PVC

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jail-submount-mlperf-sd-pvc
  namespace: fpt-hpc
spec:
  storageClassName: default
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi

Step 2: Declare this volume in the Slurm cluster

Edit the volumeSource field in the Slurm cluster CR:

kubectl edit SlurmCluster fpt-hpc -n fpt-hpc
volumeSources:
  - name: controller-spool
    persistentVolumeClaim:
      claimName: "controller-spool-pvc"
      readOnly: false
  - name: jail
    persistentVolumeClaim:
      claimName: "jail-pvc"
      readOnly: false
  - name: mlperf-sd
    persistentVolumeClaim:
      claimName: "jail-submount-mlperf-sd-pvc"
      readOnly: false

Step 3: Mount these volumes into the login and worker nodes in the Slurm cluster

In the login node:

volumes:
  jail:
    volumeSourceName: "jail"
  jailSubMounts:
    - name: "mlcommons-sd-bench-data"
      mountPath: "/mnt/data-hps"
      volumeSourceName: "mlperf-sd"

In the worker node:

volumes:
  spool:
    volumeClaimTemplateSpec:
      storageClassName: "xplat-nfs"
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: "120Gi"
  jail:
    volumeSourceName: "jail"
  jailSubMounts:
    - name: "mlcommons-sd-bench-data"
      mountPath: "/mnt/data-hps"
      volumeSourceName: "mlperf-sd"

Note: The mount path must be the same across all worker nodes and login nodes.
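After the CR change is applied and the pods are recreated, you can confirm the submount from a login or worker node:

df -h /mnt/data-hps   # the external volume should appear at the configured mount path
ls /mnt/data-hps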
