Slurm on Managed GPU Cluster
The Managed GPU Cluster is built on the open-source Kubernetes (K8s) platform and automates the deployment, scaling, and management of containerized applications. It fully integrates components such as Container Orchestration, Storage, Networking, Security, and PaaS, and also provides GPU resources to support complex computing tasks, giving customers an optimal environment for developing and deploying applications on the Cloud.
FPT Cloud manages all control-plane components, while users deploy and manage the Worker Nodes. This allows users to focus on deploying applications without spending resources on managing the Kubernetes Cluster.
Things to note before using the Managed GPU Cluster
Cluster location: The geographic region can affect access speed during usage. You should select the Region closest to the traffic source to optimize performance.
Number of Nodes and their configurations: Every account is assigned quotas for resources such as RAM, GPU, CPU, Storage, and IPs. Customers should therefore determine the amount of resources they need and the maximum limits required so FPT Cloud can provide the best support.
Overview
Slurm is a powerful open-source platform used for cluster resource management and job scheduling. It is designed to optimize performance and efficiency for supercomputers and large computer clusters. Its core components work together to ensure high performance and flexibility. The diagram below illustrates how Slurm operates.

slurmctld: The controller daemon of Slurm. Considered the “brain” of the system, it monitors cluster resources, schedules jobs, and manages cluster states. For higher reliability, a secondary slurmctld can be configured to avoid service interruption if the primary controller fails, ensuring high availability.
slurmd: The node daemon of Slurm. Deployed on every compute node, it receives commands from slurmctld and manages job execution, including job launching, reporting job status, and preparing for upcoming jobs. slurmd communicates directly with compute resources and forms the basis of job scheduling.
slurmdbd: The database daemon of Slurm. Although optional, it is crucial for long-term management and auditing in large clusters, as it maintains a centralized database for job history and accounting information. It can aggregate data from multiple Slurm-managed clusters, simplifying and improving data management efficiency.
Slurm CLI: provides commands to manage jobs and monitor the system:
scontrol: manage the cluster and control cluster configurations
squeue: query job status in the queue
srun: submit and manage jobs
sbatch: submit batch jobs for scheduled and resource-managed execution
sinfo: query overall cluster status, including node availability
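As a quick illustration, a typical workflow with these commands might look like the following (train.sh is a placeholder batch script, not part of this guide):
sinfo -N -l                                # list nodes and their availability
sbatch --nodes=2 --gres=gpu:1 train.sh     # submit a two-node GPU batch job
squeue -u $USER                            # check the state of your jobs in the queue
scontrol show job <job_id>                 # inspect the details of a specific job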
Why Slurm on K8s?
Both Slurm and Kubernetes can serve as workload management systems for distributed model training and HPC (High-Performance Computing) in general.
Each system has its own strengths and weaknesses, with significant trade-offs. Slurm provides advanced scheduling, efficiency, fine-grained hardware control, and accounting capabilities, but lacks general-purpose flexibility. Conversely, Kubernetes can be used for many workloads beyond training (e.g., inferencing) and offers excellent auto-scaling and self-healing.
Unfortunately, there is currently no straightforward way to combine the benefits of both systems. And because many large tech companies use Kubernetes as the default infrastructure layer without support for specialized training systems, some ML engineers simply do not have a choice.
Using Slurm on Kubernetes allows us to leverage Kubernetes’ auto-scaling and self-healing capabilities within Slurm, while introducing unique features—all while retaining the familiar interaction model of the Slurm ecosystem.
Slurm on Managed GPU Cluster
The Slurm Operator uses the SlurmCluster custom resource (CR) to define the configuration required for managing Slurm clusters and to handle control-plane management concerns. This simplifies the deployment and maintenance of clusters managed by Slurm. The figure below illustrates the architecture of Slurm on the FPT Cloud Managed GPU/K8s cluster. A cluster administrator deploys and manages a Slurm cluster through a SlurmCluster resource; based on it, the Slurm Operator creates the Slurm control components inside the cluster. A Slurm configuration file may be mounted into a control component through a shared volume or a ConfigMap.

In the Slurm-on-K8s deployment model, the components of a Slurm cluster, such as login nodes and worker nodes, are represented as pods on Kubernetes. The model also implements the concept of a shared-root volume: in simple terms, a shared filesystem that is equivalent to the filesystem of an OS. Every job, after being sent to a worker node, is executed inside this shared-root environment.
This ensures that every worker node always has identical configurations, packages, and state without manual management. In other words, when you install packages on one node, those packages automatically appear on the remaining nodes.
All you need to do is define your desired Slurm cluster in the SlurmCluster custom resource. The Slurm Operator then deploys and manages the Slurm cluster for you according to the state you defined in the SlurmCluster CR.
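For example, once the Slurm Operator has reconciled the CR, you can inspect the declared and actual state with standard kubectl commands (the fpt-hpc name and namespace match the deployment described below):
kubectl get SlurmCluster -n fpt-hpc                  # list the SlurmCluster resources defined in the namespace
kubectl describe SlurmCluster fpt-hpc -n fpt-hpc     # show the full spec and current status of the cluster definition
kubectl get pods -n fpt-hpc                          # see the pods the operator created for each Slurm component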
Deploy a Slurm Cluster on K8s
Requirements
The K8s cluster must support dynamic volume provisioning and still have available storage quota.
At least one StorageClass must be able to provide ReadWriteMany volumes.
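You can check which StorageClasses are available before proceeding, for example:
kubectl get storageclass    # look for a class backed by a provisioner that supports ReadWriteMany (e.g. NFS or CSI file storage)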
Step-by-Step
Step 1: Install the Slurm Operator, GPU Operator, and Network Operator from the GPU software installation section, then wait until all of them reach the Ready state.

Step 2: In the K8s cluster, pre-create the PersistentVolumeClaims that store the shared root space and controller node data.
Pay attention to the volumes in the Slurm-on-K8s deployment model:

Where:
jail-pvc: Mounted into worker nodes and login nodes; serves as a shared sandbox, the environment where jobs are executed and where users operate. The size of this volume must be at least 40Gi to store the filesystem of an OS.
controller-spool-pvc: Stores cluster configuration data and is mounted on the controller node.
jail-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jail-pvc
  namespace: fpt-hpc
spec:
  storageClassName: default
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi

controller-spool-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: controller-spool-pvc
  namespace: fpt-hpc
spec:
  storageClassName: default
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi

Notes:
These volumes are all mandatory. They must provide ReadWriteMany and the PVC names must remain exactly as above.
For convenience of deployment, we use dynamically provisioned volumes on the FPT Cloud Managed K8s product.
For production environments, we recommend mounting the root volume from a static partition belonging to a file server, to facilitate migration and maintenance of the Slurm cluster.
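Assuming the two manifests above are saved as jail-pvc.yaml and controller-spool-pvc.yaml, they can be created and verified as follows (skip the namespace command if fpt-hpc already exists):
kubectl create namespace fpt-hpc
kubectl apply -f jail-pvc.yaml -f controller-spool-pvc.yaml
kubectl get pvc -n fpt-hpc    # with dynamic provisioning, both PVCs should reach the Bound state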
Step 3: Download the SlurmCluster Helm chart and configure parameters
helm repo add xplat-fke https://registry.fke.fptcloud.com/chartrepo/xplat-fke
helm repo update
helm repo list
helm search repo slurm
helm pull xplat-fke/helm-slurm-cluster --version 1.14.10 --untar=true
Note: Adjust the slurm-cluster version to match the version of Slurm Operator.
To learn more about the parameters of a Slurm cluster, see the Parameters in the Slurm cluster section below.
cd helm-slurm-cluster/
vi values.yaml
In the values.yaml file of the downloaded folder, you need to adjust several important fields such as:
slurmNodes.worker.size: Number of worker nodes
slurmNodes.worker.volumes.spool.volumeClaimTemplateSpec.storageClassName: StorageClass for worker-node state storage
slurmNodes.login.sshRootPublicKeys: List of root user public keys for login nodes
slurmNodes.accounting.mariadbOperator.storage.volumeClaimTemplate.storageClassName: StorageClass for SlurmDBD database storage
After configuring the cluster as needed, run:
helm install fpt-hpc ./ -n fpt-hpc
Step 4: Wait until all Slurm pods are in the Running state
This process takes around 20 minutes when installing a Slurm cluster for the first time on a K8s cluster. It includes 2 phases: phase 1 runs setup jobs, and phase 2 installs Slurm components.
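You can follow the progress with standard kubectl commands, for example:
kubectl get jobs -n fpt-hpc          # the phase-1 setup jobs should complete first
kubectl get pods -n fpt-hpc -w       # watch until all Slurm pods are Running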
When all components are ready, find the login node’s public IP using:
kubectl get svc -n fpt-hpc | grep login
SSH into the Slurm cluster head node:
ssh root@<IP_login_svc>
If using nodeshell:
chroot /mnt/jail
sudo -i
Run tests:
srun --nodes=2 --gres=gpu:1 nvidia-smi -L
salloc --nodes=1 --ntasks=1 --mem=4G --time=00:20:00 --gres=gpu:1
Run a Sample Job on the Slurm Cluster
After logging in successfully to the Slurm cluster, you can verify its operation by training the minGPT model following the steps below:
Step 1: Clone the pytorch/examples repository
mkdir /shared
cd /shared
git clone https://github.com/pytorch/examples
Step 2: Navigate to the minGPT-ddp folder & install the necessary packages
cd examples/distributed/minGPT-ddp
pip3 install -r requirements.txt
pip3 install numpy
Due to the shared-root mechanism, we only need to run this once; these packages will automatically sync to all remaining worker nodes.
Note: In production environments, we recommend using a conda environment/container to create a training environment instead of installing packages directly into the global environment.
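A minimal sketch of that approach, assuming conda is already available inside the jail environment (the environment name mingpt is arbitrary):
conda create -n mingpt python=3.10 -y    # isolated environment instead of the global one
conda activate mingpt
pip install -r requirements.txt numpy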
Step 3: Edit the Slurm script
vi mingpt/slurm/sbatch_run.sh
Note: Adjust the path to the main.py file inside sbatch_run.sh to the actual path in the mingpt folder.
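Since the repository was cloned under /shared, you can confirm the absolute path to reference in sbatch_run.sh, for example:
find /shared/examples/distributed/minGPT-ddp -name main.py    # prints the absolute path to use inside the script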
Step 4: Run the sample Slurm job
sbatch mingpt/slurm/sbatch_run.sh
Step 5: Check status
squeue
scontrol show job <<job_id>>
cat <<log.out>>
Parameters in the Slurm cluster
In the deployment guide above, we showed how to adjust the most important parameters. In this section, we go deeper into the parameters/attributes defined for a Slurm cluster; you can also read the comments in the values.yaml file of the Slurm cluster Helm chart downloaded earlier for additional information.
Attribute: clusterName
Sample value: "fpt-hpc"
Description: Cluster name (note: do not change).

Attribute: k8sNodeFilters
Sample value: N/A
Description: Divides the K8s cluster into two lists: GPU nodes (to deploy Slurm workers) and non-GPU nodes to deploy the other components. If the cluster only has GPU nodes, these two lists can be the same.

Attribute: volumeSources
Sample value:
volumeSources:
  - name: controller-spool
    persistentVolumeClaim:
      claimName: "controller-spool-pvc"
      readOnly: false
  - name: jail
    persistentVolumeClaim:
      claimName: "jail-pvc"
      readOnly: false
Description: Defines the PersistentVolumeClaims used by the containers representing the components (worker, login, controller nodes, …) of the Slurm cluster.

Attribute: periodicChecks
Sample value: N/A
Description: A periodic job that checks the status of a node; if the node contains a GPU with issues, it drains that node.

Attribute: slurmNodes
Sample value: N/A
Description: Defines the number and configuration of component nodes in a Slurm cluster (login node, worker node, …).
Attribute: slurmNodes.accounting
Sample value:
enabled: true
mariadbOperator:
  enabled: true
  resources:
    cpu: "1000m"
    memory: "1Gi"
    ephemeralStorage: "5Gi"
  replicas: 1
  replication: {}
  storage:
    ephemeral: false
    volumeClaimTemplate:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
      storageClassName: xplat-nfs
Description: Configuration of the accounting node. Here we use the mariadb operator to create the database; you can also use an external database (read more in values.yaml).
Attribute: slurmNodes.controller
Sample value:
size: 1
k8sNodeFilterName: "no-gpu"
slurmctld:
  port: 6817
  resources:
    cpu: "1000m"
    memory: "3Gi"
    ephemeralStorage: "20Gi"
munge:
  resources:
    cpu: "1000m"
    memory: "1Gi"
    ephemeralStorage: "5Gi"
volumes:
  spool:
    volumeSourceName: "controller-spool"
  jail:
    volumeSourceName: "jail"
Description: Configuration of the controller node, with 1 controller node + mounting 2 volumes (spool & jail shared root space) into this node.
Attribute: slurmNodes.worker
Sample value:
size: 8
k8sNodeFilterName: "gpu"
cgroupVersion: v2
slurmd:
  port: 6818
  resources:
    cpu: "110000m"
    memory: "1220Gi"
    ephemeralStorage: "55Gi"
    gpu: 8
    rdma: 1
munge:
  resources:
    cpu: "2000m"
    memory: "4Gi"
    ephemeralStorage: "5Gi"
volumes:
  spool:
    volumeClaimTemplateSpec:
      storageClassName: "xplat-nfs"
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: "120Gi"
  jail:
    volumeSourceName: "jail"
Description: Configuration of worker nodes, with 8 worker nodes, each node has 8 GPUs + mounts the jail (shared root space) volume into this node.
Attribute: slurmNodes.login
Sample value:
login:
  size: 2
  k8sNodeFilterName: "no-gpu"
  sshd:
    port: 22
    resources:
      cpu: "3000m"
      memory: "9Gi"
      ephemeralStorage: "30Gi"
  sshRootPublicKeys:
    - "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIHke7B5+kGXx/Dwr76NI5KxfAAEkqcxbh6/8SV7tnpUP [email protected]"
  sshdServiceLoadBalancerIP: ""
  sshdServiceNodePort: 30022
  munge:
    resources:
      cpu: "500m"
      memory: "500Mi"
      ephemeralStorage: "5Gi"
  volumes:
    jail:
      volumeSourceName: "jail"
Description: Configuration of login nodes, with 2 login nodes, exposing the sshd service using the load-balancer service type in K8s, using the public key defined in sshRootPublicKeys for the root user, and mounting the same volumes as the controller node & worker node.
Attribute: slurmNodes.exporter
Sample value:
exporter:
  enabled: true
  size: 1
  k8sNodeFilterName: "no-gpu"
  exporter:
    resources:
      cpu: "250m"
      memory: "256Mi"
      ephemeralStorage: "500Mi"
  munge:
    resources:
      cpu: "1000m"
      memory: "1Gi"
      ephemeralStorage: "5Gi"
  volumes:
    jail:
      volumeSourceName: "jail"
Description: Install the node exporter for monitoring.
For more detailed information, please read the comments in the values.yaml file defining the Slurm cluster configuration.
Common use cases
Add user/login
Add users: To add an SSH key for root, you simply need to edit the Slurm cluster CR:
kubectl edit SlurmCluster fpt-hpc -n fpt-hpc
In the login node configuration section, navigate to the sshRootPublicKeys attribute and add your desired public key.
To add a regular user, you do the same as adding a user on a Linux host:
sudo adduser <<user_name>>
Modify login settings
By default, we expose the login node through a public Load Balancer. This may not be suitable for every requirement, so you can change the LB type to private, use the port-forward mechanism to access the Slurm cluster, or customize it to your needs through our portal and the LB node.
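A sketch of the port-forward approach (the service name is a placeholder; find the real one with kubectl get svc -n fpt-hpc | grep login):
kubectl port-forward -n fpt-hpc svc/<login_service_name> 2222:22    # forward local port 2222 to the login node's sshd
ssh -p 2222 root@127.0.0.1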
Scale up/down worker nodes
To change the number of worker nodes (and of the other node types in general), we simply need to edit the Slurm cluster CR:
kubectl edit SlurmCluster fpt-hpc -n fpt-hpc
In the worker nodes configuration section, navigate to the “size” field and edit the number of worker nodes as desired.
Notes:
When scaling up the number of worker nodes, the new node will be automatically added to the list of worker nodes in the Slurm controller node and be ready to run jobs.
When scaling down nodes, you need to manually delete the node on the Slurm controller using the command:
scontrol delete nodeName=<>
The node list in a cluster will always be: worker-[0, (size - 1)].
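For illustration, if slurmNodes.worker.size is reduced from 8 to 6, the two removed nodes would be cleaned up on the controller like this:
scontrol delete nodeName=worker-6
scontrol delete nodeName=worker-7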
Migrate Slurm cluster to another K8s cluster
Thanks to the flexibility of K8s and network file storage, we can easily move a Slurm cluster from one K8s cluster to another. All that needs to be done is to re-create and mount the jail-pvc on the new K8s cluster and then repeat the steps for creating the Slurm cluster there.
Mounting external volumes into the Slurm cluster
To mount a volume into the Slurm cluster, you need to create that volume first, then deploy it as a PV & PVC on K8s. The following example uses dynamic provisioning to create this PV/PVC (in production environments, we recommend using static provisioning volumes to ensure data safety).
Step 1: Create PVC
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jail-submount-mlperf-sd-pvc
spec:
  storageClassName: default
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
Step 2: Declare this volume in the Slurm cluster
Edit the volumeSource field in the Slurm cluster CR:
kubectl edit SlurmCluster fpt-hpc -n fpt-hpc
volumeSources:
  - name: controller-spool
    persistentVolumeClaim:
      claimName: "controller-spool-pvc"
      readOnly: false
  - name: jail
    persistentVolumeClaim:
      claimName: "jail-pvc"
      readOnly: false
  - name: mlperf-sd
    persistentVolumeClaim:
      claimName: "jail-submount-mlperf-sd-pvc"
      readOnly: false
Step 3: Mount these volumes into the login and worker nodes in the Slurm cluster
In the login node:
volumes:
  jail:
    volumeSourceName: "jail"
  jailSubMounts:
    - name: "mlcommons-sd-bench-data"
      mountPath: "/mnt/data-hps"
      volumeSourceName: "mlperf-sd"
In the worker node:
volumes:
  spool:
    volumeClaimTemplateSpec:
      storageClassName: "xplat-nfs"
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: "120Gi"
  jail:
    volumeSourceName: "jail"
  jailSubMounts:
    - name: "mlcommons-sd-bench-data"
      mountPath: "/mnt/data-hps"
      volumeSourceName: "mlperf-sd"
Note: The mount path must be the same across all worker nodes and login nodes.