# Slurm on Managed GPU Cluster

The Managed GPU Cluster is built on the open-source Kubernetes (K8s) platform and automates the deployment, scaling, and management of containerized applications. It integrates container orchestration, storage, networking, security, and PaaS components, giving customers a ready-made environment for developing and deploying applications on the cloud.

FPT Cloud manages all control-plane components, while users deploy and manage the Worker Nodes. This allows users to focus on deploying applications without spending resources on managing the Kubernetes Cluster.

In addition to these integrated components, Managed GPU Cluster provides GPU resources to support compute-intensive tasks.

Things to note before using the Managed GPU Cluster:

* Cluster location: The geographic region affects access speed. Select the Region closest to the traffic source to optimize performance.
* Number of Nodes and their configurations: Every account is assigned quotas for resources such as RAM, GPU, CPU, storage, and IPs. Determine how many resources you need and the maximum limits you require so that FPT Cloud can provide the best support.

## Overview

Slurm is a powerful open-source platform used for cluster resource management and job scheduling. It is designed to optimize performance and efficiency for supercomputers and large computer clusters. Its core components work together to ensure high performance and flexibility. The diagram below illustrates how Slurm operates.

![](/files/RJfVVWqcus3gx5UdAKDg)

* slurmctld: The controller daemon of Slurm. Considered the “brain” of the system, it monitors cluster resources, schedules jobs, and manages cluster states. For higher reliability, a secondary slurmctld can be configured to avoid service interruption if the primary controller fails, ensuring high availability.
* slurmd: The node daemon of Slurm. Deployed on every compute node, it receives commands from slurmctld and manages job execution, including job launching, reporting job status, and preparing for upcoming jobs. slurmd communicates directly with compute resources and forms the basis of job scheduling.
* slurmdbd: The database daemon of Slurm. Although optional, it is crucial for long-term management and auditing in large clusters, as it maintains a centralized database for job history and accounting information. It can aggregate data from multiple Slurm-managed clusters, simplifying and improving data management efficiency.

The Slurm CLI provides commands to manage jobs and monitor the system:

* scontrol: view and modify the cluster configuration and state
* squeue: query job status in the queue
* srun: launch a job or job step and run it interactively
* sbatch: submit batch jobs for scheduled and resource-managed execution
* sinfo: query overall cluster status, including node availability
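
As a rough illustration of how these commands fit together, the sketch below wraps a typical session in a hypothetical helper, `slurm_tour` (the name is ours, not part of Slurm). It assumes it runs on a login node where the Slurm client tools are on the `PATH`; the `hostname` payload is only a placeholder workload.

```shell
# Hypothetical helper: a short tour of the Slurm CLI commands listed above.
slurm_tour() {
  local job_id

  # sinfo: show every node with its state (-N = one line per node, -l = long format)
  sinfo -N -l

  # sbatch: submit a one-line batch job; --parsable prints only the job ID
  # (followed by ";cluster" on multi-cluster setups, which is stripped here)
  job_id=$(sbatch --parsable --wrap="hostname")
  job_id=${job_id%%;*}

  # squeue: look the job up in the queue
  squeue -j "$job_id"

  # scontrol: print the full job record
  scontrol show job "$job_id"

  # srun: run a single task interactively and wait for its output
  srun --ntasks=1 hostname
}
```

Calling `slurm_tour` once after logging in gives a quick check that submission, queueing, and interactive execution all work.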

## Why Slurm on K8s?

Both Slurm and Kubernetes can serve as workload management systems for distributed model training and HPC (High-Performance Computing) in general.

Each system has its own strengths and weaknesses, with significant trade-offs. Slurm provides advanced scheduling, efficiency, fine-grained hardware control, and accounting capabilities, but lacks general-purpose flexibility. Conversely, Kubernetes can be used for many workloads beyond training (e.g., inference) and offers excellent auto-scaling and self-healing.

Unfortunately, there is currently no straightforward way to combine the benefits of both systems. And because many large tech companies use Kubernetes as the default infrastructure layer without support for specialized training systems, some ML engineers simply do not have a choice.

Running Slurm on Kubernetes lets us use Kubernetes’ auto-scaling and self-healing capabilities within Slurm and adds features of its own, all while retaining the familiar interaction model of the Slurm ecosystem.

## Slurm on Managed GPU Cluster

The Slurm Operator uses the SlurmCluster custom resource (CR) to define the configuration required to manage Slurm clusters and to handle control-plane management. This simplifies the deployment and maintenance of Slurm clusters. The figure below illustrates the architecture of Slurm on the FPT Cloud Managed GPU/K8s cluster. A cluster administrator deploys and manages a Slurm cluster through the SlurmCluster CR, and the Slurm Operator creates the Slurm control components inside the cluster based on it. A Slurm configuration file can be mounted into the control components through a shared volume or a ConfigMap.

<figure><img src="/files/ClWjLmTA6exmw83OrEe2" alt=""><figcaption></figcaption></figure>

In the Slurm-on-K8s deployment model, the components of a Slurm cluster (login nodes, worker nodes, and so on) run as pods on Kubernetes. The model also implements a shared-root volume: a shared filesystem that plays the role of an OS root filesystem. Every job sent to a worker node is executed inside this shared-root environment.

This ensures that every worker node always has identical configurations, packages, and state without manual management. In other words, when you install packages on one node, those packages automatically appear on the remaining nodes.

All you need to do is define your desired Slurm cluster in the Slurm cluster custom resource. The Slurm Operator will perform the deployment and management of the Slurm cluster for you according to the state you defined in the Slurm cluster CR.

### Deploy a Slurm Cluster on K8s

#### **Requirements**

* The K8s cluster must support dynamic volume provisioning and still have available storage quota.
* At least one StorageClass must be able to provide ReadWriteMany volumes.

#### **Step-by-Step**

**Step 1:** Install the Slurm Operator, GPU Operator, and Network Operator in the GPU software installation section and wait until all of them reach the ready state.

<figure><img src="/files/xU9LpuxTjNZWfvpoUffD" alt=""><figcaption></figcaption></figure>

**Step 2:** In the K8s cluster, pre-create the PersistentVolumeClaims (PVCs) that store the shared root space and the controller node data.

Pay attention to the volumes in the Slurm-on-K8s deployment model:

<figure><img src="/files/MarZts4Jg11XXoXoFn8E" alt=""><figcaption></figcaption></figure>

Where:

* `jail-pvc`: Mounted into the worker nodes and login nodes. It serves as the shared sandbox: the environment where jobs are executed and where users operate. This volume must be at least 40Gi to hold an OS filesystem.
* `controller-spool-pvc`: Stores cluster configuration data and is mounted on the controller node.

jail-pvc.yaml

```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jail-pvc
  namespace: fpt-hpc
spec:
  storageClassName: default
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
```

controller-spool-pvc.yaml

```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: controller-spool-pvc
  namespace: fpt-hpc
spec:
  storageClassName: default
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
```

Notes:

* Both volumes are mandatory. They must provide ReadWriteMany access, and the PVC names must remain exactly as above.
* For convenience, this example uses dynamic volume provisioning on the FPT Cloud Managed K8s product.
* For production environments, we recommend mounting the root volume from a static partition belonging to a file server, to facilitate migration and maintenance of the Slurm cluster.
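
One possible way to apply the two manifests is sketched below. The helper name `create_slurm_pvcs` is hypothetical, and the sketch assumes that `jail-pvc.yaml` and `controller-spool-pvc.yaml` are saved in the current directory and that `kubectl` already points at the target cluster.

```shell
# Hypothetical helper: create the namespace (if missing) and both PVCs.
# Assumes jail-pvc.yaml and controller-spool-pvc.yaml are in the current directory.
create_slurm_pvcs() {
  local ns="fpt-hpc"

  # Create the namespace only when it does not exist yet
  kubectl get namespace "$ns" >/dev/null 2>&1 || kubectl create namespace "$ns"

  # Apply both claims, then list them so their STATUS can be checked
  kubectl apply -f jail-pvc.yaml -f controller-spool-pvc.yaml
  kubectl get pvc -n "$ns"
}
```

With most dynamically provisioned StorageClasses the claims should show `Bound` shortly afterwards. If the StorageClass uses the `WaitForFirstConsumer` binding mode, they stay `Pending` until a pod mounts them, which is expected.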

**Step 3:** Download the SlurmCluster Helm chart and configure parameters

```
helm repo add xplat-fke https://registry.fke.fptcloud.com/chartrepo/xplat-fke
helm repo update
helm repo list
helm search repo slurm
helm pull xplat-fke/helm-slurm-cluster --version 1.14.10 --untar=true
```

**Note**: Adjust the slurm-cluster version to match the version of Slurm Operator.

For details on the parameters of a Slurm cluster, see the section "Parameters in the Slurm cluster" below.

```
cd helm-slurm-cluster/
vi values.yaml
```

In the values.yaml file of the downloaded folder, you need to adjust several important fields such as:

<table data-header-hidden><thead><tr><th valign="top"></th><th valign="top"></th></tr></thead><tbody><tr><td valign="top">Field</td><td valign="top">Description</td></tr><tr><td valign="top">slurmNodes.worker.size</td><td valign="top">Number of worker nodes</td></tr><tr><td valign="top">slurmNodes.worker.volumes.spool.volumeClaimTemplateSpec.storageClassName</td><td valign="top">StorageClass for worker-node state storage</td></tr><tr><td valign="top">slurmNodes.login.sshRootPublicKeys</td><td valign="top">List of root user public keys for login nodes</td></tr><tr><td valign="top">slurmNodes.accounting.mariadbOperator.storage.volumeClaimTemplate.storageClassName</td><td valign="top">StorageClass for SlurmDBD database storage</td></tr></tbody></table>

After configuring the cluster as needed, run:

```
helm install fpt-hpc ./ -n fpt-hpc
```
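
Before running the install for real (or before a later change), it can help to preview the result. The sketch below uses Helm's standard `--dry-run` and `--set` flags. The helper name `preview_slurm_install` is hypothetical, and it assumes the chart exposes the worker count under `slurmNodes.worker.size`, as listed in the table above.

```shell
# Hypothetical helper: simulate the install without creating anything,
# overriding the worker count on the command line instead of in values.yaml.
# Run it from inside the helm-slurm-cluster/ directory.
preview_slurm_install() {
  local workers="${1:-2}"
  helm install fpt-hpc ./ -n fpt-hpc --dry-run \
    --set "slurmNodes.worker.size=${workers}"
}
```

For example, `preview_slurm_install 4` prints the manifests that would be created for a four-worker cluster.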

**Step 4:** Wait until all Slurm pods are in the running state

The first installation of a Slurm cluster on a K8s cluster takes around 20 minutes and has two phases: phase 1 runs the setup jobs, and phase 2 installs the Slurm components.
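
Rather than checking by hand, the wait can be scripted. The following is a minimal sketch with two hypothetical helpers. It assumes that pods of the setup jobs end in the `Completed` status and everything else ends in `Running`, and it reads the STATUS column of `kubectl get pods`.

```shell
# Count pods whose STATUS (3rd column) is neither Running nor Completed.
count_unready_pods() {
  kubectl get pods -n "${1:-fpt-hpc}" --no-headers |
    awk '$3 != "Running" && $3 != "Completed" { n++ } END { print n + 0 }'
}

# Poll until every pod is Running/Completed or the attempts run out.
# Defaults: 60 attempts, 30 seconds apart (about 30 minutes in total).
wait_for_slurm_pods() {
  local ns="${1:-fpt-hpc}" attempts="${2:-60}" interval="${3:-30}"
  while [ "$attempts" -gt 0 ]; do
    [ "$(count_unready_pods "$ns")" -eq 0 ] && return 0
    attempts=$((attempts - 1))
    sleep "$interval"
  done
  return 1
}
```

Note that an empty namespace also counts as "nothing unready", so run `wait_for_slurm_pods` only after the first pods have appeared.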

When all components are ready, find the login node’s public IP using:

```
kubectl get svc -n fpt-hpc | grep login
```

SSH into the Slurm cluster login node:

```
ssh root@<IP_login_svc>
```

If using nodeshell:

```
chroot /mnt/jail
sudo -i
```

Run tests:

```
srun --nodes=2 --gres=gpu:1 nvidia-smi -L
salloc --nodes=1 --ntasks=1 --mem=4G --time=00:20:00 --gres=gpu:1
```

### Run a Sample Job on the Slurm Cluster

After logging in successfully to the Slurm cluster, you can verify its operation by training the minGPT model following the steps below:

**Step 1:** Clone the pytorch/examples repository

```
mkdir /shared
cd /shared
git clone https://github.com/pytorch/examples
```

**Step 2:** Navigate to the minGPT-ddp folder & install the necessary packages

```
cd examples/distributed/minGPT-ddp
pip3 install -r requirements.txt
pip3 install numpy
```

Due to the shared-root mechanism, we only need to run this once; these packages will automatically sync to all remaining worker nodes.

**Note**: In production environments, we recommend using a conda environment/container to create a training environment instead of installing packages directly into the global environment.

**Step 3**: Edit the Slurm script

```
vi mingpt/slurm/sbatch_run.sh
```

**Note**: Adjust the path to the main.py file inside sbatch\_run.sh to the actual path in the mingpt folder.
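
If you prefer not to edit the file by hand, the path can also be rewritten with `sed`. The helper below is a hypothetical sketch: it assumes GNU `sed` and that the script references `main.py` as a single whitespace-delimited path.

```shell
# Hypothetical helper: point the launch line of an sbatch script at main.py.
# Replaces the first whitespace-delimited token ending in "main.py" on each
# line with NEW_PATH. Uses GNU sed's in-place flag (-i); NEW_PATH must not
# contain the '#' character, which serves as the sed delimiter here.
set_main_py_path() {
  local script="$1" new_path="$2"
  sed -i -E "s#[^[:space:]]*main\.py#${new_path}#" "$script"
}
```

With the repository cloned under `/shared` as in Step 1, the call would be `set_main_py_path mingpt/slurm/sbatch_run.sh /shared/examples/distributed/minGPT-ddp/mingpt/main.py`.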

**Step 4**: Run the sample Slurm job

```
sbatch mingpt/slurm/sbatch_run.sh
```

**Step 5**: Check status

```
squeue
scontrol show job <job_id>
cat slurm-<job_id>.out   # Slurm's default output file name, unless the script sets --output
```
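
For longer jobs, polling the queue can be automated. The sketch below defines a hypothetical helper, `wait_for_job`, that blocks until the job is no longer pending, running, suspended, or completing, and then prints the job record.

```shell
# Hypothetical helper: block until a job has left the queue.
wait_for_job() {
  local job_id="$1" interval="${2:-10}"
  # -h omits the header and -t restricts the listing to active states,
  # so the output is empty once the job has finished
  while [ -n "$(squeue -h -j "$job_id" \
      -t PENDING,RUNNING,SUSPENDED,COMPLETING 2>/dev/null)" ]; do
    sleep "$interval"
  done
  # Show the final state while the controller still holds the job record
  scontrol show job "$job_id"
}
```

A typical use is `wait_for_job <job_id>` right after `sbatch`, followed by reading the job's log file.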

## Parameters in the Slurm cluster

The deployment steps above covered the most important parameters. This section goes deeper into the parameters/attributes defined for a Slurm cluster. For additional information, you can also read the comments in the values.yaml file of the SlurmCluster Helm chart downloaded in Step 3 of the deployment.

<table data-header-hidden><thead><tr><th valign="top"></th><th valign="top"></th><th valign="top"></th></tr></thead><tbody><tr><td valign="top">Attribute</td><td valign="top">Sample value</td><td valign="top">Description</td></tr><tr><td valign="top">clusterName</td><td valign="top">"fpt-hpc"</td><td valign="top">Cluster name (note: do not change)</td></tr><tr><td valign="top">k8sNodeFilters</td><td valign="top">N/A</td><td valign="top">Divides the K8s cluster into two lists: GPU nodes (to deploy Slurm workers) and non-GPU nodes (to deploy the other components). If the cluster only has GPU nodes, these two lists can be the same.</td></tr><tr><td valign="top">volumeSources</td><td valign="top"><p>volumeSources:</p><p>  - name: controller-spool</p><p>    persistentVolumeClaim:</p><p>      claimName: "controller-spool-pvc"</p><p>      readOnly: false</p><p>  - name: jail</p><p>    persistentVolumeClaim:</p><p>      claimName: "jail-pvc"</p><p>      readOnly: false</p></td><td valign="top">Defines the PersistentVolumeClaims used by the containers representing the components (worker, login, controller nodes, …) of the Slurm cluster.</td></tr><tr><td valign="top">periodicChecks</td><td valign="top">N/A</td><td valign="top">A periodic job that checks the status of each node. If a node contains a GPU with issues, that node is drained.</td></tr><tr><td valign="top">slurmNodes</td><td valign="top">N/A</td><td valign="top">Defines the number and configuration of component nodes in a Slurm cluster (login node, worker node, …)</td></tr><tr><td valign="top">slurmNodes.accounting</td><td valign="top"><p>enabled: true</p><p>mariadbOperator:</p><p>  enabled: true</p><p>  resources:</p><p>    cpu: "1000m"</p><p>    memory: "1Gi"</p><p>    ephemeralStorage: "5Gi"</p><p>  replicas: 1</p><p>  replication: {}</p><p>  storage:</p><p>    ephemeral: false</p><p>    volumeClaimTemplate:</p><p>      accessModes:</p><p>      - ReadWriteOnce</p><p>      resources:</p><p>        requests:</p><p>          storage: 10Gi</p><p>      storageClassName: xplat-nfs</p></td><td valign="top">Configuration of the accounting node. Here we use the mariadb operator to create the database; you can also use an external database (read more in values.yaml).</td></tr><tr><td valign="top">slurmNodes.controller</td><td valign="top"><p>size: 1</p><p>k8sNodeFilterName: "no-gpu"</p><p>slurmctld:</p><p>  port: 6817</p><p>  resources:</p><p>    cpu: "1000m"</p><p>    memory: "3Gi"</p><p>    ephemeralStorage: "20Gi"</p><p>munge:</p><p>  resources:</p><p>    cpu: "1000m"</p><p>    memory: "1Gi"</p><p>    ephemeralStorage: "5Gi"</p><p>volumes:</p><p>  spool:</p><p>    volumeSourceName: "controller-spool"</p><p>  jail:</p><p>    volumeSourceName: "jail"</p></td><td valign="top">Configuration of the controller node: 1 controller node with 2 volumes (spool &#x26; jail shared root space) mounted into it.</td></tr><tr><td valign="top">slurmNodes.worker</td><td valign="top"><p>size: 8</p><p>k8sNodeFilterName: "gpu"</p><p>cgroupVersion: v2</p><p>slurmd:</p><p>  port: 6818</p><p>  resources:</p><p>    cpu: "110000m"</p><p>    memory: "1220Gi"</p><p>    ephemeralStorage: "55Gi"</p><p>    gpu: 8</p><p>    rdma: 1</p><p>munge:</p><p>  resources:</p><p>    cpu: "2000m"</p><p>    memory: "4Gi"</p><p>    ephemeralStorage: "5Gi"</p><p>volumes:</p><p>  spool:</p><p>    volumeClaimTemplateSpec:</p><p>      storageClassName: "xplat-nfs"</p><p>      accessModes: ["ReadWriteOnce"]</p><p>      resources:</p><p>        requests:</p><p>          storage: "120Gi"</p><p>  jail:</p><p>    volumeSourceName: "jail"</p><p> </p></td><td valign="top">Configuration of the worker nodes: 8 worker nodes, each with 8 GPUs and the jail (shared root space) volume mounted into it.</td></tr><tr><td valign="top">slurmNodes.login</td><td valign="top"><p>login:</p><p>  size: 2</p><p>  k8sNodeFilterName: "no-gpu"</p><p>  sshd:</p><p>    port: 22</p><p>    resources:</p><p>      cpu: "3000m"</p><p>      memory: "9Gi"</p><p>      ephemeralStorage: "30Gi"</p><p>  sshRootPublicKeys:</p><p>    - "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIHke7B5+kGXx/Dwr76NI5KxfAAEkqcxbh6/8SV7tnpUP someorganize@example.com"</p><p>  sshdServiceLoadBalancerIP: ""</p><p>  sshdServiceNodePort: 30022</p><p>  munge:</p><p>    resources:</p><p>      cpu: "500m"</p><p>      memory: "500Mi"</p><p>      ephemeralStorage: "5Gi"</p><p>  volumes:</p><p>    jail:</p><p>      volumeSourceName: "jail"</p><p> </p></td><td valign="top">Configuration of the login nodes: 2 login nodes, with the sshd service exposed through the load-balancer service type in K8s, the public key defined in sshRootPublicKeys used for the root user, and the same jail volume mounted as on the controller node &#x26; worker node.</td></tr><tr><td valign="top">slurmNodes.exporter</td><td valign="top"><p>exporter:</p><p>  enabled: true</p><p>  size: 1</p><p>  k8sNodeFilterName: "no-gpu"</p><p>  exporter:</p><p>    resources:</p><p>      cpu: "250m"</p><p>      memory: "256Mi"</p><p>      ephemeralStorage: "500Mi"</p><p>  munge:</p><p>    resources:</p><p>      cpu: "1000m"</p><p>      memory: "1Gi"</p><p>      ephemeralStorage: "5Gi"</p><p>  volumes:</p><p>    jail:</p><p>      volumeSourceName: "jail"</p><p> </p></td><td valign="top">Installs the node exporter for monitoring.</td></tr></tbody></table>

For more detailed information, please read the comments in the values.yaml file defining the Slurm cluster configuration.

## Common use cases

### Add user/login

* Add users: To add an SSH key for root, edit the SlurmCluster CR:

```
kubectl edit SlurmCluster fpt-hpc -n fpt-hpc
```

* In the login node configuration section, navigate to the sshRootPublicKeys attribute and add your desired public key.
* To add a regular user, create it as you would on any Linux host:

```
sudo adduser <user_name>
```

* Modify login settings

By default, the login node is exposed through a public Load Balancer, which may not suit every requirement. You can change the LB type to private, use port forwarding to access the Slurm cluster, or customize the load balancer in our portal to suit your needs.
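
As one illustration of the port-forward option, the sketch below forwards a local port to the login service. The helper name `forward_login_ssh` is hypothetical, the service name is whatever `kubectl get svc -n fpt-hpc | grep login` prints in your cluster, and the service is assumed to listen on port 22 as in the sample configuration.

```shell
# Hypothetical helper: forward a local port to the SSH port of the login service.
# SERVICE is the name printed by `kubectl get svc -n fpt-hpc | grep login`.
forward_login_ssh() {
  local service="$1" local_port="${2:-2222}"
  kubectl port-forward -n fpt-hpc "svc/${service}" "${local_port}:22"
}
```

While the forward is running, connect from a second terminal with `ssh -p 2222 root@127.0.0.1`.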

### Scale up/down worker nodes

To change the number of worker nodes (or of any other node type), edit the SlurmCluster CR:

```
kubectl edit SlurmCluster fpt-hpc -n fpt-hpc
```

In the worker nodes configuration section, navigate to the “size” field and edit the number of worker nodes as desired.

**Notes:**

* When scaling up the number of worker nodes, the new node will be automatically added to the list of worker nodes in the Slurm controller node and be ready to run jobs.
* When scaling down nodes, you need to manually delete the node on the Slurm controller using the command:

```
scontrol delete NodeName=<node_name>
```

The node list in a cluster is always worker-0 through worker-(size - 1).
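
Because node names follow this fixed pattern, the nodes to delete after a scale-down can be derived from the old and new sizes. The helpers below are a hypothetical sketch that assumes the highest-numbered workers are the ones removed, which follows from the naming rule above.

```shell
# Print the Slurm node names that disappear when the worker count
# drops from OLD_SIZE to NEW_SIZE (worker-NEW_SIZE .. worker-(OLD_SIZE-1)).
nodes_to_delete() {
  local old_size="$1" i="$2"
  while [ "$i" -lt "$old_size" ]; do
    echo "worker-${i}"
    i=$((i + 1))
  done
}

# Remove each of those nodes from the controller's node list.
delete_scaled_down_nodes() {
  local node
  for node in $(nodes_to_delete "$1" "$2"); do
    scontrol delete NodeName="$node"
  done
}
```

For example, after reducing `size` from 8 to 6, `delete_scaled_down_nodes 8 6` removes `worker-6` and `worker-7`.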

### Migrate Slurm cluster to another K8s cluster

Thanks to the flexibility of K8s and network file storage, a Slurm cluster can easily be moved from one K8s cluster to another: recreate and mount the jail-pvc on the new K8s cluster, then repeat the deployment steps to create the Slurm cluster there.

### Mounting external volumes into the Slurm cluster

To mount a volume into the Slurm cluster, you need to create that volume first, then deploy it as a PV & PVC on K8s. The following example uses dynamic provisioning to create this PV/PVC (in production environments, we recommend using static provisioning volumes to ensure data safety).

**Step 1**: Create PVC

```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jail-submount-mlperf-sd-pvc
  namespace: fpt-hpc
spec:
  storageClassName: default
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
```

**Step 2**: Declare this volume in the Slurm cluster

Edit the `volumeSources` field in the Slurm cluster CR:

```
kubectl edit SlurmCluster fpt-hpc -n fpt-hpc
```

```
volumeSources:
  - name: controller-spool
    persistentVolumeClaim:
      claimName: "controller-spool-pvc"
      readOnly: false
  - name: jail
    persistentVolumeClaim:
      claimName: "jail-pvc"
      readOnly: false
  - name: mlperf-sd
    persistentVolumeClaim:
      claimName: "jail-submount-mlperf-sd-pvc"
      readOnly: false
```

**Step 3**: Mount these volumes into the login and worker nodes in the Slurm cluster

In the login node:

```
volumes:
  jail:
    volumeSourceName: "jail"
  jailSubMounts:
    - name: "mlcommons-sd-bench-data"
      mountPath: "/mnt/data-hps"
      volumeSourceName: "mlperf-sd"
```

In the worker node:

```
volumes:
  spool:
    volumeClaimTemplateSpec:
      storageClassName: "xplat-nfs"
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: "120Gi"
  jail:
    volumeSourceName: "jail"
  jailSubMounts:
    - name: "mlcommons-sd-bench-data"
      mountPath: "/mnt/data-hps"
      volumeSourceName: "mlperf-sd"
```

**Note**: The mount path must be the same across all worker nodes and login nodes.
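
Once the operator has applied the change, a quick check from a login node can confirm that the path is mounted everywhere. The helper below is a hypothetical sketch: it assumes the mount path `/mnt/data-hps` from the example above and runs `df` locally and then once per worker node through `srun`.

```shell
# Hypothetical helper: verify that a sub-mount is visible on the login node
# and on NODES worker nodes (one task per node).
check_submount() {
  local path="${1:-/mnt/data-hps}" nodes="${2:-2}"
  # On the login node itself
  df -h "$path" || return 1
  # On the worker nodes
  srun --nodes="$nodes" --ntasks-per-node=1 df -h "$path"
}
```

If the same filesystem and size appear in every output line, the volume is mounted consistently.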
