Cluster monitoring (GPU Telemetry)

FPT Cloud uses NVIDIA GPU Telemetry integrated with kube-prometheus-stack, a monitoring toolkit for GPU-based systems on Kubernetes. The toolkit includes a metrics collector, a time-series database that stores the metrics, and a visualization interface. It is built on the popular open source applications Prometheus and Grafana.

Prometheus also includes Alertmanager to create and manage alerts. Prometheus is deployed alongside kube-state-metrics and node_exporter to expose cluster-level metrics for Kubernetes API objects and node-level metrics, such as CPU utilization.

  • Check custom GPU metrics using the following command:

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq -r . | grep DCGM
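The `grep` above matches every line of the pretty-printed JSON that contains `DCGM`. As a minimal sketch, assuming the response is a standard `APIResourceList` whose entries carry a `name` field, the metric names alone can be listed as follows. `list_dcgm_metrics` is a hypothetical helper name, not part of any tool.

```shell
#!/usr/bin/env bash
# Sketch: print only the names of custom metrics that contain "DCGM".
# Assumption: the input is an APIResourceList with a .resources[].name field.

# Hypothetical helper: reads the API response on stdin, prints matching names.
list_dcgm_metrics() {
  jq -r '.resources[].name' | grep DCGM
}

# Usage (requires cluster access):
#   kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | list_dcgm_metrics
```

If the helper prints nothing, no DCGM metrics are currently registered with the custom metrics API.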

  • Access Prometheus to check DCGM GPU metrics

# Forward the Prometheus service so it can be reached from a web browser
kubectl port-forward service/kube-prometheus-stack-1679-prometheus 63090:9090
* where 63090 is the local port on your computer (client) and 9090 is the port of the Prometheus service
# Access Prometheus in a web browser at the following link:
http://localhost:63090/
  • On the Prometheus interface, perform the following steps to check the DCGM GPU metrics
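The same check can also be scripted against the Prometheus HTTP API instead of the web interface. The sketch below is a minimal example, assuming the port-forward is running so that Prometheus answers on `http://localhost:63090`, and assuming the DCGM exporter publishes `DCGM_FI_DEV_GPU_UTIL` (a commonly exported GPU utilization metric; your configuration may expose different names). `prom_query_url` and `prom_query` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: query a DCGM metric through the Prometheus HTTP API.
# Assumption: Prometheus is reachable on localhost:63090 via the port-forward.

PROM_URL="${PROM_URL:-http://localhost:63090}"

# Hypothetical helper: build the instant-query URL for a plain metric name.
# Metric names contain only letters, digits, and underscores, so no
# URL encoding is needed here; a full PromQL expression would need it.
prom_query_url() {
  printf '%s/api/v1/query?query=%s\n' "$PROM_URL" "$1"
}

# Hypothetical helper: run the query and print the raw JSON response.
prom_query() {
  curl -sf "$(prom_query_url "$1")"
}

# Usage (requires the port-forward):
#   prom_query DCGM_FI_DEV_GPU_UTIL | jq '.data.result'
```

An empty `result` array means Prometheus is reachable but has no samples for that metric name.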

  • Access the Grafana Dashboard

# Forward the Grafana service so it can be reached from a web browser
kubectl port-forward service/kube-prometheus-stack-1679050354-grafana 63080:80
* where 63080 is the local port on your computer (client) and 80 is the port of the Grafana service
# Access Grafana in a web browser at the following link:
http://localhost:63080/
  • The default username and password to log in to Grafana are:

User: admin

Password: prom-operator

  • Import Grafana Dashboard for GPU

To import the dashboard, open the Grafana interface and go to Dashboards > Manage > Import. If you are using the FPT Cloud dashboard, paste the FPT Cloud GPU Dashboard JSON content and click Load.
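As an alternative to the web interface, a dashboard can be pushed through Grafana's HTTP API. The sketch below is a minimal example under several assumptions: Grafana is reachable on `http://localhost:63080` via the port-forward, the default `admin` / `prom-operator` credentials are still in place, and the dashboard JSON is saved locally in a file such as `gpu-dashboard.json` (a hypothetical file name). `wrap_dashboard` and `import_dashboard` are hypothetical helper names. Dashboards exported with `__inputs` placeholders may still need the interactive import.

```shell
#!/usr/bin/env bash
# Sketch: import a dashboard JSON file through the Grafana HTTP API.
# Assumptions: port-forward on localhost:63080, default credentials unchanged.

GRAFANA_URL="${GRAFANA_URL:-http://localhost:63080}"
GRAFANA_USER="${GRAFANA_USER:-admin}"
GRAFANA_PASS="${GRAFANA_PASS:-prom-operator}"

# Hypothetical helper: wrap a dashboard (stdin) in the request body expected
# by POST /api/dashboards/db. The id is reset to null so that Grafana
# creates a new dashboard instead of looking up an existing numeric id.
wrap_dashboard() {
  jq '{dashboard: (. + {id: null}), overwrite: true}'
}

# Hypothetical helper: send the wrapped dashboard file "$1" to Grafana.
import_dashboard() {
  wrap_dashboard < "$1" |
    curl -sf -u "$GRAFANA_USER:$GRAFANA_PASS" \
      -H 'Content-Type: application/json' \
      --data-binary @- \
      "$GRAFANA_URL/api/dashboards/db"
}

# Usage (requires the port-forward):
#   import_dashboard gpu-dashboard.json
```

Setting `overwrite` to `true` replaces an existing dashboard with the same UID; change it to `false` if that is not wanted.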
