✌️Monitoring

The monitoring feature is bundled with the AI Infrastructure – Metal Cloud service.

Collecting and visualizing metrics, logs, and events can help you identify potential issues and optimize future workloads. You may select the observability solution that best fits your needs.

Metrics

A Cluster (in the same VPC)

A single Server

Total number of nodes and down nodes

✔

GPU model, Driver & CUDA version

✔

Power state

✔

Uptime

✔

Total number of GPUs and down GPUs

✔

✔

GPU Utilization

✔

✔

GPU Memory

✔

✔

CPU Utilization

✔

✔

System Memory

✔

✔

Root Storage Usage

✔

✔

Local Disk Usage

✔

✔

Per-GPU details: power consumption, temperature, GPU utilization, VRAM usage

✔

Network Bandwidth Inbound/Outbound

✔

✔

Network Packets Sent/Received

✔

✔

Network Error rate Receive/Transmit

✔

Network InfiniBand Bandwidth/Packet/Error

✔

System Fan Speed

✔

System Voltage

✔

Common Alerts

✔

*For custom or advanced metrics, we offer the Cloud Monitoring (FMON) service on request for an additional charge.
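
If you want to spot-check the per-GPU figures listed above (power consumption, temperature, GPU utilization, VRAM usage) from inside a server, the following is a minimal sketch. It assumes NVIDIA GPUs with the `nvidia-smi` tool installed; the function names `parse_gpu_csv` and `query_gpus` are hypothetical, and this is not a description of how the bundled monitoring feature collects its data.

```python
import csv
import io
import subprocess

# Standard nvidia-smi --query-gpu field names; they are reused as dictionary keys.
FIELDS = [
    "index",
    "name",
    "power.draw",
    "temperature.gpu",
    "utilization.gpu",
    "memory.used",
    "memory.total",
]


def parse_gpu_csv(text):
    """Parse nvidia-smi CSV output (noheader, nounits) into a list of dicts.

    Values are kept as strings because nvidia-smi may report
    placeholders such as "[N/A]" instead of a number.
    """
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(record) != len(FIELDS):
            continue  # skip blank or malformed lines
        rows.append(dict(zip(FIELDS, (value.strip() for value in record))))
    return rows


def query_gpus():
    """Run nvidia-smi once and return one dict per GPU (hypothetical helper)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)
```

Calling `query_gpus()` on a server with the NVIDIA driver installed should return one dictionary per GPU. With `nounits`, power is reported in watts, temperature in degrees Celsius, utilization in percent, and memory in MiB.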
