✌️Monitoring
The monitoring feature is bundled with the AI Infrastructure – Metal Cloud service.
Collecting and visualizing metrics, logs, and events helps you identify potential issues and optimize future workloads. You may select the observability solution that best fits your needs.
Metrics
A Cluster (in the same VPC)
A single Server
Total number of nodes and down nodes
✔
GPU model, Driver & CUDA version
✔
Power state
✔
Uptime
✔
Total number of GPUs and down GPUs
✔
✔
GPU Utilization
✔
✔
GPU Memory
✔
✔
CPU Utilization
✔
✔
System Memory
✔
✔
Root Storage Usage
✔
✔
Local Disk Usage
✔
✔
Details of each GPU: Power consumption, Temperature, GPU Utilization, VRAM usage
✔
Network Bandwidth Inbound/Outbound
✔
✔
Network Packets Sent/Received
✔
✔
Network Error rate Receive/Transmit
✔
Network InfiniBand Bandwidth/Packet/Error
✔
System Fan Speed
✔
System Voltage
✔
Common Alerts
✔
*For custom or advanced metrics, we offer a Cloud Monitoring (FMON) service on request, available for an additional charge.
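The per-GPU metrics listed in the table (power consumption, temperature, GPU utilization, VRAM usage) can also be spot-checked directly on a single server. The following is a minimal Python sketch, assuming the NVIDIA driver and its `nvidia-smi` command-line tool are installed on that server. It does not describe how the bundled monitoring feature or FMON collects data; the names `GpuSample`, `parse_gpu_csv`, and `read_local_gpus` are illustrative only.

```python
import csv
import io
import subprocess
from dataclasses import dataclass
from typing import List, Optional

# nvidia-smi query fields chosen to mirror the per-GPU metrics in the table.
QUERY_FIELDS = [
    "index",
    "name",
    "utilization.gpu",
    "memory.used",
    "memory.total",
    "temperature.gpu",
    "power.draw",
]


@dataclass
class GpuSample:
    """One reading for one GPU (illustrative type, not part of the service)."""
    index: int
    name: str
    utilization_pct: Optional[float]
    vram_used_mib: Optional[float]
    vram_total_mib: Optional[float]
    temperature_c: Optional[float]
    power_w: Optional[float]


def _to_float(cell: str) -> Optional[float]:
    """Return the numeric value, or None when the field is reported as unavailable."""
    try:
        return float(cell)
    except ValueError:
        return None


def parse_gpu_csv(text: str) -> List[GpuSample]:
    """Parse text produced with --format=csv,noheader,nounits, one line per GPU."""
    samples = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        # Assumes exactly the seven fields listed in QUERY_FIELDS, in that order.
        idx, name, util, used, total, temp, power = (c.strip() for c in row)
        samples.append(
            GpuSample(
                index=int(idx),
                name=name,
                utilization_pct=_to_float(util),
                vram_used_mib=_to_float(used),
                vram_total_mib=_to_float(total),
                temperature_c=_to_float(temp),
                power_w=_to_float(power),
            )
        )
    return samples


def read_local_gpus() -> List[GpuSample]:
    """Run nvidia-smi on the local server and return one sample per GPU."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)
```

Calling `read_local_gpus()` on a GPU server returns one `GpuSample` per GPU. Parsing is kept separate from the subprocess call so that `parse_gpu_csv` can be exercised with saved output on a machine without GPUs.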
