✌️Monitoring

The monitoring feature is bundled with the AI Infrastructure – Metal Cloud service.

Collecting and visualizing metrics, logs, and events helps you identify potential issues and optimize future workloads. You may select the observability solution that best fits your needs.

Metrics

A Cluster (in the same VPC)

A single Server

Total number of nodes and down nodes

GPU model, Driver & CUDA version

Power state

Uptime

Total number of GPUs and down GPUs

GPU Utilization

GPU Memory

CPU Utilization

System Memory

Root Storage Usage

Local Disk Usage

Per-GPU details: Power consumption, Temperature, GPU Utilization, VRAM usage

Network Bandwidth Inbound/Outbound

Network Packets Sent/Received

Network Error rate Receive/Transmit

Network InfiniBand Bandwidth/Packet/Error

System Fan Speed

System Voltage

Common Alerts

*For custom or advanced metrics, a Cloud Monitoring (FMON) service is available on request for an additional charge.
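To illustrate how the per-server metrics relate to the cluster-level counts (total and down nodes, total and down GPUs), here is a minimal Python sketch. It assumes you have already collected samples with an observability tool of your choice. The `GpuSample` and `ServerSample` types, their field names, and the rule that a GPU on a powered-off server counts as down are all hypothetical and are not part of the Metal Cloud service.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GpuSample:
    # Hypothetical per-GPU reading; field names are illustrative only.
    healthy: bool
    utilization_pct: float


@dataclass
class ServerSample:
    # Hypothetical per-server reading.
    name: str
    powered_on: bool
    gpus: List[GpuSample] = field(default_factory=list)


def summarize_cluster(servers: List[ServerSample]) -> dict:
    """Roll per-server samples up into cluster-level counts."""
    total_gpus = sum(len(s.gpus) for s in servers)
    # Assumption: a GPU is "up" only if it is healthy and its host is powered on.
    up_gpus = [g for s in servers if s.powered_on for g in s.gpus if g.healthy]
    avg_util = (
        sum(g.utilization_pct for g in up_gpus) / len(up_gpus) if up_gpus else 0.0
    )
    return {
        "total_nodes": len(servers),
        "down_nodes": sum(1 for s in servers if not s.powered_on),
        "total_gpus": total_gpus,
        "down_gpus": total_gpus - len(up_gpus),
        "avg_gpu_utilization_pct": avg_util,
    }
```

Calling `summarize_cluster` on a list of `ServerSample` objects returns a dictionary that mirrors the cluster-level rows listed above. A real deployment would define "down" according to its own health checks.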
