How to Monitor a Pipeline?

Model Fine-tuning provides metrics and logs to help you monitor and troubleshoot your workloads. To view your logs and metrics:

  1. Open the Pipeline list.

  2. Open the Execution history by clicking the pipeline name in the Name column.

  3. Open the Execution details by clicking the execution name in the Name column.

  4. Navigate to Model metrics, System metrics, or Logs.

From there you can review model metrics, system metrics, and logs to monitor an execution's activity or diagnose issues.

Model Metrics

Note: Model metrics are available only while the pipeline execution is in the Running status and at the Training stage.

Model metrics are collected to track the training performance of AI models during and after the fine-tuning process. These metrics help detect training anomalies, guide hyperparameter adjustments, and improve model performance.

Training metrics:

  • loss: Measures how well the model is learning. A high loss means poor predictions; a low loss means the model is fitting the data well.

  • learning_rate: Controls how fast the model learns. A learning rate that’s too high can cause instability; one that’s too low can slow down training.

  • grad_norms: Indicates the magnitude of the gradients. Helps detect issues such as vanishing or exploding gradients, which hinder learning.

  • epoch: Tracks how many full passes the model has made over the training data. Useful for monitoring learning progress over time.
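To make the relationship between these quantities concrete, here is a minimal, stdlib-only sketch of a gradient-descent loop that records the same four metrics (loss, learning_rate, grad_norms, epoch). This is a toy illustration with assumed names, not the service's implementation:

```python
import math

def train(epochs=3, lr=0.1):
    """Toy 1-D linear regression fit with plain gradient descent.

    Records the same quantities the Model metrics view plots:
    loss, learning_rate, grad_norms, and epoch.
    Hypothetical helper, not the service's actual training code.
    """
    # Synthetic data: targets follow y = 2x + 1
    data = [(i / 10, 2.0 * (i / 10) + 1.0) for i in range(10)]
    w, b = 0.0, 0.0
    history = []
    for epoch in range(1, epochs + 1):
        for x, y in data:
            err = (w * x + b) - y
            loss = err * err                        # squared-error loss
            grad_w, grad_b = 2 * err * x, 2 * err   # gradients of the loss
            grad_norm = math.sqrt(grad_w ** 2 + grad_b ** 2)
            w -= lr * grad_w                        # parameter update
            b -= lr * grad_b
            history.append({"loss": loss, "learning_rate": lr,
                            "grad_norms": grad_norm, "epoch": epoch})
    return history

history = train()
print(f"first loss={history[0]['loss']:.4f}, last loss={history[-1]['loss']:.4f}")
```

If training is healthy, loss trends downward over the recorded steps while grad_norms stays bounded; a spike in either is the kind of anomaly these charts help you catch.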

Evaluation metrics:

Note: Evaluation metrics are shown only when evaluation data is provided.

  • eval_runtime: Measures how long the evaluation process takes. Useful for performance benchmarking.

  • eval_samples_per_second: Indicates evaluation throughput. Higher is better for faster model validation.

  • eval_steps_per_second: Measures how many evaluation steps are completed per second. Reflects evaluation efficiency.

  • eval_loss: Measures how well the model generalizes to unseen data. Helps detect overfitting or underfitting.
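The two throughput figures are simple derivatives of the wall-clock runtime. The sketch below shows the relationship under the usual convention that one step processes one batch (a hypothetical helper; the service computes these values for you):

```python
def throughput_metrics(num_samples, batch_size, eval_runtime):
    """Derive eval throughput figures from evaluation wall-clock time.

    Assumes one evaluation step processes one batch, so the number of
    steps is the number of batches needed to cover all samples.
    """
    steps = -(-num_samples // batch_size)  # ceiling division: batches needed
    return {
        "eval_runtime": eval_runtime,
        "eval_samples_per_second": num_samples / eval_runtime,
        "eval_steps_per_second": steps / eval_runtime,
    }

m = throughput_metrics(num_samples=1000, batch_size=8, eval_runtime=25.0)
print(m)  # 1000 samples in 25 s -> 40.0 samples/s; 125 steps -> 5.0 steps/s
```

Comparing these numbers across runs (with the same dataset and batch size) is a quick way to spot evaluation slowdowns.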

Training performance metrics:

  • train_runtime: Total time spent training. Useful for estimating training cost and efficiency.

  • train_samples_per_second: Measures training throughput. Higher values indicate faster training.

  • train_steps_per_second: Indicates how many training steps are completed per second. Reflects training speed.

  • total_flos: Total floating-point operations used. Helps estimate computational cost and model complexity.

  • train_loss: Measures how well the model fits the training data. Should decrease over time if training is effective.
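To sanity-check a reported total_flos, a common rule of thumb for transformer training is roughly 6 FLOPs per parameter per training token. The sketch below applies that estimate; note that the metric shown in the UI is measured during training, so this back-of-the-envelope figure is only a rough cross-check:

```python
def estimate_training_cost(num_params, num_tokens, train_runtime):
    """Back-of-the-envelope training cost figures.

    Uses the common ~6 * params * tokens approximation for transformer
    training FLOPs (forward + backward pass). The reported total_flos
    is measured, not estimated, so treat this as a sanity check only.
    """
    total_flos = 6 * num_params * num_tokens
    return {
        "total_flos": total_flos,
        "train_runtime": train_runtime,
        # Effective compute rate actually achieved over the run
        "achieved_flops_per_second": total_flos / train_runtime,
    }

est = estimate_training_cost(num_params=7_000_000_000,   # 7B-parameter model
                             num_tokens=2_000_000,       # fine-tuning corpus size
                             train_runtime=3600.0)       # one hour, in seconds
print(f"total_flos ~ {est['total_flos']:.2e}")
```

Dividing total_flos by train_runtime, as above, gives the effective compute rate, which you can compare against your hardware's peak to gauge utilization.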

System Metrics

System metrics are collected to monitor the hardware and infrastructure performance during model training and evaluation. These metrics help identify resource bottlenecks, optimize hardware utilization, and ensure stable and efficient training processes.

  • GPU Utilization (%): Measures how much of the GPU’s processing power is being used. High values indicate the GPU is actively working; low values may suggest bottlenecks elsewhere (e.g., data loading).

  • CPU Utilization (%): Indicates how much of the CPU is being used. Useful for detecting whether the CPU is a bottleneck in data preprocessing or I/O operations.

  • GPU Power Usage (W): Shows the actual power consumption of the GPU in watts. Helps monitor energy efficiency and thermal limits.

  • GPU Power Usage (%): Percentage of the GPU’s maximum power capacity being used. Useful for understanding how close the GPU is to its power limits.

  • GPU Memory Usage (MB): Amount of GPU memory currently in use. Important for ensuring the model and data fit within available memory.

  • GPU Memory Usage (%): Percentage of total GPU memory being used. High usage may cause memory overflow or instability.

  • RAM Usage (MB): Amount of system RAM being used. Helps monitor memory pressure from data loading, preprocessing, or model components.

  • RAM Usage (%): Percentage of total system RAM in use. High values may indicate a need for memory optimization or hardware upgrades.

  • Network Bandwidth: Measures the data transfer rate over the network. Important in distributed training or when loading data from remote sources. Low bandwidth can slow down training.
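The service collects these metrics for you, but it can help to see how such GPU figures are typically derived. The sketch below parses one sample line in the CSV shape produced by `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` (the sample values are invented for illustration):

```python
# One sample line in the shape produced by:
#   nvidia-smi --query-gpu=utilization.gpu,power.draw,power.limit,memory.used,memory.total \
#              --format=csv,noheader,nounits
# Values here are made up for the example.
sample = "87, 250.13, 300.00, 30720, 40960"

def parse_gpu_sample(line):
    """Turn one raw GPU sample into the dashboard's metric names.

    Percentages are derived from the absolute readings, mirroring how
    GPU Power Usage (%) and GPU Memory Usage (%) relate to the (W)
    and (MB) figures above.
    """
    util, power, power_limit, mem_used, mem_total = (
        float(v) for v in line.split(","))
    return {
        "GPU Utilization (%)": util,
        "GPU Power Usage (W)": power,
        "GPU Power Usage (%)": 100 * power / power_limit,
        "GPU Memory Usage (MB)": mem_used,
        "GPU Memory Usage (%)": 100 * mem_used / mem_total,
    }

metrics = parse_gpu_sample(sample)
print(metrics["GPU Memory Usage (%)"])  # 30720 / 40960 -> 75.0
```

A real collector would poll such samples on an interval and ship them to the metrics backend; the point here is only how the percentage metrics relate to the absolute ones.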

Logs

Logs provide detailed insights into the execution of a specific fine-tuning pipeline, helping you monitor progress, troubleshoot issues, and maintain transparency in model training workflows.

You can use Logs to:

  • Trace the sequence of events using timestamps.

  • Check status messages for errors or warnings.

  • Download logs before contacting support for faster resolution.
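If you download the logs, the first two uses above amount to parsing timestamps and filtering by level. A minimal sketch, assuming a hypothetical timestamp-plus-level line format (the real log format may differ):

```python
import re
from datetime import datetime

# Hypothetical log excerpt; the actual format of downloaded logs may differ.
raw = """\
2024-05-01T10:00:00Z INFO  starting training stage
2024-05-01T10:05:12Z WARN  gradient norm spiked
2024-05-01T10:07:03Z ERROR checkpoint upload failed, retrying
2024-05-01T10:07:09Z INFO  checkpoint upload succeeded
"""

LINE = re.compile(r"^(\S+)\s+(INFO|WARN|ERROR)\s+(.*)$")

def parse_logs(text):
    """Extract (time, level, message) events from raw log text."""
    events = []
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            ts, level, msg = m.groups()
            events.append({
                # fromisoformat needs an explicit UTC offset, not "Z"
                "time": datetime.fromisoformat(ts.replace("Z", "+00:00")),
                "level": level,
                "message": msg,
            })
    return events

events = parse_logs(raw)
problems = [e for e in events if e["level"] in ("WARN", "ERROR")]
print(len(events), len(problems))  # 4 events, 2 of them warnings/errors
```

Sorting or filtering the parsed events by their timestamps reconstructs the sequence of what happened, which is exactly the trace support will ask about.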
