# How to Monitor a Pipeline

Model Fine-tuning provides metrics and logs to help you monitor and troubleshoot your workloads. To view your logs and metrics:

1. Open the **Pipeline list**.
2. Open the **Execution history** by clicking the pipeline name in the **Name** column.
3. Open the **Execution details** by clicking the execution name in the **Name** column.
4. Navigate to **Model metrics**, **System metrics**, or **Logs**.

These tabs give you model metrics, system metrics, and logs, so you can monitor an execution's activity and diagnose issues.

## Model Metrics

{% hint style="info" %} <mark style="color:$warning;">**Note**</mark>: Model metrics are enabled only when the execution pipeline is in the **Running** status and at the **Training** stage.
{% endhint %}

<figure><img src="/files/dzWV8oEVzMqRWfqqcVuf" alt=""><figcaption></figcaption></figure>

**Model metrics** are collected to track the training performance of AI models during and after the fine-tuning process. These metrics help detect training anomalies, guide hyperparameter adjustments and improve model performance.

**Training metrics:**

<table><thead><tr><th width="230">Metric</th><th>What it evaluates</th></tr></thead><tbody><tr><td><strong>loss</strong></td><td>Measures how well the model is learning. A high loss means poor prediction; a low loss means the model is fitting the data well.</td></tr><tr><td><strong>learning_rate</strong></td><td>Controls how fast the model learns. A learning rate that’s too high can cause instability; too low can slow down training.</td></tr><tr><td><strong>grad_norms</strong></td><td>Indicates the magnitude of gradients. Helps detect issues like vanishing or exploding gradients, which affect learning.</td></tr><tr><td><strong>epoch</strong></td><td>Tracks how many full passes the model has made over the training data. Useful for monitoring learning progress over time.</td></tr></tbody></table>
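
The training metrics above can also be checked with a few lines of code once you have the values outside the dashboard. The following is a minimal sketch, not a platform feature: it assumes you have copied the metric values into a list of dicts by hand, and the record layout, the `flag_training_anomalies` name, and both thresholds are hypothetical and chosen only for illustration.

```python
import math


def flag_training_anomalies(history, grad_norm_max=10.0, grad_norm_min=1e-6):
    """Return (step, message) pairs for suspicious training records.

    `history` is a hypothetical list of dicts with "step", "loss" and
    "grad_norms" keys. The thresholds are illustrative assumptions,
    not platform defaults.
    """
    findings = []
    for record in history:
        step, loss, grad = record["step"], record["loss"], record["grad_norms"]
        if math.isnan(loss) or math.isinf(loss):
            findings.append((step, "loss is not finite"))
        if grad > grad_norm_max:
            findings.append((step, "possible exploding gradients"))
        elif grad < grad_norm_min:
            findings.append((step, "possible vanishing gradients"))
    return findings


# Made-up values, for illustration only.
history = [
    {"step": 10, "loss": 2.31, "grad_norms": 1.2},
    {"step": 20, "loss": 2.05, "grad_norms": 48.0},
    {"step": 30, "loss": float("nan"), "grad_norms": 0.9},
]
for step, message in flag_training_anomalies(history):
    print(f"step {step}: {message}")
```

Suitable thresholds depend on the model and optimizer settings, so treat the defaults here as starting points to tune rather than recommended values.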

**Evaluation metrics:**

{% hint style="info" %} <mark style="color:$warning;">**Note**</mark>: Evaluation metrics are shown only when evaluation data is used.
{% endhint %}

<table><thead><tr><th width="230">Metric</th><th>What it evaluates</th></tr></thead><tbody><tr><td><strong>eval_runtime</strong></td><td>Measures how long the evaluation process takes. Useful for performance benchmarking.</td></tr><tr><td><strong>eval_samples_per_second</strong></td><td>Indicates evaluation throughput. Higher is better for faster model validation.</td></tr><tr><td><strong>eval_steps_per_second</strong></td><td>Measures how many evaluation steps are completed per second. Reflects evaluation efficiency.</td></tr><tr><td><strong>eval_loss</strong></td><td>Measures how well the model generalizes to unseen data. Helps detect overfitting or underfitting.</td></tr></tbody></table>
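
One common way to use `eval_loss` for overfitting detection is to compare its trend with the training loss. The sketch below assumes you have recorded both values at the same evaluation points; the `is_overfitting` helper and its `patience` parameter are hypothetical names, and the rule itself is a simple heuristic rather than the platform's own logic.

```python
def is_overfitting(train_losses, eval_losses, patience=2):
    """Heuristic: eval loss rose for `patience` consecutive evaluations
    while training loss fell over the same window.

    Both lists are assumed to be aligned, one value per evaluation point.
    """
    if len(eval_losses) <= patience or len(train_losses) <= patience:
        return False  # not enough history to judge a trend
    recent_eval = eval_losses[-(patience + 1):]
    recent_train = train_losses[-(patience + 1):]
    eval_rising = all(b > a for a, b in zip(recent_eval, recent_eval[1:]))
    train_falling = recent_train[-1] < recent_train[0]
    return eval_rising and train_falling


# Made-up values: training loss keeps falling while eval loss turns upward.
train = [1.00, 0.80, 0.60, 0.50]
evals = [0.90, 0.85, 0.88, 0.93]
print(is_overfitting(train, evals))
```

If both losses stay high and flat instead, that pattern points toward underfitting, which this helper does not try to detect.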

**Training performance metrics:**

<table><thead><tr><th width="230">Metric</th><th>What it evaluates</th></tr></thead><tbody><tr><td><strong>train_runtime</strong></td><td>Total time spent training. Useful for estimating training cost and efficiency.</td></tr><tr><td><strong>train_samples_per_second</strong></td><td>Measures training throughput. Higher values indicate faster training.</td></tr><tr><td><strong>train_steps_per_second</strong></td><td>Indicates how many training steps are completed per second. Reflects training speed.</td></tr><tr><td><strong>total_flos</strong></td><td>Total floating point operations used. Helps estimate computational cost and model complexity.</td></tr><tr><td><strong>train_loss</strong></td><td>Measures how well the model fits the training data. Should decrease over time if training is effective.</td></tr></tbody></table>
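
The training performance metrics are related by simple arithmetic, which helps when estimating cost. The sketch below assumes that `train_runtime` is reported in seconds and that the two throughput metrics are averages over the whole run; neither assumption is confirmed on this page, and `summarize_run` is a hypothetical helper.

```python
def summarize_run(train_runtime, train_samples_per_second, train_steps_per_second):
    """Derive run totals from the reported performance metrics.

    Assumes `train_runtime` is in seconds and the throughput values are
    run-wide averages (assumptions, not confirmed by the documentation).
    """
    return {
        "samples_processed": round(train_samples_per_second * train_runtime),
        "steps_completed": round(train_steps_per_second * train_runtime),
        "samples_per_step": round(train_samples_per_second / train_steps_per_second, 2),
        "runtime_hours": round(train_runtime / 3600, 2),
    }


# Made-up values: a 30-minute run at 16 samples/s and 0.5 steps/s.
print(summarize_run(train_runtime=1800.0,
                    train_samples_per_second=16.0,
                    train_steps_per_second=0.5))
```

Multiplying `runtime_hours` by your hourly GPU price gives a rough cost estimate for the run.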

## System Metrics

<figure><img src="/files/ONSaxdMvGElX7Vtd4psO" alt=""><figcaption></figcaption></figure>

**System metrics** are collected to monitor the hardware and infrastructure performance during model training and evaluation. These metrics help identify resource bottlenecks, optimize hardware utilization, and ensure stable and efficient training processes.

<table><thead><tr><th width="230">Metric</th><th>What it evaluates</th></tr></thead><tbody><tr><td><strong>GPU Utilization (%)</strong></td><td>Measures how much of the GPU’s processing power is being used. High values indicate the GPU is actively working; low values may suggest bottlenecks elsewhere (e.g., data loading).</td></tr><tr><td><strong>CPU Utilization (%)</strong></td><td>Indicates how much of the CPU is being used. Useful for detecting whether CPU is a bottleneck in data preprocessing or I/O operations.</td></tr><tr><td><strong>GPU Power Usage (W)</strong></td><td>Shows the actual power consumption of the GPU in watts. Helps monitor energy efficiency and thermal limits.</td></tr><tr><td><strong>GPU Power Usage (%)</strong></td><td>Percentage of the GPU’s maximum power capacity being used. Useful for understanding how close the GPU is to its power limits.</td></tr><tr><td><strong>GPU Memory Usage (MB)</strong></td><td>Amount of GPU memory currently in use. Important for ensuring the model and data fit within available memory.</td></tr><tr><td><strong>GPU Memory Usage (%)</strong></td><td>Percentage of total GPU memory being used. High usage may cause memory overflow or instability.</td></tr><tr><td><strong>RAM Usage (MB)</strong></td><td>Amount of system RAM being used. Helps monitor memory pressure from data loading, preprocessing, or model components.</td></tr><tr><td><strong>RAM Usage (%)</strong></td><td>Percentage of total system RAM in use. High values may indicate a need for memory optimization or hardware upgrades.</td></tr><tr><td><strong>Network - Bandwidth</strong></td><td>Measures data transfer rate over the network. Important in distributed training or when loading data from remote sources. Low bandwidth can slow down training.</td></tr></tbody></table>
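
The bottleneck patterns described in the table can be expressed as a small rule set. This is a minimal sketch that assumes you read the percentage values off the **System metrics** charts yourself; the `diagnose` function and every threshold in it are hypothetical and only illustrate the reasoning.

```python
def diagnose(gpu_util, cpu_util, gpu_mem_pct, ram_pct):
    """Map a snapshot of system metrics (all in percent) to troubleshooting hints.

    Thresholds are illustrative assumptions, not platform limits.
    """
    hints = []
    if gpu_util < 50 and cpu_util > 90:
        hints.append("GPU underused while CPU is saturated: check data loading and preprocessing")
    if gpu_mem_pct > 95:
        hints.append("GPU memory nearly full: consider a smaller batch size")
    if ram_pct > 90:
        hints.append("System RAM nearly full: check data loading and preprocessing memory use")
    return hints


# Made-up snapshot: idle GPU, busy CPU.
for hint in diagnose(gpu_util=35, cpu_util=97, gpu_mem_pct=60, ram_pct=70):
    print(hint)
```

A single snapshot can be misleading, so it is safer to apply rules like these to values that persist over several minutes.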

## Logs

**Logs** provide detailed insight into the execution of a specific fine-tuning pipeline. They help you monitor progress, troubleshoot issues, and maintain transparency in your model training workflows.

<figure><img src="/files/hB1tlyoEZC7kAwqgu41p" alt=""><figcaption></figcaption></figure>

You can use Logs to:

* **Trace the sequence of events** using timestamps
* **Check status messages** for errors or warnings
* **Download logs** before contacting support for faster resolution
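
After downloading a log, you can scan it for errors and warnings before opening a support request. The sketch below assumes the downloaded log is plain text and that problem lines contain the words `ERROR` or `WARNING`; the actual log format is not specified on this page, and the sample lines are invented purely for illustration.

```python
import re

# Assumed level keywords; adjust to match the real log format.
LEVEL_PATTERN = re.compile(r"\b(ERROR|WARNING)\b")


def find_problems(log_text):
    """Return (line_number, level, line) for each line mentioning ERROR or WARNING."""
    problems = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        match = LEVEL_PATTERN.search(line)
        if match:
            problems.append((number, match.group(1), line.strip()))
    return problems


# Invented sample lines, not real platform output.
sample_log = """\
2025-01-01 10:00:00 INFO example start message
2025-01-01 10:05:12 WARNING example warning message
2025-01-01 10:09:40 ERROR example error message
"""
for number, level, line in find_problems(sample_log):
    print(f"line {number} [{level}]: {line}")
```

To run it on a real file, read the downloaded log with `open(path).read()` and pass the text to `find_problems`.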
