# How to Monitor a Pipeline?

Model Fine-tuning provides metrics and logs to help you monitor and troubleshoot your workloads.

To view your logs and metrics:

1. Open the **Pipeline list**.
2. Open the **Execution history** by clicking the **Pipeline name** in the **Name** column.
3. Open the **Execution details** by clicking the **Execution name** in the **Name** column.
4. Navigate to **Model metrics**, **System metrics** or **Logs**.

This gives you model metrics, system metrics, and logs, making it easy to monitor your execution's activity or diagnose issues.

## Model Metrics

{% hint style="info" %}
**Note**: Model metrics are enabled only when the execution pipeline is in the **Running** status and at the **Training** stage.
{% endhint %}

**Model metrics** are collected to track the training performance of AI models during and after the fine-tuning process. These metrics help detect training anomalies, guide hyperparameter adjustments, and improve model performance.

**Training metrics:**

| Metric | What it evaluates |
| --- | --- |
| loss | Measures how well the model is learning. A high loss means poor prediction; a low loss means the model is fitting the data well. |
| learning_rate | Controls how fast the model learns. A learning rate that’s too high can cause instability; too low can slow down training. |
| grad_norms | Indicates the magnitude of gradients. Helps detect issues like vanishing or exploding gradients, which affect learning. |
| epoch | Tracks how many full passes the model has made over the training data. Useful for monitoring learning progress over time. |
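
As an illustration of how the training metrics above can be used to detect anomalies, here is a minimal Python sketch. It assumes you have copied metric values into a list of dictionaries yourself; the `step` key, the function name, and both thresholds are hypothetical choices for this example and are not part of the platform.

```python
import math


def find_training_anomalies(records, grad_norm_limit=10.0, loss_spike_ratio=1.5):
    """Return (step, reason) pairs for training steps that look suspicious.

    `records` is a hypothetical list of dicts with the keys
    "step", "loss", and "grad_norms". Thresholds are illustrative only.
    """
    anomalies = []
    previous_loss = None
    for record in records:
        step = record["step"]
        loss = record["loss"]
        grad_norm = record["grad_norms"]

        if not math.isfinite(loss):
            # NaN or infinite loss usually means training has diverged.
            anomalies.append((step, "loss is not finite"))
        elif previous_loss is not None and loss > previous_loss * loss_spike_ratio:
            # A sudden jump relative to the previous finite loss.
            anomalies.append((step, "loss spiked"))

        if grad_norm > grad_norm_limit:
            anomalies.append((step, "possible exploding gradients"))

        if math.isfinite(loss):
            previous_loss = loss
    return anomalies


records = [
    {"step": 1, "loss": 2.0, "grad_norms": 1.0},
    {"step": 2, "loss": 1.8, "grad_norms": 0.9},
    {"step": 3, "loss": 3.5, "grad_norms": 25.0},
]
print(find_training_anomalies(records))
```

A flagged step is only a hint: a loss spike combined with a large gradient norm often points to a learning rate that is too high, but the right thresholds depend on the model and dataset.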

**Evaluation metrics:**

{% hint style="info" %}
Note: Only shown when evaluation data is used.
{% endhint %}

| Metric | What it evaluates |
| --- | --- |
| eval_runtime | Measures how long the evaluation process takes. Useful for performance benchmarking. |
| eval_samples_per_second | Indicates evaluation throughput. Higher is better for faster model validation. |
| eval_steps_per_second | Measures how many evaluation steps are completed per second. Reflects evaluation efficiency. |
| eval_loss | Measures how well the model generalizes to unseen data. Helps detect overfitting or underfitting. |
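
One common way to use `eval_loss` for spotting overfitting is to check whether it keeps rising while the training loss keeps falling. The sketch below shows that heuristic under stated assumptions: the two lists are values you have collected yourself, and the function names and the `patience` parameter are hypothetical.

```python
def eval_loss_rising(eval_losses, patience=2):
    """True if eval_loss increased in each of the last `patience` evaluations."""
    if len(eval_losses) < patience + 1:
        return False
    recent = eval_losses[-(patience + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


def likely_overfitting(train_losses, eval_losses, patience=2):
    """Heuristic: training loss improved overall while eval_loss keeps rising."""
    if len(train_losses) < 2:
        return False
    train_improving = train_losses[-1] < train_losses[0]
    return train_improving and eval_loss_rising(eval_losses, patience)


train_losses = [2.0, 1.5, 1.0, 0.7]
eval_losses = [1.9, 1.6, 1.7, 1.8]
print(likely_overfitting(train_losses, eval_losses))
```

This is a rule of thumb rather than a definitive test; noisy evaluation sets may need a larger `patience` before the signal is trustworthy.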

**Training performance metrics:**

| Metric | What it evaluates |
| --- | --- |
| train_runtime | Total time spent training. Useful for estimating training cost and efficiency. |
| train_samples_per_second | Measures training throughput. Higher values indicate faster training. |
| train_steps_per_second | Indicates how many training steps are completed per second. Reflects training speed. |
| total_flos | Total floating point operations used. Helps estimate computational cost and model complexity. |
| train_loss | Measures how well the model fits the training data. Should decrease over time if training is effective. |
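
The throughput metrics above relate to each other through the total runtime. The following sketch shows that arithmetic, assuming (as the metric names suggest, but the platform does not explicitly confirm) that each throughput value is a total count divided by `train_runtime` in seconds. The function name and its parameters are hypothetical.

```python
def training_throughput(num_samples, num_steps, total_flos, runtime_seconds):
    """Derive per-second rates from totals and the training runtime.

    Assumes throughput = total count / runtime in seconds.
    """
    if runtime_seconds <= 0:
        raise ValueError("runtime_seconds must be positive")
    return {
        "samples_per_second": num_samples / runtime_seconds,
        "steps_per_second": num_steps / runtime_seconds,
        # Average floating point operations per second over the whole run.
        "flops_per_second": total_flos / runtime_seconds,
    }


rates = training_throughput(
    num_samples=10_000, num_steps=500, total_flos=1e15, runtime_seconds=250.0
)
print(rates)
```

Comparing these rates across executions with different batch sizes or hardware gives a rough view of training cost and efficiency.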

## System Metrics

**System metrics** are collected to monitor the hardware and infrastructure performance during model training and evaluation. These metrics help identify resource bottlenecks, optimize hardware utilization, and ensure stable and efficient training processes.

| Metric | What it evaluates |
| --- | --- |
| GPU Utilization (%) | Measures how much of the GPU’s processing power is being used. High values indicate the GPU is actively working; low values may suggest bottlenecks elsewhere (e.g., data loading). |
| CPU Utilization (%) | Indicates how much of the CPU is being used. Useful for detecting whether CPU is a bottleneck in data preprocessing or I/O operations. |
| GPU Power Usage (W) | Shows the actual power consumption of the GPU in watts. Helps monitor energy efficiency and thermal limits. |
| GPU Power Usage (%) | Percentage of the GPU’s maximum power capacity being used. Useful for understanding how close the GPU is to its power limits. |
| GPU Memory Usage (MB) | Amount of GPU memory currently in use. Important for ensuring the model and data fit within available memory. |
| GPU Memory Usage (%) | Percentage of total GPU memory being used. High usage may cause memory overflow or instability. |
| RAM Usage (MB) | Amount of system RAM being used. Helps monitor memory pressure from data loading, preprocessing, or model components. |
| RAM Usage (%) | Percentage of total system RAM in use. High values may indicate a need for memory optimization or hardware upgrades. |
| Network - Bandwidth | Measures data transfer rate over the network. Important in distributed training or when loading data from remote sources. Low bandwidth can slow down training. |
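
To make the interpretation of the system metrics above concrete, here is a small sketch that converts a memory reading in MB to a percentage and applies two simple bottleneck rules. The function names and all thresholds (50%, 90%) are assumptions chosen for illustration; they are not values defined by the platform.

```python
def usage_percent(used_mb, total_mb):
    """Convert an absolute usage in MB to a percentage of the total."""
    if total_mb <= 0:
        raise ValueError("total_mb must be positive")
    return 100.0 * used_mb / total_mb


def diagnose(gpu_util, cpu_util, gpu_mem_used_mb, gpu_mem_total_mb):
    """Return a list of human-readable findings for one metrics sample.

    Thresholds are illustrative, not platform-defined.
    """
    findings = []
    if gpu_util < 50 and cpu_util > 90:
        # GPU is waiting while the CPU is saturated, e.g. by preprocessing.
        findings.append("GPU underused while CPU is saturated: check data loading")
    if usage_percent(gpu_mem_used_mb, gpu_mem_total_mb) > 90:
        findings.append("GPU memory nearly full: risk of out-of-memory errors")
    return findings


print(usage_percent(20_480, 40_960))
print(diagnose(gpu_util=30, cpu_util=95, gpu_mem_used_mb=39_000, gpu_mem_total_mb=40_960))
```

Readings fluctuate during training, so rules like these are more meaningful when applied to values averaged over a window rather than to a single sample.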

## Logs

**Logs** provide detailed insights into the execution of a specific fine-tuning pipeline. They help you monitor progress, troubleshoot issues, and maintain transparency in your model training workflows.

You can use Logs to:

* **Trace the sequence of events** using timestamps
* **Check status messages** for errors or warnings
* **Download logs** before contacting support for faster resolution
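
Once a log file has been downloaded, errors and warnings can be pulled out with a short script. The sketch below assumes a hypothetical line layout of `<timestamp> <LEVEL> <message>`; the actual format of the downloaded logs is not documented here, so the regular expression will likely need adjusting.

```python
import re

# Hypothetical layout: "<timestamp> <LEVEL> <message>". Adjust to the real format.
LINE_PATTERN = re.compile(r"^(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+(?P<message>.*)$")


def find_problems(log_text, levels=("ERROR", "WARNING")):
    """Return (timestamp, level, message) tuples for lines at the given levels."""
    problems = []
    for line in log_text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match and match.group("level") in levels:
            problems.append(
                (match.group("timestamp"), match.group("level"), match.group("message"))
            )
    return problems


sample_log = """\
2024-01-01T10:00:00Z INFO Training started
2024-01-01T10:05:00Z WARNING Learning rate is very high
2024-01-01T10:07:30Z ERROR CUDA out of memory
"""
for timestamp, level, message in find_problems(sample_log):
    print(timestamp, level, message)
```

Because results are returned in file order, the timestamps show the sequence in which the problems occurred.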