Guideline for Fine-tuning with LoRA

This guide focuses on fine-tuning with LoRA and deploying a model with on-demand and serverless hosting.

1. Supported Models

We currently support fine-tuning models with the following architectures:

| Model (Resource) | Learning Rate | Suggested Epochs |
| --- | --- | --- |
| Qwen-3 / Qwen3-4B-Instruct (template: 1 GPU) | Small dataset: 1e-5 → 5e-5; medium/large dataset: 5e-5 → 1e-4 | Small dataset: 1–3; medium: 3–5 |
| google/gemma-3-27b-it (template: 2 GPUs) | Small dataset: 1e-5 → 5e-5; medium/large dataset: 5e-5 → 1e-4 | 3 (start) |
| meta-llama/Llama-3.3-70B (template: 4 GPUs) | Small dataset: 1e-5 → 2e-5; medium/large dataset: 2e-5 → 1e-4 | 3 (start) |

  • Small dataset: Under 1000 samples

  • Medium: 1K-10K samples

  • Large: Over 10K samples
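The size buckets above can be turned into a small helper that picks a conservative starting learning rate from the table. This is only a sketch: the ranges come from the table, but the function name and the choice of the low end of each range as a starting point are mine.

```python
# Illustrative helper: map dataset size to a starting learning rate,
# using the low end of the ranges in the table above.
def suggested_lr(num_samples: int, model: str = "Qwen3-4B-Instruct") -> float:
    small = {  # dataset under 1,000 samples
        "Qwen3-4B-Instruct": 1e-5,
        "google/gemma-3-27b-it": 1e-5,
        "meta-llama/Llama-3.3-70B": 1e-5,
    }
    medium_large = {  # 1K samples and above
        "Qwen3-4B-Instruct": 5e-5,
        "google/gemma-3-27b-it": 5e-5,
        "meta-llama/Llama-3.3-70B": 2e-5,
    }
    return small[model] if num_samples < 1000 else medium_large[model]
```

Starting at the low end and increasing the learning rate only if training loss plateaus is generally safer than starting high.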

2. Dataset Format

| Dataset Type | Link to Sample | Note |
| --- | --- | --- |
| Alpaca | https://github.com/fpt-corp/ai-studio-samples/tree/main/sample-datasets/alpaca | Easy to prepare; suitable for basic instruction-tuning use cases (summarization, QA, rewriting). |
| ShareGPT | https://github.com/fpt-corp/ai-studio-samples/tree/main/sample-datasets/sharegpt | Suitable when you want to fine-tune a chatbot with multi-turn conversation capability. |
| ShareGPT_Image | https://github.com/fpt-corp/ai-studio-samples/tree/main/sample-datasets/sharegpt-image | Designed for advanced users working on multimodal AI. |

In the fine-tuning process, data refers to a curated set of example inputs and outputs used to adapt a pre-trained AI model. This data teaches the model to adjust its behavior to suit your specific domain, task, or tone of voice.

2.1. Alpaca

Alpaca uses a very simple instruction-following structure with input/output pairs for supervised fine-tuning tasks. The basic structure includes:

  • Instruction: A string containing the specific task or request that the model needs to perform.

  • Input: A string containing the information that the model needs to process in order to carry out the task.

  • Output: A string representing the result the model should return, generated from processing the instruction and input.

Detailed Structure:

```json
[
  {
    "instruction": "string",
    "input": "string",
    "output": "string"
  }
]
```

Examples:

Samples: https://github.com/fpt-corp/ai-studio-samples/tree/main/sample-datasets/alpaca
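As an illustration, an Alpaca-format file can be built and checked like this. The record content below is invented for demonstration; only the three field names come from the format above.

```python
import json

# A minimal, illustrative Alpaca-format dataset (content is invented).
records = [
    {
        "instruction": "Summarize the following text in one sentence.",
        "input": "LoRA adds small trainable matrices to a frozen model, "
                 "making fine-tuning much cheaper.",
        "output": "LoRA enables cheap fine-tuning by training small adapter matrices.",
    }
]

# Every record must contain non-empty string values for all three fields.
for r in records:
    assert all(isinstance(r.get(k), str) and r[k]
               for k in ("instruction", "input", "output"))

with open("alpaca_sample.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```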

2.2. ShareGPT

a. Trainer = SFT

ShareGPT is designed to represent multi-turn conversations (back-and-forth chats) between a user and an AI assistant. It is commonly used when training or fine-tuning models for dialogue systems or chatbots that need to handle contextual conversation over multiple turns.

Each data sample consists of a conversations array, where each turn in the chat includes:

  • from: Who is speaking, typically "human", "gpt", or "system".

  • value: The actual message text from that speaker.

Detailed Structure:

```json
[
  {
    "conversations": [
      { "from": "string", "value": "string" }
    ]
  }
]
```

Examples:

Samples: https://github.com/fpt-corp/ai-studio-samples/tree/main/sample-datasets/sharegpt
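An illustrative multi-turn ShareGPT record can be assembled like this. The dialogue content is invented, and the role names follow the common ShareGPT convention; check the sample repository for the exact values expected by the platform.

```python
import json

# One illustrative ShareGPT record with a multi-turn conversation.
sample = {
    "conversations": [
        {"from": "human", "value": "What is LoRA?"},
        {"from": "gpt", "value": "LoRA is a parameter-efficient fine-tuning method."},
        {"from": "human", "value": "Why is it cheaper than full fine-tuning?"},
        {"from": "gpt", "value": "Only small low-rank adapter matrices are trained."},
    ]
}

# Each turn must provide both "from" and "value" as non-empty strings.
for turn in sample["conversations"]:
    assert turn["from"] and turn["value"]

# One record per line, e.g. when writing a JSONL file.
line = json.dumps(sample, ensure_ascii=False)
```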

2.3. ShareGPT_Image

ShareGPT_Image is an extension of the ShareGPT multi-turn chat format, designed specifically for multi-modal training — that is, training models that handle both text and images in conversations.

It’s used in fine-tuning vision-language models (VLMs), which need to process images alongside natural language.

The structure includes:

  • A list of chat turns under "message" (same as ShareGPT).

  • A field called "image" or "image_path" that points to the image used in the conversation (supported formats: png, jpg, jpeg).

Note

  • Must include the image token in the chat content where an image should appear.

  • If there are multiple images:

    • Image paths must be defined in an images array.

    • The positions of the images in the chat flow are indicated by the image tokens

    • The number of image tokens in the chat must match the number of items in the images array.

    • Images will be mapped in order of appearance, with each image token replaced by the corresponding image from the images array.

Detailed Structures:

```json
[
  {
    "message": [
      { "role": "string", "content": "string" }
    ],
    "images": [ "images/0.jpg" ]
  }
]
```

Examples:

Samples: https://github.com/fpt-corp/ai-studio-samples/tree/main/sample-datasets/sharegpt-image
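The multi-image rules above can be checked mechanically before upload. This sketch assumes the placeholder token is `<image>`; the actual token may differ, so verify it against the sample dataset.

```python
# Check that the number of image tokens in the chat matches the images array.
# Assumption: "<image>" is the placeholder token used in chat content.
IMAGE_TOKEN = "<image>"

def check_image_tokens(record: dict) -> bool:
    token_count = sum(turn["content"].count(IMAGE_TOKEN)
                      for turn in record["message"])
    return token_count == len(record.get("images", []))

# Illustrative record: two tokens, two images -> valid.
record = {
    "message": [
        {"role": "user", "content": "Compare these photos: <image> <image>"},
        {"role": "assistant", "content": "The first is a cat, the second a dog."},
    ],
    "images": ["images/0.jpg", "images/1.jpg"],
}
```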

3. Training and Validation Data

  • Training data (required): the main dataset used for model training.

  • Validation data (recommended): used to evaluate model quality during training.

Data Split Recommendation

  • Train/Validation split: 80% / 20%.

  • Small dataset (<1,000 samples): you may use the entire dataset for training, but quality will be harder to verify.

  • Large dataset (>10,000 samples): always prepare a separate validation set.

4. How to Prepare Your Dataset?

In addition to checking the file format (CSV, JSON, JSONL, Parquet, ZIP), the system also validates the dataset content before fine-tuning.

| Trainer | Supported data format | Supported file format | Supported file size |
| --- | --- | --- | --- |
| SFT | Alpaca | CSV, JSON, JSONLINES, ZIP, PARQUET | Limit 100MB |
| SFT | ShareGPT | JSON, JSONLINES, ZIP, PARQUET | Limit 100MB |
| SFT | ShareGPT_Image | ZIP, PARQUET | Limit 100MB |

4.1. Basic Validation (Format-level)

  • File size must not exceed 100MB.

  • Must be in a supported format.

  • Training set must contain at least 100 samples.
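The three format-level checks can be sketched as a small pre-upload script. The extension set and the `.jsonl` spelling are assumptions on my part; the 100MB cap and 100-sample minimum come from the rules above.

```python
import os

# Assumed extensions for the supported formats listed above.
SUPPORTED_EXTENSIONS = {".csv", ".json", ".jsonl", ".zip", ".parquet"}
MAX_BYTES = 100 * 1024 * 1024  # 100MB limit

def basic_validate(path: str, num_samples: int) -> list[str]:
    """Return a list of format-level problems (empty list means OK)."""
    errors = []
    if os.path.splitext(path)[1].lower() not in SUPPORTED_EXTENSIONS:
        errors.append("unsupported file format")
    if os.path.exists(path) and os.path.getsize(path) > MAX_BYTES:
        errors.append("file exceeds 100MB")
    if num_samples < 100:
        errors.append("training set must contain at least 100 samples")
    return errors
```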

4.2. Content Validation (Content-level)

Structure:

  • Each record must contain 2 fields: prompt and completion.

  • Records with missing or empty fields are not accepted.

Text Quality:

  • Both prompt and completion must be text (not empty objects, numbers, or binary).

  • Must not contain invalid characters or tokens.

Sample Length:

  • Each sample ≤ 2048 tokens (standard).

  • Some models support up to 16k tokens, but require correct configuration.

  • If exceeded → the system will raise an error.

Duplication:

  • The system automatically warns if duplicate samples > 10%.

  • Users are encouraged to clean duplicates before upload.

Balance:

  • Prompts and completions should be well-distributed, not biased toward a single type of question.

  • Check for dataset bias (no harmful or sensitive content).

  • Ensure UTF-8 encoding to avoid parsing errors.
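The content-level rules above can be sketched as one validation pass. Note that the 2048-token cap is approximated here with a whitespace split; a real check should use the target model's tokenizer, and the 10% duplicate threshold is taken from the rule above.

```python
def content_validate(records: list[dict]) -> tuple[list[str], list[str]]:
    """Return (errors, warnings) for a list of prompt/completion records."""
    errors, warnings = [], []
    seen, duplicates = set(), 0
    for i, rec in enumerate(records):
        prompt, completion = rec.get("prompt"), rec.get("completion")
        # Both fields must be present, non-empty text.
        if not (isinstance(prompt, str) and prompt.strip()
                and isinstance(completion, str) and completion.strip()):
            errors.append(f"record {i}: missing or empty prompt/completion")
            continue
        # Rough length check; a real tokenizer will count differently.
        if len((prompt + " " + completion).split()) > 2048:
            errors.append(f"record {i}: exceeds 2048 tokens (approximate count)")
        key = (prompt, completion)
        if key in seen:
            duplicates += 1
        seen.add(key)
    if records and duplicates / len(records) > 0.10:
        warnings.append("more than 10% duplicate samples")
    return errors, warnings
```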

5. Common Issues & How to Avoid Them

  • Context drift: Always keep prompt formatting consistent.

  • Overly long or redundant completions: Keep only the necessary output.

  • Noisy data (typos, duplicates): Clean before uploading.

  • Oversized files (>100MB): Split or compress into ZIP.

  • Exceeding max length (2048 or 16k tokens): Normalize data before upload.
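For the oversized-file case, a JSONL dataset can be split on record boundaries so each part stays under the limit. This is a generic sketch (function name and part-naming scheme are mine), not a platform tool.

```python
import os

def split_jsonl(path: str, max_bytes: int = 100 * 1024 * 1024) -> list[str]:
    """Split a JSONL file into parts no larger than max_bytes each."""
    part, size, outputs, out = 0, 0, [], None
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line_bytes = len(line.encode("utf-8"))
            # Start a new part when the current one would overflow.
            if out is None or size + line_bytes > max_bytes:
                if out:
                    out.close()
                part += 1
                name = f"{os.path.splitext(path)[0]}.part{part}.jsonl"
                out = open(name, "w", encoding="utf-8")
                outputs.append(name)
                size = 0
            out.write(line)
            size += line_bytes
    if out:
        out.close()
    return outputs
```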
