Guideline for Fine-tuning with LoRA
This guide focuses on fine-tuning with LoRA and deploying a model with on-demand and serverless hosting.
1. Recommended configuration
We currently support fine-tuning models with the following architectures:
| Model | Template | Learning rate (small dataset) | Learning rate (medium/large dataset) | Epochs |
| --- | --- | --- | --- | --- |
| Qwen-3 / Qwen3-4B-Instruct | 1 GPU | 1e-5 → 5e-5 | 5e-5 → 1e-4 | Small: 1–3; Medium: 3–5 |
| google/gemma-3-27b-it | 2 GPUs | 1e-5 → 5e-5 | 5e-5 → 1e-4 | 3 (start) |
| meta-llama/Llama-3.3-70B | 4 GPUs | 1e-5 → 2e-5 | 2e-5 → 1e-4 | 3 (start) |

Dataset size definitions:
- Small: under 1,000 samples
- Medium: 1K–10K samples
- Large: over 10K samples
2. Dataset Format
- **Alpaca**: easy to prepare, suitable for basic instruction-tuning use cases (summarization, QA, rewriting).
- **ShareGPT**: suitable when you want to fine-tune a chatbot with multi-turn conversation capability.
In the fine-tuning process, data refers to a curated set of example inputs and outputs used to retrain a pre-trained AI model. This data teaches the model to adapt its behavior to suit your specific domain, task, or tone of voice.
2.1. Alpaca
Alpaca uses a very simple instruction-following structure with input/output pairs for supervised fine-tuning tasks. The basic structure includes:
Instruction: A string containing the specific task or request that the model needs to perform.
Input: A string containing the information that the model needs to process in order to carry out the task.
Output: A string representing the result the model should return, generated from processing the instruction and input.
Detailed Structure:
[ { "instruction": "string", "input": "string", "output": "string" } ]
Sample datasets: https://github.com/fpt-corp/ai-studio-samples/tree/main/sample-datasets/alpaca
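The structure above can be sketched in a few lines of Python. This is a minimal illustration with invented record content; only the three field names come from the format definition.

```python
import json

# A minimal sketch: writing two Alpaca-format records (invented content)
# to a JSON file ready for upload.
samples = [
    {
        "instruction": "Summarize the following text in one sentence.",
        "input": "LoRA fine-tunes large models by training small adapter matrices.",
        "output": "LoRA adapts large models cheaply via small trainable adapters.",
    },
    {
        "instruction": "Translate the input to French.",
        "input": "Good morning",
        "output": "Bonjour",
    },
]

# Each record must provide the three string fields described above.
for s in samples:
    assert all(isinstance(s.get(k), str) for k in ("instruction", "input", "output"))

with open("alpaca_dataset.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```

`ensure_ascii=False` keeps non-English text readable in the file, which also matters for the UTF-8 requirement discussed later in this guide.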
2.2. ShareGPT
a. Trainer = SFT
ShareGPT is designed to represent multi-turn conversations (back-and-forth chats) between a user and an AI assistant. It is commonly used when training or fine-tuning models for dialogue systems or chatbots that need to handle contextual conversation over multiple turns.
Each data sample consists of a conversations array, where each turn in the chat includes:
- `from`: who is speaking, usually `"human"` or `"system"`.
- `value`: the actual message text from that speaker.
Detailed Structure:
[ { "conversations": [ { "from": "string", "value": "string" } ] } ]
Sample datasets: https://github.com/fpt-corp/ai-studio-samples/tree/main/sample-datasets/sharegpt
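A multi-turn sample in this format can be sketched as follows. The speaker tags (`"human"`/`"gpt"`) and the dialogue content are illustrative assumptions; only the `conversations`/`from`/`value` field names come from the structure above.

```python
import json

# A minimal sketch of one multi-turn ShareGPT sample (invented dialogue).
sample = {
    "conversations": [
        {"from": "human", "value": "What is LoRA?"},
        {"from": "gpt", "value": "A parameter-efficient fine-tuning method."},
        {"from": "human", "value": "Why is it cheaper than full fine-tuning?"},
        {"from": "gpt", "value": "Only small low-rank adapter matrices are trained."},
    ]
}

# Every turn must carry both fields as strings.
for turn in sample["conversations"]:
    assert isinstance(turn["from"], str) and isinstance(turn["value"], str)

with open("sharegpt_dataset.json", "w", encoding="utf-8") as f:
    json.dump([sample], f, ensure_ascii=False, indent=2)
```

Because each sample carries the whole conversation, the model sees earlier turns as context when learning to produce later replies.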
2.3. ShareGPT_Image
ShareGPT_Image is an extension of the ShareGPT multi-turn chat format, designed specifically for multi-modal training — that is, training models that handle both text and images in conversations.
It’s used in fine-tuning vision-language models (VLMs), which need to process images alongside natural language.
The structure includes:
- A list of chat turns under `"message"` (same as ShareGPT).
- A field called `"image"` or `"image_path"` that points to the image used in the conversation (supported formats: png, jpg, jpeg).
Note
- The `<image>` token must appear in the chat content wherever an image should be placed.
- If there are multiple images:
  - Image paths must be defined in an `images` array.
  - The positions of the images in the chat flow are indicated by the `<image>` tokens.
  - The number of `<image>` tokens in the chat must match the number of items in the `images` array.
  - Images are mapped in order of appearance, with each `<image>` token replaced by the corresponding image from the `images` array.
Detailed Structures:
[ { "message": [ { "role": "string", "content": "string" } ], "images": [ "images/0.jpg" ] } ]
Sample datasets: https://github.com/fpt-corp/ai-studio-samples/tree/main/sample-datasets/sharegpt-image
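The token-matching rule for images can be checked mechanically. A minimal sketch follows; the token spelling `<image>` and the field names follow this section's structure, but your training framework may use different conventions.

```python
# A minimal sketch of the image-token rule: the number of image tokens in
# the chat must equal the number of entries in the images array, and each
# token is replaced by the corresponding image in order of appearance.
def image_tokens_match(sample, token="<image>"):
    n_tokens = sum(turn["content"].count(token) for turn in sample["message"])
    return n_tokens == len(sample.get("images", []))

sample = {
    "message": [
        {"role": "user", "content": "<image> What animal is in this photo?"},
        {"role": "assistant", "content": "A cat."},
    ],
    "images": ["images/0.jpg"],
}
print(image_tokens_match(sample))  # True
```

Running this check before upload catches the most common ShareGPT_Image mistake: adding an image path without placing a matching token in the chat.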
3. Training and Validation Data
Training data (required): the main dataset used for model training.
Validation data (recommended): used to evaluate model quality during training.
Data Split Recommendation
Train/Validation split: 80% / 20%.
Small dataset (<1,000 samples): you may use the entire dataset for training, but quality will be harder to verify.
Large dataset (>10,000 samples): always prepare a separate validation set.
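The recommended 80/20 split can be sketched as below. The function name and the prompt/completion records are illustrative; a fixed seed keeps the split reproducible across runs.

```python
import random

# A minimal sketch of the recommended 80/20 train/validation split.
def split_dataset(samples, val_ratio=0.2, seed=42):
    rng = random.Random(seed)      # fixed seed for a reproducible split
    shuffled = samples[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_ratio)
    return shuffled[n_val:], shuffled[:n_val]

data = [{"prompt": f"q{i}", "completion": f"a{i}"} for i in range(1000)]
train, val = split_dataset(data)
print(len(train), len(val))  # 800 200
```

Shuffling before splitting matters: if the dataset is ordered by topic, a tail-end validation slice would not represent the training distribution.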
4. How to Prepare Your Dataset?
In addition to checking file format (CSV, JSON, JSONL, Parquet, ZIP), the system should also validate the dataset content before fine-tuning.
| Trainer | Format | Supported file types | Size limit |
| --- | --- | --- | --- |
| SFT | Alpaca | CSV, JSON, JSONLINES, ZIP, PARQUET | 100MB |
| SFT | ShareGPT | JSON, JSONLINES, ZIP, PARQUET | 100MB |
| SFT | ShareGPT_Image | ZIP, PARQUET | 100MB |
4.1. Basic Validation (Format-level)
File size must not exceed 100MB.
Must be in a supported format.
Training set must contain at least 100 samples.
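The format-level checks above can be run locally before upload. A minimal sketch for a JSON dataset; the file name and helper are hypothetical, and the two thresholds come from this section.

```python
import json
import os

MAX_BYTES = 100 * 1024 * 1024   # 100MB size cap
MIN_SAMPLES = 100               # minimum training-set size

def basic_validate(path):
    """Return a list of format-level problems (empty list = passes)."""
    errors = []
    if os.path.getsize(path) > MAX_BYTES:
        errors.append("file exceeds 100MB")
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if len(data) < MIN_SAMPLES:
        errors.append(f"only {len(data)} samples; at least {MIN_SAMPLES} required")
    return errors

# Hypothetical example file with 150 prompt/completion records.
with open("train.json", "w", encoding="utf-8") as f:
    json.dump([{"prompt": f"q{i}", "completion": f"a{i}"} for i in range(150)], f)

print(basic_validate("train.json"))  # []
```

Catching these problems client-side saves a round trip: an oversized or undersized upload would be rejected by the platform anyway.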
4.2. Content Validation (Content-level)
Structure:
- Each record must contain two fields: `prompt` and `completion`.
- Records with missing or empty fields are not accepted.
Text Quality:
Both prompt and completion must be text (not empty objects, numbers, or binary).
Must not contain invalid characters or tokens.
Sample Length:
Each sample ≤ 2048 tokens (standard).
Some models support up to 16k tokens, but require correct configuration.
If exceeded → the system will raise an error.
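A rough length screen can flag oversized samples before upload. Real token counts depend on the model's tokenizer, so the whitespace word count below is only a crude proxy; treat samples near the limit with a real tokenizer before trusting them.

```python
MAX_TOKENS = 2048  # standard per-sample cap from this section

def approx_tokens(sample):
    # Whitespace word count as a rough stand-in for tokenizer counts.
    return len((sample["prompt"] + " " + sample["completion"]).split())

def over_length(samples, limit=MAX_TOKENS):
    """Indices of samples whose approximate length exceeds the limit."""
    return [i for i, s in enumerate(samples) if approx_tokens(s) > limit]

samples = [
    {"prompt": "Summarize LoRA.", "completion": "A parameter-efficient method."},
    {"prompt": "word " * 3000, "completion": "too long"},
]
print(over_length(samples))  # [1]
```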
Duplication:
The system automatically warns if duplicate samples > 10%.
Users are encouraged to clean duplicates before upload.
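The 10% duplicate threshold is easy to check yourself before upload. A minimal sketch, assuming `prompt`/`completion` records; exact-match comparison only, so near-duplicates are not caught.

```python
# A minimal sketch of the duplicate check: warn when more than 10% of
# records repeat an earlier (prompt, completion) pair exactly.
def duplicate_ratio(samples):
    seen, dupes = set(), 0
    for s in samples:
        key = (s["prompt"], s["completion"])
        if key in seen:
            dupes += 1
        else:
            seen.add(key)
    return dupes / len(samples)

samples = [{"prompt": "q", "completion": "a"}] * 3 + [
    {"prompt": f"q{i}", "completion": f"a{i}"} for i in range(7)
]
ratio = duplicate_ratio(samples)
if ratio > 0.10:
    print(f"warning: {ratio:.0%} duplicate samples")  # warning: 20% duplicate samples
```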
Balance:
Prompts and completions should be well-distributed, not biased toward a single type of question.
4.3. Advanced Validation (Recommended)
Check for dataset bias (no harmful or sensitive content).
Ensure UTF-8 encoding to avoid parsing errors.
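The encoding check is straightforward to automate: try to decode the raw bytes and report where the first invalid sequence sits. The file names below are hypothetical.

```python
# A minimal sketch of the UTF-8 check: decode the raw bytes and return
# None on success, or the byte offset of the first invalid byte.
def check_utf8(path):
    with open(path, "rb") as f:
        raw = f.read()
    try:
        raw.decode("utf-8")
        return None
    except UnicodeDecodeError as e:
        return e.start

# Hypothetical files: one valid, one with a stray Latin-1 byte (0xE9).
with open("good.json", "wb") as f:
    f.write('[{"prompt": "café", "completion": "ok"}]'.encode("utf-8"))
with open("bad.json", "wb") as f:
    f.write(b'[{"prompt": "caf\xe9"}]')

print(check_utf8("good.json"), check_utf8("bad.json"))
```

Knowing the byte offset of the first bad sequence makes it much faster to locate a mis-encoded record in a large file.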
5. Common Issues & How to Avoid Them
Context drift: Always keep prompt formatting consistent.
Overly long or redundant completions: Keep only the necessary output.
Noisy data (typos, duplicates): Clean before uploading.
Oversized files (>100MB): Split or compress into ZIP.
Exceeding max length (2048 or 16k tokens): Normalize data before upload.