Select Dataset Format

The dataset format depends on the selected trainer.

Trainer
Supported data format
Supported file format
Supported file size

SFT

Alpaca

CSV JSON JSONLINES ZIP PARQUET

Limit 100MB

SFT

ShareGPT

JSON JSONLINES ZIP PARQUET

Limit 100MB

SFT

ShareGPT_Image

ZIP PARQUET

Limit 100MB

DPO

ShareGPT

JSON JSONLINES ZIP PARQUET

Limit 100MB

Pre-training

Corpus

TXT JSON JSONLINES ZIP PARQUET

Limit 100MB

We currently support data formats for fine-tuning include:

a/ Alpaca

Alpaca uses a very simple structure to fine-tune the model with Instruction-following format with input, output pairs for supervised fine-tuning tasks. The basic structure includes:

  • Instruction: A string containing the specific task or request that the model needs to perform.

  • Input: A string containing the information that the model needs to process in order to carry out the task.

  • Output: A string representing the result the model should return, generated from processing the instruction and input.

Detailed Structure:

Examples:

Samples: https://github.com/fpt-corp/ai-studio-samples/tree/main/sample-datasets/alpaca

Supported file format: .csv, .json, .jsonlines, .zip, .parquet

b/ ShareGPT

b.1/ Trainer = SFT

ShareGPT is designed to represent multi-turn conversations (back-and-forth chats) between a user and an AI assistant. It is commonly used when training or fine-tuning models for dialogue systems or chatbots that need to handle contextual conversation over multiple turns.

Each data sample consists of a conversations array, where each turn in the chat includes:

  • from: Who is speaking — usually "human" or "gpt".

  • value: The actual message text from that speaker.

Detailed Structure:

Examples:

Samples: https://github.com/fpt-corp/ai-studio-samples/tree/main/sample-datasets/sharegpt

Supported file format: .json, .jsonlines, .zip, .parquet

b.2 / Trainer = DPO

ShareGPT_DPO is a dataset consisting of conversations (prompt + response) collected from ShareGPT, along with pairs of responses that have been ranked by humans based on which one is better. It is used to:

  • Train language models like GPT to respond in ways that align with human preferences.

  • Optimize response quality using the DPO (Direct Preference Optimization) method.

Detailed Structures:

Examples:

Samples: https://github.com/fpt-corp/ai-studio-samples/tree/main/sample-datasets/sharegpt-dpo

Supported file format: .json, .jsonlines, .zip, .parquet

c/ ShareGPT_Image

ShareGPT_Image is an extension of the ShareGPT multi-turn chat format, designed specifically for multi-modal training — that is, training models that handle both text and images in conversations.

It’s used in fine-tuning vision-language models (VLMs), which need to process images alongside natural language.

The structure includes:

  • A list of chat turns under "message" (same as ShareGPT).

  • A field called "image" or "image_path" that points to the image used in the conversation (using format png, jpg, jpeg)

Notice:

  • Must include the image token in the chat content where an image should appear.

  • If there are multiple images:

    • Image paths must be defined in an images array.

    • The positions of the images in the chat flow are indicated by the image tokens

    • The number of image tokens in the chat must match the number of items in the images array.

    • Images will be mapped in order of appearance, with each image token replaced by the corresponding image from the images array.

Detailed Structures:

Examples:

Samples: https://github.com/fpt-corp/ai-studio-samples/tree/main/sample-datasets/sharegpt-image

Supported file format: .zip, .parquet

d/ Corpus

Corpus is a collection of text used for training or fine-tuning language models.

Each data point in the corpus includes a "text" field with a string of text. This format is commonly used when you don't need to distinguish between instruction and output, but just want to provide raw text data for the model to learn from

Detailed Structure:

Examples:

Samples: https://github.com/fpt-corp/ai-studio-samples/tree/main/sample-datasets/corpus

Supported file format: .txt, .json, .jsonlines, .zip, .parquet

Last updated