# Select Dataset

Start by preparing a high-quality dataset, because data quality directly affects how well the model performs on your intended use case.

Use the following guidelines to review and improve dataset quality:

* **Collect examples to target remaining issues.**
  * If the model still isn't good at certain aspects, add training examples that directly show the model how to do these aspects correctly.
* **Scrutinize existing examples for issues.**
  * If your model has grammar, logic, or style issues, check whether your data has the same issues. For instance, if the model now says "I will schedule this meeting for you" (when it shouldn't), check whether existing examples teach the model to claim it can do things that it can't.
* **Consider the balance and diversity of data.**
  * If 60% of the assistant responses in the data say "I cannot answer this", but at inference time only 5% of responses should say that, you will likely get an overabundance of refusals.
* **Make sure your training examples contain all of the information needed for the response.**
  * If the model should compliment a user based on their personal traits, but a training example includes assistant compliments for traits not found in the preceding conversation, the model may learn to hallucinate information.
* **Look at the agreement and consistency in the training examples.**
  * If multiple people created the training data, model performance will likely be limited by the level of agreement and consistency between them. For instance, in a text extraction task, if annotators agreed on only 70% of extracted snippets, the model would likely not be able to do better than this.
* **Make sure all of your training examples use the same format that you expect at inference time.**
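
Two of the checks above (balance of responses and format consistency) are easy to automate before uploading. The following is a minimal sketch, not a platform feature: it assumes Alpaca-style records with hypothetical `instruction`, `input`, and `output` fields, and the helper names `refusal_rate` and `key_layouts` are invented for illustration.

```python
from collections import Counter


def refusal_rate(examples, refusal_phrase="I cannot answer this"):
    """Return the fraction of responses that contain the refusal phrase."""
    if not examples:
        return 0.0
    phrase = refusal_phrase.lower()
    # Assumption: the assistant response is stored under the "output" key.
    refusals = sum(1 for ex in examples if phrase in ex.get("output", "").lower())
    return refusals / len(examples)


def key_layouts(examples):
    """Count how many examples use each distinct set of field names."""
    return Counter(tuple(sorted(ex)) for ex in examples)


# Tiny illustrative dataset; the third record is missing the "input" field.
examples = [
    {"instruction": "What is 2 + 2?", "input": "", "output": "4"},
    {"instruction": "Who will win the lottery?", "input": "", "output": "I cannot answer this."},
    {"instruction": "Translate 'hello' to French.", "output": "bonjour"},
]

rate = refusal_rate(examples)
layouts = key_layouts(examples)

print(f"Refusal rate: {rate:.0%}")
if len(layouts) > 1:
    print("Inconsistent field layouts found:", dict(layouts))
```

Compare the reported refusal rate with the rate you expect at inference time, and resolve any extra field layouts so that every example shares one format.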

<figure><img src="/files/1FJMRbLw7WiDpSdWC7gM" alt=""><figcaption></figcaption></figure>

You can provide the **Training data** and **Evaluation data** in two ways:

* Upload a file
  1. **Upload file** is selected by default.
  2. Choose a local file from your computer.
  3. (Optional) Click **Download sample** to see an example of the expected format.

**Notice:** Ensure the file matches the selected data format.

| Trainer          | Supported data format | Supported file format                             | Supported file size |
| ---------------- | --------------------- | ------------------------------------------------- | ------------------- |
| **SFT**          | Alpaca                | <p>CSV<br>JSON<br>JSONLINES<br>ZIP<br>PARQUET</p> | Limit 100MB         |
| **SFT**          | ShareGPT              | <p>JSON<br>JSONLINES<br>ZIP<br>PARQUET</p>        | Limit 100MB         |
| **SFT**          | ShareGPT\_Image       | <p>ZIP<br>PARQUET</p>                             | Limit 100MB         |
| **DPO**          | ShareGPT              | <p>JSON<br>JSONLINES<br>ZIP<br>PARQUET</p>        | Limit 100MB         |
| **Pre-training** | Corpus                | <p>TXT<br>JSON<br>JSONLINES<br>ZIP<br>PARQUET</p> | Limit 100MB         |
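
The table above can be turned into a quick local check before uploading. This is a rough sketch under stated assumptions: the file extensions (for example `.jsonl` for JSONLINES), the reading of 100MB as 100 × 1024 × 1024 bytes, and the helper name `check_upload` are all assumptions made for illustration, not documented platform behavior.

```python
from pathlib import PurePath

# Transcribed from the table above; the extension spellings are assumptions.
SUPPORTED = {
    ("SFT", "Alpaca"): {"csv", "json", "jsonl", "zip", "parquet"},
    ("SFT", "ShareGPT"): {"json", "jsonl", "zip", "parquet"},
    ("SFT", "ShareGPT_Image"): {"zip", "parquet"},
    ("DPO", "ShareGPT"): {"json", "jsonl", "zip", "parquet"},
    ("Pre-training", "Corpus"): {"txt", "json", "jsonl", "zip", "parquet"},
}

# Assumption: the 100MB limit is interpreted as binary megabytes.
MAX_BYTES = 100 * 1024 * 1024


def check_upload(trainer, data_format, filename, size_bytes):
    """Return a list of problems; an empty list means the local checks pass."""
    allowed = SUPPORTED.get((trainer, data_format))
    if allowed is None:
        return [f"unsupported trainer/data format: {trainer}/{data_format}"]

    problems = []
    ext = PurePath(filename).suffix.lower().lstrip(".")
    if ext not in allowed:
        problems.append(f"file type '{ext}' is not listed for {trainer}/{data_format}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes, above the {MAX_BYTES}-byte limit")
    return problems


print(check_upload("SFT", "Alpaca", "train.csv", 2_000_000))
print(check_upload("SFT", "ShareGPT", "train.csv", 2_000_000))
```

For a real file, pass `os.path.getsize(path)` as `size_bytes`. The check covers only file type and size; it does not verify that the file contents match the selected data format.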

* Connect to Data Hub
  1. Click **Data Hub**.
  2. Select a connection or dataset from the Data Hub. **Notice:** Ensure the dataset is compatible with the selected data format.
  3. (Optional) Click **Open Data Hub** to preview or manage datasets.
  4. (Optional) Click the **Reload** icon to update the connection and dataset list.
  5. For detailed steps, see the Data Hub guide.

