Overview

Data Hub is the centralized data management module within AI Studio. It enables users to store, organize, version, and prepare datasets used across the AI model lifecycle — including fine-tuning, testing, and benchmarking.

By integrating seamlessly with other AI Studio services such as Model Fine-tuning and Model Testing, Data Hub ensures that your datasets remain consistent, traceable, and reusable.

Key Capabilities

Feature
Description

Dataset Management

Upload, list, and organize datasets with metadata (name, description and data format).

Secure Storage

Provides scalable, encrypted storage for structured and unstructured data.

Data Access Integration

Supports direct linkage to fine-tuning and testing jobs without manual file handling.

Presigned URL Uploads

Enables large dataset uploads efficiently through presigned URLs or API endpoints.

Search & Filtering

Quickly find datasets by name or creation date using flexible filters.

Supported Data Types

Data Hub supports a wide range of file types commonly used in machine learning workflows:

  • Data format: Alpaca, ShareGPT, ShareGPT_Image, Corpus

  • Structured data: CSV, JSON, Parquet

  • Text data: TXT, JSONL

  • Unstructured data (optional): Images or documents used for multimodal fine-tuning

Each dataset must include a defined schema or format compatible with your chosen trainer.

Integration Across AI Studio

Data Hub serves as the data foundation for all modules in AI Studio:

Module
How It Uses Data Hub

Model Fine-tuning

Accesses datasets to train or adapt pretrained models.

Model Testing

Retrieves evaluation or benchmark datasets for validation.

This tight integration ensures complete lineage tracking — from dataset to model to deployed endpoint.

Access Methods

You can interact with Data Hub through multiple interfaces:

  1. AI Studio Console – Web-based interface for uploading and managing datasets.

  2. AI Studio API – RESTful API for programmatic dataset operations (upload, list, delete, etc.).

Typical Workflow

  1. Upload your dataset to Data Hub.

  2. Describe it for easy identification.

  3. Reference it when creating a fine-tuning or testing job.

Benefits

  • Centralized and secure data management

  • Automated dataset versioning and lineage tracking

  • Faster access for model training and testing

  • Reduced duplication across teams and projects

  • Simplified compliance and reproducibility

Next Steps

  • Learn how to upload and organize datasets in Data Hub Tutorial.

  • Continue to Fine-tune a model using your dataset in the Quickstart Guide

Last updated