Load data for machine learning and deep learning

This section covers loading data for machine learning (ML) and deep learning (DL) applications. For general information about loading data, see Standard connectors in Lakeflow Connect.

Store files for data loading and model checkpointing

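Machine learning applications often need shared storage for data loading and model checkpointing. This is particularly important for distributed deep learning, where every worker must read the same data and write checkpoints to a location that all nodes can reach.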

Databricks provides Unity Catalog, a unified governance solution for data and AI assets. You can use Unity Catalog to access data on a cluster through both Spark and local file APIs.
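As a rough illustration of the local file API route, the sketch below assumes the data lives in a Unity Catalog volume, which is exposed under a path of the form /Volumes/<catalog>/<schema>/<volume>/. The helper names (volume_path, save_checkpoint, load_checkpoint) and the catalog, schema, and volume names are hypothetical and only serve the example.

```python
from pathlib import Path


def volume_path(catalog: str, schema: str, volume: str, *parts: str,
                root: str = "/Volumes") -> Path:
    """Build a local-file path inside a Unity Catalog volume.

    The `root` parameter is an assumption added so the helper can be
    pointed at a temporary directory when testing outside Databricks.
    """
    return Path(root, catalog, schema, volume, *parts)


def save_checkpoint(payload: bytes, path: Path) -> Path:
    """Write checkpoint bytes using ordinary local file APIs."""
    path.parent.mkdir(parents=True, exist_ok=True)  # create run directories as needed
    path.write_bytes(payload)
    return path


def load_checkpoint(path: Path) -> bytes:
    """Read checkpoint bytes back from the same path."""
    return path.read_bytes()
```

On a cluster, a call such as save_checkpoint(data, volume_path("main", "ml", "checkpoints", "run1", "epoch_0.bin")) would write under /Volumes/main/ml/checkpoints/run1/, and Spark readers can be given the same /Volumes/... path as a string.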

Load tabular data

You can load tabular machine learning data from tables or files (for example, see Read and write CSV files). You can convert Apache Spark DataFrames into pandas DataFrames using the PySpark method toPandas(), and then optionally convert to NumPy format using the pandas method to_numpy(). Because toPandas() collects all rows to the driver, use it only on data that fits in driver memory.
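The conversion chain can be sketched as a small helper. The function name spark_to_numpy and its columns parameter are hypothetical; the only library calls used are select() and toPandas() on the Spark DataFrame and to_numpy() on the resulting pandas DataFrame.

```python
def spark_to_numpy(spark_df, columns=None):
    """Collect a Spark DataFrame to the driver and return a NumPy array.

    `columns` is an optional list of column names to keep; selecting
    only the needed columns first reduces what is collected.
    """
    if columns is not None:
        spark_df = spark_df.select(*columns)  # Spark: project before collecting
    pandas_df = spark_df.toPandas()           # PySpark: Spark -> pandas on the driver
    return pandas_df.to_numpy()               # pandas: pandas -> NumPy ndarray
```

In a notebook this might be called as spark_to_numpy(spark.read.table("main.ml.features"), ["x1", "x2"]), where the table and column names are placeholders.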

Prepare data to fine-tune large language models

You can prepare your data for fine-tuning open-source large language models with Hugging Face Transformers and Hugging Face Datasets.

See Prepare data for fine tuning Hugging Face models.

Prepare data for distributed deep learning training

This section covers preparing data for distributed deep learning training.

For very large datasets that do not fit in memory, use streaming approaches:

PyTorch IterableDataset for custom streaming logic.
Hugging Face datasets with streaming for datasets hosted on the Hub or in volumes.
Ray Data for distributed batch data processing.
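To make the streaming idea concrete, here is a minimal, library-free sketch of the kind of logic that could sit inside the __iter__ method of a PyTorch IterableDataset: records are read lazily from a JSON Lines file, and each worker keeps only its own share of the lines. The function name stream_jsonl, the JSON Lines format, and the worker_id and num_workers parameters are assumptions made for this example, not part of any library API.

```python
import json
from typing import Iterator


def stream_jsonl(path: str, worker_id: int = 0,
                 num_workers: int = 1) -> Iterator[dict]:
    """Yield parsed records one at a time without loading the whole file.

    Lines are assigned round-robin by line number, so worker `worker_id`
    keeps every `num_workers`-th line and no record is seen twice.
    """
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f):
            if line_no % num_workers != worker_id:
                continue  # this line belongs to another worker
            line = line.strip()
            if line:      # skip blank lines
                yield json.loads(line)
```

Only one line is held in memory at a time, so the same loop works for files much larger than RAM. Hugging Face Datasets streaming and Ray Data provide comparable lazy iteration with their own sharding mechanisms.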
