Ingest data into the Databricks Lakehouse

Databricks offers a variety of ways to help you ingest data into a lakehouse backed by Delta Lake.

Upload CSV files

You can securely upload local CSV files to create tables using Databricks SQL. See Upload data and create table in Databricks SQL.

Partner integrations

Databricks partner integrations provide low-code, scalable data ingestion from a variety of sources into Databricks. See Databricks integrations.

COPY INTO

COPY INTO allows SQL users to idempotently and incrementally load data from cloud object storage into Delta Lake tables: files that have already been loaded are skipped when the command runs again. It can be used in Databricks SQL, notebooks, and Databricks Jobs.
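
As a rough illustration of what a COPY INTO load can look like from a notebook, the Python helper below assembles the statement as a SQL string. The helper name build_copy_into, the table name, and the storage path are hypothetical placeholders; FILEFORMAT, FORMAT_OPTIONS, and COPY_OPTIONS are clauses of the COPY INTO syntax. In a Databricks notebook the resulting string could then be passed to spark.sql(...).

```python
def build_copy_into(target_table, source_path, file_format="CSV",
                    format_options=None, copy_options=None):
    """Assemble a COPY INTO statement as a SQL string.

    No escaping is performed, so all values are assumed to be trusted.
    """
    def render(options):
        # Render {'header': 'true'} as ('header' = 'true')
        return "(" + ", ".join(f"'{k}' = '{v}'" for k, v in options.items()) + ")"

    lines = [
        f"COPY INTO {target_table}",
        f"FROM '{source_path}'",
        f"FILEFORMAT = {file_format}",
    ]
    if format_options:
        lines.append(f"FORMAT_OPTIONS {render(format_options)}")
    if copy_options:
        lines.append(f"COPY_OPTIONS {render(copy_options)}")
    return "\n".join(lines)


# Hypothetical table and bucket names, used only for illustration.
statement = build_copy_into(
    "main.default.sales_raw",
    "s3://example-bucket/landing/sales/",
    format_options={"header": "true", "inferSchema": "true"},
    copy_options={"mergeSchema": "true"},
)
print(statement)
```

Because COPY INTO skips files it has already loaded, running the same statement again on a schedule only picks up files that arrived since the previous run.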

Auto Loader

Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage without additional setup. Auto Loader provides a Structured Streaming source called cloudFiles. Given an input directory path in cloud file storage, the cloudFiles source automatically processes new files as they arrive, with the option of also processing the files that already exist in that directory.
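
The following Python sketch shows one way the cloudFiles source might be configured. The helper names, bucket paths, and chosen file format are assumptions made for illustration; the option keys (cloudFiles.format, cloudFiles.schemaLocation, cloudFiles.includeExistingFiles) are Auto Loader options. The read_with_auto_loader function expects the spark session that Databricks notebooks provide and only works on a Databricks runtime, so the example itself just builds and prints the option map.

```python
def auto_loader_options(file_format, schema_location, include_existing=True):
    """Build the option map for the cloudFiles streaming source."""
    return {
        # Format of the incoming files, for example json, csv, or parquet
        "cloudFiles.format": file_format,
        # Where Auto Loader keeps the inferred schema between runs
        "cloudFiles.schemaLocation": schema_location,
        # Whether files already in the directory are processed as well
        "cloudFiles.includeExistingFiles": str(include_existing).lower(),
    }


def read_with_auto_loader(spark, input_path, options):
    """Return a streaming DataFrame over the files under input_path.

    `spark` is assumed to be a SparkSession on a Databricks runtime,
    where the cloudFiles source is available.
    """
    return spark.readStream.format("cloudFiles").options(**options).load(input_path)


# Hypothetical paths, used only for illustration.
options = auto_loader_options(
    "json", "s3://example-bucket/_schemas/events", include_existing=False
)
print(options)
```

In a notebook, read_with_auto_loader(spark, "s3://example-bucket/landing/events/", options) would return a streaming DataFrame that can be written to a Delta table with writeStream and a checkpoint location.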

Convert to Delta

Databricks provides a single command to convert Parquet or Iceberg tables to Delta Lake and unlock the full functionality of the lakehouse; see Convert to Delta Lake.
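
As a hedged sketch, the conversion command can be assembled in Python and passed to spark.sql(...) in a notebook. The helper name build_convert_to_delta, the path, and the partition column are hypothetical; the statement follows the CONVERT TO DELTA syntax, and the PARTITIONED BY clause is shown here for a partitioned Parquet directory.

```python
def build_convert_to_delta(path, source_format="parquet", partition_columns=None):
    """Assemble a CONVERT TO DELTA statement for a path-based table.

    partition_columns is an optional list of (name, type) pairs.
    """
    if source_format not in ("parquet", "iceberg"):
        raise ValueError(f"unsupported source format: {source_format}")
    statement = f"CONVERT TO DELTA {source_format}.`{path}`"
    if partition_columns:
        cols = ", ".join(f"{name} {dtype}" for name, dtype in partition_columns)
        statement += f" PARTITIONED BY ({cols})"
    return statement


# Hypothetical location and partition column, used only for illustration.
statement = build_convert_to_delta(
    "s3://example-bucket/warehouse/events",
    partition_columns=[("event_date", "DATE")],
)
print(statement)
```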

When to use COPY INTO and when to use Auto Loader

Here are a few things to consider when choosing between Auto Loader and COPY INTO:

  • If you expect to ingest on the order of thousands of files, you can use COPY INTO. If you expect on the order of millions of files or more over time, use Auto Loader. Auto Loader requires fewer total operations than COPY INTO to discover files and can split the processing into multiple batches, which makes it less expensive and more efficient at scale.

  • If your data schema is going to evolve frequently, Auto Loader provides better primitives around schema inference and evolution. See Configuring schema inference and evolution in Auto Loader for more details.

  • Loading a subset of re-uploaded files can be a bit easier to manage with COPY INTO. With Auto Loader, it’s harder to reprocess a select subset of files. However, you can use COPY INTO to reload the subset of files while an Auto Loader stream is running simultaneously.
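
To make the last point concrete, here is a minimal Python sketch that builds a COPY INTO statement for reloading a specific subset of files. The helper name, table, path, and file names are hypothetical. The FILES clause restricts the load to the listed files, and the 'force' = 'true' copy option asks COPY INTO to load them even if they were loaded before.

```python
def build_reload_statement(target_table, source_path, file_names, file_format="JSON"):
    """Assemble a COPY INTO statement that reloads only the listed files."""
    if not file_names:
        raise ValueError("file_names must contain at least one file")
    files = ", ".join(f"'{name}'" for name in file_names)
    return "\n".join([
        f"COPY INTO {target_table}",
        f"FROM '{source_path}'",
        f"FILEFORMAT = {file_format}",
        # Only these files, relative to source_path, are considered
        f"FILES = ({files})",
        # Bypass the already-loaded check so re-uploaded files are read again
        "COPY_OPTIONS ('force' = 'true')",
    ])


# Hypothetical names, used only for illustration.
reload_sql = build_reload_statement(
    "main.default.events_raw",
    "s3://example-bucket/landing/events/",
    ["2023-01-05.json", "2023-01-06.json"],
)
print(reload_sql)
```

Note that forcing a reload appends the file contents again, so any rows from the earlier load of those files may need to be removed or deduplicated separately.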

For a brief overview and demonstration of Auto Loader, as well as COPY INTO, watch this YouTube video (2 minutes).

Use the Data tab to load data

The Data Science & Engineering workspace Data tab allows you to use the UI to load small files to create tables; see Explore and create tables with the Data tab.

Use Apache Spark to load data from external sources

You can connect to a variety of data sources using Apache Spark. See Data sources for a list of options and examples for connecting.

Review file metadata captured during data ingestion

Apache Spark automatically captures data about source files during data loading. Databricks lets you access this data with the File metadata column.
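
As a short sketch of how this might be used from Python: the file metadata column is a hidden column named _metadata that is returned only when it is selected explicitly. The helper names below are hypothetical, and the listed fields are a subset of those the column exposes. The with_file_metadata function expects a DataFrame read from files on Databricks, so the example itself only prints the column expressions it would select.

```python
# A subset of the fields exposed by the hidden _metadata column
METADATA_FIELDS = ("file_path", "file_name", "file_size", "file_modification_time")


def metadata_columns(fields=METADATA_FIELDS):
    """Return column expressions such as '_metadata.file_path'."""
    return [f"_metadata.{field}" for field in fields]


def with_file_metadata(df, fields=METADATA_FIELDS):
    """Select every data column plus the chosen file metadata fields.

    `df` is assumed to be a DataFrame read from files, for example
    spark.read.format("csv").load(path).
    """
    return df.select("*", *metadata_columns(fields))


print(metadata_columns())
```

Keeping the source file path and modification time next to each row can help trace a bad record back to the file it came from.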