Work with files in Unity Catalog volumes
This page has examples for managing files in Unity Catalog volumes for various user interfaces, tools, libraries, and languages.
Databricks recommends using volumes for managing all access to non-tabular data in cloud object storage and for storing workload support files. Examples include the following:
- Data files for ingestion, such as CSV, JSON, and Parquet
- Text, image, and audio files for data science, ML, and AI workloads
- CSV or JSON artifacts written by Databricks for integration with external systems
- Libraries, init scripts, and build artifacts
Volumes provide Portable Operating System Interface (POSIX)-style paths that work with Filesystem in Userspace (FUSE)-dependent tools and frameworks. This makes them ideal for machine learning frameworks and open source Python modules that require POSIX-style access. For detailed information about URI schemes, POSIX paths, and how they relate to volumes, see Do I need to provide a URI scheme to access data?.
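For example, on compute with FUSE support you can read a file in a volume with standard Python file APIs, with no URI scheme required. This is a minimal sketch; the path below is a placeholder, so substitute a file that exists in your workspace.

```python
# Volumes expose POSIX-style paths, so standard Python file APIs work directly.
# The path below is a placeholder.
with open("/Volumes/catalog_name/schema_name/volume_name/data.csv") as f:
    for line in f:
        print(line.rstrip())
```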
Methods for managing files in volumes
The following table summarizes the methods for managing files in volumes. Quick examples of each method appear later on this page.

| Interface | Description |
|---|---|
| Catalog Explorer | Interactive file management through the Databricks workspace |
| Apache Spark, pandas, and SQL | Read and write files using Apache Spark, pandas, or SQL |
| Databricks Utilities | File operations using `dbutils.fs` |
| SQL commands | File operations using SQL keywords (`PUT INTO`, `GET`, `LIST`, and `REMOVE`) |
| Databricks CLI | Command-line operations using `databricks fs` |
| Databricks SDKs | File operations using Python, Java, or Go SDKs |
| Files API | Direct API access for custom integrations |
Use Catalog Explorer
Catalog Explorer has options for common file management tasks for files stored in Unity Catalog volumes.
To interact with files in a volume, do the following:
- In your Databricks workspace, click Catalog.
- Search or browse for the volume that you want to work with and select it.
For details on creating and managing volumes, see Create and manage Unity Catalog volumes.
Upload files to a volume
You can upload files of any format—structured, semi-structured, or unstructured—to a volume. When you upload through the UI, there's a 5 GB file size limit. However, volumes themselves support files up to the maximum size supported by the underlying cloud storage. You can write very large files using Spark, and upload large files using the Databricks API or SDKs.
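For example, the following is a minimal sketch that uses the Databricks SDK for Python to upload a file that exceeds the 5 GB UI limit. The local and volume paths are placeholders, and the sketch assumes the SDK can resolve authentication from your environment.

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # assumes authentication resolves from the environment or a config profile

# Stream a large local file to a volume path (both paths are placeholders).
with open("/tmp/large_dataset.parquet", "rb") as f:
    w.files.upload(
        "/Volumes/catalog_name/schema_name/volume_name/large_dataset.parquet",
        f,
        overwrite=True,
    )
```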
Requirements
Before you upload to a volume, make sure you have the following:
- A workspace with Unity Catalog enabled
- `WRITE VOLUME` on the target volume
- `USE SCHEMA` on the parent schema
- `USE CATALOG` on the parent catalog
For details, see Unity Catalog privileges and securable objects.
Upload steps
- In Catalog Explorer, click Add data > Upload to volume.
- Click Browse or drop files into the drop zone.
- Select a volume or directory, or paste a volume path.
- If no volume exists in the target schema, create one.
- You can also create a new directory within the target volume.
You can also access the upload UI in the following ways:
- In the sidebar: New > Add data > Upload files to volume
- From a notebook: File > Upload files to volume
Next steps
After you upload to a volume, you can do the following:
- Create a Unity Catalog managed table from the files. See Create a table from data in a volume.
- Use the files in ML and data science workloads
- Configure cluster libraries, notebook-scoped libraries, or job dependencies using the uploaded files
- Ingest data for engineering pipelines using Auto Loader or COPY INTO (see the sketch after this list)
- Process files with AI functions such as `ai_parse_document`
- Set up file arrival triggers in jobs
- Upload documents for use with AgentBricks (for example, knowledge assistant scenarios)
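As a sketch of the Auto Loader option above, the following assumes JSON files land in a volume directory and writes them to a target table. All paths, options, and the table name are placeholders.

```python
# Incrementally ingest JSON files from a volume directory with Auto Loader.
# All paths and the table name below are placeholders.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/catalog_name/schema_name/volume_name/_schemas/orders")
    .load("/Volumes/catalog_name/schema_name/volume_name/landing/orders")
    .writeStream
    .option("checkpointLocation", "/Volumes/catalog_name/schema_name/volume_name/_checkpoints/orders")
    .trigger(availableNow=True)
    .toTable("catalog_name.schema_name.orders_bronze"))
```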
Download files from a volume
To download files from a volume, do the following:
- Select one or more files.
- Click Download to download these files.
Delete files from a volume
To delete files from a volume, do the following:
- Select one or more files.
- Click Delete.
- Click Delete to confirm in the dialog that appears.
Create a blank directory
To create a new directory in a volume, do the following:
- On the volume overview tab, click Create directory.
- Enter a directory name.
- Click Create.
Delete directories from a volume
To delete directories from a volume, do the following:
- Select one or more directories.
- Click Delete.
- Click Delete to confirm in the dialog that appears.
UI file management tasks for volumes
Click the kebab menu next to a file name to perform the following actions:
- Copy path
- Download file
- Delete file
- Create table
Create a table from data in a volume
Databricks provides a UI to create a Unity Catalog managed table from a file, files, or directory of files stored in a Unity Catalog volume.
You must have `CREATE TABLE` permissions in the target schema and have access to a running SQL warehouse.
- Select one or more files or a directory. Files should have the same data layout.
- Click Create table. The Create table from volumes dialog appears.
- Use the provided dialog to review a preview of the data and complete the following configurations:
  - Choose to Create new table or Overwrite existing table.
  - Select the target Catalog and Schema.
  - Specify the Table name.
  - (Optional) Override default column names and types, or choose to exclude columns.

  Note: Click Advanced attributes to view additional options.
- Click Create table to create the table with the specified attributes. Upon completion, Catalog Explorer displays the table details.
Programmatically work with files in volumes
Read and write files in volumes from all supported languages and workspace editors using the following format:
`/Volumes/catalog_name/schema_name/volume_name/path/to/files`
You interact with files in volumes in the same way that you interact with files in any cloud object storage location. That means that if you currently manage code that uses cloud URIs, DBFS mount paths, or DBFS root paths to interact with data or files, you can update your code to use volumes instead.
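For example, code that reads from a DBFS mount path usually only needs its path swapped for a volume path. Both paths below are placeholders.

```python
# Before: reading from a DBFS mount path (placeholder path)
df = spark.read.parquet("/mnt/raw/events/")

# After: the same read against a Unity Catalog volume path (placeholder path)
df = spark.read.parquet("/Volumes/catalog_name/schema_name/volume_name/events/")
```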
Volumes are only used for non-tabular data. Databricks recommends registering tabular data using Unity Catalog tables and then reading and writing data using table names.
Read and write data in volumes
Use Apache Spark, pandas, Spark SQL, and other OSS libraries to read and write data files in volumes.
The following examples demonstrate reading a CSV file stored in a volume:
Python

```python
df = spark.read.format("csv").load("/Volumes/catalog_name/schema_name/volume_name/data.csv")
display(df)
```

pandas

```python
import pandas as pd

df = pd.read_csv('/Volumes/catalog_name/schema_name/volume_name/data.csv')
display(df)
```

SQL

```sql
SELECT * FROM csv.`/Volumes/catalog_name/schema_name/volume_name/data.csv`
```
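Writing works the same way. The following sketch writes the Spark DataFrame from the Python example back to a volume and shows an equivalent pandas write; the output paths are placeholders.

```python
# Write the Spark DataFrame from the read example above to a volume as Parquet (placeholder path).
df.write.mode("overwrite").parquet("/Volumes/catalog_name/schema_name/volume_name/output/parquet/")

# pandas can write to the same POSIX-style paths (placeholder path).
import pandas as pd
pd.DataFrame({"id": [1, 2], "value": ["a", "b"]}).to_csv(
    "/Volumes/catalog_name/schema_name/volume_name/output/ids.csv", index=False
)
```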
Utility commands for files in volumes
Databricks provides the following tools for managing files in volumes:
- The `dbutils.fs` submodule in Databricks Utilities. See File system utility (dbutils.fs).
- The `%fs` magic command, which is an alias for `dbutils.fs`.
- The `%sh` magic command, which lets you run Bash commands against volumes.
For an example of using these tools to download files from the internet, unzip files, and move files from ephemeral block storage to volumes, see Download data from the internet.
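For example, the following sketch uses `dbutils.fs` in a notebook to list a volume directory and move a file from ephemeral local storage into the volume; all paths are placeholders.

```python
# List the contents of a volume directory (placeholder path).
display(dbutils.fs.ls("/Volumes/catalog_name/schema_name/volume_name/"))

# Copy a file from ephemeral local disk into the volume, then remove the local copy.
dbutils.fs.cp("file:/local_disk0/tmp/report.csv", "/Volumes/catalog_name/schema_name/volume_name/reports/report.csv")
dbutils.fs.rm("file:/local_disk0/tmp/report.csv")
```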
You can also use OSS packages for file utility commands, such as the Python `os` module, as shown in the following example:

```python
import os

os.mkdir('/Volumes/catalog_name/schema_name/volume_name/directory_name')
```
Manage files in volumes from external tools
Databricks provides a suite of tools for programmatically managing files in volumes from your local environment or integrated systems.
SQL commands for files in volumes
Databricks supports the following SQL keywords for interacting with files in volumes: `PUT INTO`, `GET`, `LIST`, and `REMOVE`.

In Databricks notebooks and the SQL query editor, only the `LIST` command is supported. The other commands (`PUT INTO`, `GET`, and `REMOVE`) are available through the following Databricks SQL connectors and drivers, which support managing files in volumes:
- Databricks SQL Connector for Python
- Databricks SQL Driver for Go
- Databricks SQL Driver for Node.js
- Databricks JDBC driver
- Databricks ODBC driver
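As a sketch, the following uses the Databricks SQL Connector for Python to upload a local file with `PUT INTO` and then list the volume. The connection details and paths are placeholders, and `staging_allowed_local_path` must include any local path referenced by `PUT INTO` or `GET`.

```python
import os
from databricks import sql

# Connection details are placeholders read from environment variables.
with sql.connect(
    server_hostname=os.getenv("DATABRICKS_SERVER_HOSTNAME"),
    http_path=os.getenv("DATABRICKS_HTTP_PATH"),
    access_token=os.getenv("DATABRICKS_TOKEN"),
    staging_allowed_local_path="/tmp",
) as connection:
    with connection.cursor() as cursor:
        # Upload a local file to a volume path.
        cursor.execute("PUT '/tmp/data.csv' INTO '/Volumes/main/default/my-volume/data.csv' OVERWRITE")
        # List the files in the volume.
        cursor.execute("LIST '/Volumes/main/default/my-volume/'")
        print(cursor.fetchall())
```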
Manage files in volumes with the Databricks CLI
Use the subcommands in `databricks fs`. See the fs command group.

The Databricks CLI requires the scheme `dbfs:/` to precede all volume paths. For example, `dbfs:/Volumes/catalog_name/schema_name/volume_name/path/to/data`.
Manage files in volumes with SDKs
The following SDKs support managing files in volumes:
- The Databricks SDK for Python. Use the available methods in WorkspaceClient.files. For examples, see Manage files in Unity Catalog volumes.
- The Databricks SDK for Java. Use the available methods in WorkspaceClient.files. For examples, see Manage files in Unity Catalog volumes.
- The Databricks SDK for Go. Use the available methods in WorkspaceClient.files. For examples, see Manage files in Unity Catalog volumes.
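For example, the following is a minimal sketch with the Databricks SDK for Python that lists a volume directory and downloads one file. The paths are placeholders, and authentication is assumed to resolve from your environment.

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # assumes authentication resolves from the environment or a config profile

# List the contents of a volume directory (placeholder path).
for entry in w.files.list_directory_contents("/Volumes/catalog_name/schema_name/volume_name/"):
    print(entry.path, entry.is_directory)

# Download a file from the volume and save it locally.
response = w.files.download("/Volumes/catalog_name/schema_name/volume_name/data.csv")
with open("/tmp/data.csv", "wb") as f:
    f.write(response.contents.read())
```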
Manage files in volumes with the REST API
Use the Files API to manage files in volumes.
REST API examples for files in volumes
The following examples use `curl` and the Databricks REST API to perform file management tasks in volumes.
The following example creates an empty folder named my-folder
in the specified volume.
```bash
curl --request PUT "https://${DATABRICKS_HOST}/api/2.0/fs/directories/Volumes/main/default/my-volume/my-folder/" \
--header "Authorization: Bearer ${DATABRICKS_TOKEN}"
```
The following example creates a file named data.csv
with the specified data in the specified path in the volume.
```bash
curl --request PUT "https://${DATABRICKS_HOST}/api/2.0/fs/files/Volumes/main/default/my-volume/my-folder/data.csv?overwrite=true" \
--header "Authorization: Bearer ${DATABRICKS_TOKEN}" \
--header "Content-Type: application/octet-stream" \
--data-binary $'id,Text\n1,Hello World!'
```
The following example lists the contents of a volume in the specified path. This example uses jq to format the response body's JSON for easier reading.
```bash
curl --request GET "https://${DATABRICKS_HOST}/api/2.0/fs/directories/Volumes/main/default/my-volume/" \
--header "Authorization: Bearer ${DATABRICKS_TOKEN}" | jq .
```
The following example lists the contents of a folder in a volume in the specified path. This example uses jq to format the response body's JSON for easier reading.
```bash
curl --request GET "https://${DATABRICKS_HOST}/api/2.0/fs/directories/Volumes/main/default/my-volume/my-folder" \
--header "Authorization: Bearer ${DATABRICKS_TOKEN}" | jq .
```
The following example prints the contents of a file in the specified path in a volume.
```bash
curl --request GET "https://${DATABRICKS_HOST}/api/2.0/fs/files/Volumes/main/default/my-volume/my-folder/data.csv" \
--header "Authorization: Bearer ${DATABRICKS_TOKEN}"
```
The following example deletes a file in the specified path from a volume.
```bash
curl --request DELETE "https://${DATABRICKS_HOST}/api/2.0/fs/files/Volumes/main/default/my-volume/my-folder/data.csv" \
--header "Authorization: Bearer ${DATABRICKS_TOKEN}"
```
The following example deletes a folder from the specified volume.
```bash
curl --request DELETE "https://${DATABRICKS_HOST}/api/2.0/fs/directories/Volumes/main/default/my-volume/my-folder/" \
--header "Authorization: Bearer ${DATABRICKS_TOKEN}"
```
Limitations of working with files in volumes
Before working with files in volumes, consider the following limitations:
- Direct-append or non-sequential (random) writes are not supported. This affects operations like writing Zip and Excel files. For these workloads:

  - Perform the operations on a local disk first.
  - Copy the results to the volume.

  For example:

  ```python
  import xlsxwriter
  from shutil import copyfile

  workbook = xlsxwriter.Workbook('/local_disk0/tmp/excel.xlsx')
  worksheet = workbook.add_worksheet()
  worksheet.write(0, 0, "Key")
  worksheet.write(0, 1, "Value")
  workbook.close()

  copyfile('/local_disk0/tmp/excel.xlsx', '/Volumes/my_catalog/my_schema/my_volume/excel.xlsx')
  ```

- Sparse files are not supported. To copy sparse files, use `cp --sparse=never`:

  ```bash
  $ cp sparse.file /Volumes/my_catalog/my_schema/my_volume/sparse.file
  error writing '/dbfs/sparse.file': Operation not supported

  $ cp --sparse=never sparse.file /Volumes/my_catalog/my_schema/my_volume/sparse.file
  ```