Work with files on Databricks

Databricks provides multiple utilities and APIs for interacting with files in the following locations:

  • Unity Catalog volumes.

  • Workspace files.

  • Cloud object storage.

  • DBFS mounts and DBFS root.

  • Ephemeral storage attached to the driver node of the cluster.

This article provides examples for interacting with files in these locations for the following tools:

  • Apache Spark.

  • Spark SQL and Databricks SQL.

  • Databricks file system utilities (dbutils.fs or %fs).

  • Databricks CLI.

  • Databricks REST API.

  • Bash shell commands (%sh).

  • Notebook-scoped library installs using %pip.

  • Pandas.

  • OSS Python file management and processing utilities.

Important

File operations that require FUSE access to data cannot directly access cloud object storage using URIs. Databricks recommends using Unity Catalog volumes to configure access to these locations for FUSE.

Scala does not support FUSE for Unity Catalog volumes or workspace files on compute configured with single user access mode or clusters without Unity Catalog. Scala supports FUSE for Unity Catalog volumes and workspace files on compute configured with Unity Catalog and shared access mode.

Do I need to provide a URI scheme to access data?

Data access paths in Databricks follow one of the following standards:

  • URI-style paths include a URI scheme. For Databricks-native data access solutions, URI schemes are optional for most use cases. When you directly access data in cloud object storage, you must provide the correct URI scheme for the storage type.

    URI paths diagram
  • POSIX-style paths provide data access relative to the driver root (/). POSIX-style paths never require a scheme. You can use Unity Catalog volumes or DBFS mounts to provide POSIX-style access to data in cloud object storage. Many ML frameworks and other OSS Python modules require FUSE and can only use POSIX-style paths.

    POSIX paths diagram
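The distinction can be sketched in Python: URI-style paths carry a scheme that a parser can extract, while POSIX-style paths do not. This is an illustrative sketch only; the bucket, catalog, and mount names below are placeholders.

```python
# Sketch of the two path styles; all paths below are placeholders.
from urllib.parse import urlparse

paths = [
    "s3://my-bucket/path/file.json",                      # URI-style: scheme required for cloud storage
    "/Volumes/my_catalog/my_schema/my_volume/file.json",  # POSIX-style: no scheme, FUSE-compatible
    "/dbfs/mnt/my_mount/file.json",                       # POSIX-style access via a DBFS mount
]

for path in paths:
    scheme = urlparse(path).scheme
    style = f"URI-style (scheme {scheme!r})" if scheme else "POSIX-style (no scheme)"
    print(f"{path} -> {style}")
```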

Work with files in Unity Catalog volumes

Databricks recommends using Unity Catalog volumes to configure access to non-tabular data files stored in cloud object storage. See Create and work with volumes.

| Tool | Example |
| --- | --- |
| Apache Spark | `spark.read.format("json").load("/Volumes/my_catalog/my_schema/my_volume/data.json").show()` |
| Spark SQL and Databricks SQL | ``SELECT * FROM csv.`/Volumes/my_catalog/my_schema/my_volume/data.csv`;`` ``LIST '/Volumes/my_catalog/my_schema/my_volume/';`` |
| Databricks file system utilities | `dbutils.fs.ls("/Volumes/my_catalog/my_schema/my_volume/")` `%fs ls /Volumes/my_catalog/my_schema/my_volume/` |
| Databricks CLI | `databricks fs cp /path/to/local/file dbfs:/Volumes/my_catalog/my_schema/my_volume/` |
| Databricks REST API | `POST https://<databricks-instance>/api/2.1/jobs/create` with body `{"name": "A multitask job", "tasks": [{..., "libraries": [{"jar": "/Volumes/dev/environment/libraries/logging/Logging.jar"}], ...}, ...]}` |
| Bash shell commands | `%sh curl http://<address>/text.zip -o /Volumes/my_catalog/my_schema/my_volume/tmp/text.zip` |
| Library installs | `%pip install /Volumes/my_catalog/my_schema/my_volume/my_library.whl` |
| Pandas | `df = pd.read_csv('/Volumes/my_catalog/my_schema/my_volume/data.csv')` |
| OSS Python | `os.listdir('/Volumes/my_catalog/my_schema/my_volume/path/to/directory')` |

Note

The dbfs:/ scheme is required when working with the Databricks CLI.

Work with workspace files

You can use workspace files to store and access data and other files saved alongside notebooks and other workspace assets. Because workspace files have size restrictions, Databricks recommends storing only small data files here, primarily for development and testing.

| Tool | Example |
| --- | --- |
| Apache Spark | `spark.read.format("json").load("file:/Workspace/Users/<user-folder>/data.json").show()` |
| Spark SQL and Databricks SQL | ``SELECT * FROM json.`file:/Workspace/Users/<user-folder>/file.json`;`` |
| Databricks file system utilities | `dbutils.fs.ls("file:/Workspace/Users/<user-folder>/")` `%fs ls file:/Workspace/Users/<user-folder>/` |
| Databricks CLI | `databricks workspace list` |
| Databricks REST API | `POST https://<databricks-instance>/api/2.0/workspace/delete` with body `{"path": "/Workspace/Shared/code.py", "recursive": "false"}` |
| Bash shell commands | `%sh curl http://<address>/text.zip -o /Workspace/Users/<user-folder>/text.zip` |
| Library installs | `%pip install /Workspace/Users/<user-folder>/my_library.whl` |
| Pandas | `df = pd.read_csv('/Workspace/Users/<user-folder>/data.csv')` |
| OSS Python | `os.listdir('/Workspace/Users/<user-folder>/path/to/directory')` |

Note

The file:/ scheme is required when working with Databricks Utilities, Apache Spark, or SQL.

You cannot use Apache Spark to read or write to workspace files on clusters configured with shared access mode.

Work with files in cloud object storage

Databricks recommends using Unity Catalog volumes to configure secure access to files in cloud object storage. If you choose to directly access data in cloud object storage using URIs, you must configure permissions. See Manage external locations, external tables, and external volumes.

The following examples use URIs to access data in cloud object storage:

| Tool | Example |
| --- | --- |
| Apache Spark | `spark.read.format("json").load("s3://<bucket>/path/file.json").show()` |
| Spark SQL and Databricks SQL | ``SELECT * FROM json.`s3://<bucket>/path/file.json`;`` ``LIST 's3://<bucket>/path';`` |
| Databricks file system utilities | `dbutils.fs.ls("s3://<bucket>/path/")` `%fs ls s3://<bucket>/path/` |
| Databricks CLI | Not supported |
| Databricks REST API | Not supported |
| Bash shell commands | Not supported |
| Library installs | `%pip install s3://bucket-name/path/to/library.whl` |
| Pandas | Not supported |
| OSS Python | Not supported |

Work with files in DBFS mounts and DBFS root

DBFS mounts are not securable using Unity Catalog and are no longer recommended by Databricks. Data stored in the DBFS root is accessible by all users in the workspace. Databricks recommends against storing any sensitive or production code or data in the DBFS root. See What is the Databricks File System (DBFS)?.

| Tool | Example |
| --- | --- |
| Apache Spark | `spark.read.format("json").load("/mnt/path/to/data.json").show()` |
| Spark SQL and Databricks SQL | ``SELECT * FROM json.`/mnt/path/to/data.json`;`` |
| Databricks file system utilities | `dbutils.fs.ls("/mnt/path")` `%fs ls /mnt/path` |
| Databricks CLI | `databricks fs cp dbfs:/mnt/path/to/remote/file /path/to/local/file` |
| Databricks REST API | `POST https://<host>/api/2.0/dbfs/delete --data '{ "path": "/tmp/HelloWorld.txt" }'` |
| Bash shell commands | `%sh curl http://<address>/text.zip > /dbfs/mnt/tmp/text.zip` |
| Library installs | `%pip install /dbfs/mnt/path/to/my_library.whl` |
| Pandas | `df = pd.read_csv('/dbfs/mnt/path/to/data.csv')` |
| OSS Python | `os.listdir('/dbfs/mnt/path/to/directory')` |

Note

The dbfs:/ scheme is required when working with the Databricks CLI.

Work with files in ephemeral storage attached to the driver node

The ephemeral storage attached to the driver node is block storage with native POSIX-based path access. Any data stored in this location disappears when a cluster terminates or restarts.

| Tool | Example |
| --- | --- |
| Apache Spark | Not supported |
| Spark SQL and Databricks SQL | Not supported |
| Databricks file system utilities | `dbutils.fs.ls("file:/path")` `%fs ls file:/path` |
| Databricks CLI | Not supported |
| Databricks REST API | Not supported |
| Bash shell commands | `%sh curl http://<address>/text.zip > /tmp/text.zip` |
| Library installs | Not supported |
| Pandas | `df = pd.read_csv('/path/to/data.csv')` |
| OSS Python | `os.listdir('/path/to/directory')` |

Note

The file:/ scheme is required when working with Databricks Utilities.

Move data from ephemeral storage to volumes

You might want to use Apache Spark to access data downloaded or saved to ephemeral storage. Because ephemeral storage is attached to the driver and Spark is a distributed processing engine, not all operations can directly access data there. If you need to move data from the driver filesystem to Unity Catalog volumes, you can copy files using magic commands or the Databricks utilities, as in the following examples:

dbutils.fs.cp("file:/<path>", "/Volumes/<catalog>/<schema>/<volume>/<path>")
%sh cp /<path> /Volumes/<catalog>/<schema>/<volume>/<path>
%fs cp file:/<path> /Volumes/<catalog>/<schema>/<volume>/<path>

Where do deleted files go?

Deleting a workspace file sends it to the trash. You can either recover or permanently delete files from the trash using the UI.

See Delete an object.

Local file API limitations

The following lists the limitations on local file API usage with cloud object storage in Databricks Runtime.

  • Does not support Amazon S3 mounts with client-side encryption enabled.

  • Databricks has limited support for workspace file operations on serverless compute and on clusters with Unity Catalog access modes (Assigned and Shared). Databricks is working on enabling the currently unsupported combinations. The following table summarizes support by language, access mode, and whether the code runs on the driver or in a UDF.

| Language | Cluster type | Driver or UDF? | Supported? |
| --- | --- | --- | --- |
| Python | Assigned | Driver | Yes |
| Python | Assigned | UDF | Yes |
| Python | Shared | Driver | Yes |
| Python | Shared | UDF | No |
| Scala | Assigned | Driver | No |
| Scala | Assigned | UDF | No |
| Scala | Shared | Driver | Yes |
| Scala | Shared | UDF | No |

  • If your workflow uses source code located in a remote Git repository, you cannot write to the current directory or write using a relative path. Write data to the other location options outlined in Work with files in cloud object storage.

  • No direct append operations.

    Because the underlying object storage does not support appends, Databricks would have to download the data, perform the append, and re-upload the data to support the command. This works for small files but quickly becomes an issue as file sizes increase.
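    One workaround for appends, sketched below under the assumption that the file is small enough to stage locally, is to copy the file to ephemeral driver storage, append there, and copy the whole file back. The helper name and paths are hypothetical examples, not a Databricks API.

    ```python
    # Illustrative append workaround: stage the file on local disk (which
    # supports append), then re-upload the whole file. The helper name and
    # paths are hypothetical, not part of any Databricks API.
    import shutil

    def append_via_local_copy(remote_path: str, local_path: str, text: str) -> None:
        """Download, append on local block storage, and re-upload the full file."""
        shutil.copyfile(remote_path, local_path)   # download current contents
        with open(local_path, "a") as f:           # append locally (supported)
            f.write(text)
        shutil.copyfile(local_path, remote_path)   # write the whole file back

    # Example call with placeholder paths:
    # append_via_local_copy(
    #     "/Volumes/my_catalog/my_schema/my_volume/log.txt",
    #     "/local_disk0/tmp/log.txt",
    #     "new entry\n")
    ```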

  • No non-sequential (random) writes, such as writing Zip and Excel files.

    For direct-append or random-write workloads, perform the operations on a local disk first and then copy the results to Unity Catalog volumes. For example:

    import xlsxwriter
    from shutil import copyfile
    
    workbook = xlsxwriter.Workbook('/local_disk0/tmp/excel.xlsx')
    worksheet = workbook.add_worksheet()
    worksheet.write(0, 0, "Key")
    worksheet.write(0, 1, "Value")
    workbook.close()
    
    copyfile('/local_disk0/tmp/excel.xlsx', '/Volumes/my_catalog/my_schema/my_volume/excel.xlsx')
    
  • No sparse files. To copy sparse files, use cp --sparse=never:

    $ cp sparse.file /Volumes/my_catalog/my_schema/my_volume/sparse.file
    error writing '/dbfs/sparse.file': Operation not supported
    $ cp --sparse=never sparse.file /Volumes/my_catalog/my_schema/my_volume/sparse.file
    
  • Executors cannot write to workspace files.

  • Workspace file size is limited to 200 MB. Operations that attempt to download or create files larger than this limit fail.

  • You cannot use git commands when you save to workspace files. The creation of .git directories is not allowed in workspace files.

  • No symlinks.

Enable workspace files

Databricks workspace files are the set of files in a workspace that are not notebooks. Workspace files are enabled by default.

To check whether support for non-notebook files is enabled in your Databricks workspace, call the /api/2.0/workspace-conf REST API from a notebook or other environment with access to your workspace, and read the value of the enableWorkspaceFilesystem key. If it is set to true, non-notebook files are already enabled for your workspace.

The following example demonstrates how you can call this API from a notebook to check whether workspace files are disabled and, if so, re-enable them. To disable workspace files, set enableWorkspaceFilesystem to false with the /api/2.0/workspace-conf API.
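As a minimal sketch of this flow (not the linked notebook itself), the check-and-enable calls can be built with Python's standard library. The host and token values are placeholders you must replace with your workspace URL and a personal access token.

```python
# Hedged sketch of the workspace-conf flow; <databricks-instance> and <token>
# are placeholders. A GET reads enableWorkspaceFilesystem; a PATCH sets it.
import json
import urllib.request

def workspace_conf_request(host, token, method="GET", body=None):
    """Build an authenticated request against /api/2.0/workspace-conf."""
    url = f"{host}/api/2.0/workspace-conf"
    if method == "GET":
        url += "?keys=enableWorkspaceFilesystem"
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(url, data=data, method=method)
    req.add_header("Authorization", f"Bearer {token}")
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req

# Check the current value (send with urllib.request.urlopen(get_req); the
# response body is JSON such as {"enableWorkspaceFilesystem": "true"}):
get_req = workspace_conf_request("https://<databricks-instance>", "<token>")

# Re-enable workspace files if the check returned "false":
patch_req = workspace_conf_request(
    "https://<databricks-instance>", "<token>",
    method="PATCH", body={"enableWorkspaceFilesystem": "true"})
```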

Example: Notebook for re-enabling Databricks workspace file support
