Best practices for DBFS and Unity Catalog
Unity Catalog introduces a number of new configurations and concepts that approach data governance entirely differently than DBFS. This article outlines several best practices around working with Unity Catalog external locations and DBFS.
Databricks recommends against using DBFS and mounted cloud object storage for most use cases in Unity Catalog-enabled Databricks workspaces. This article describes a few scenarios in which you should use mounted cloud object storage. Note that Databricks does not recommend using the DBFS root in conjunction with Unity Catalog, unless you must migrate files or data stored there into Unity Catalog.
How is DBFS used in Unity Catalog-enabled workspaces?
The DBFS root is the default location for storing files associated with a number of actions performed in the Databricks workspace, including creating managed tables in the workspace-scoped hive_metastore. Actions performed against tables in the hive_metastore use legacy data access patterns, which may include data and storage credentials managed by DBFS.
How does DBFS work in single user access mode?
Clusters configured with single user access mode have full access to DBFS, including all files in the DBFS root and mounted data. Because the DBFS root and mounts remain available in this access mode, it is a good choice for ML workloads that need access to Unity Catalog datasets alongside legacy DBFS data.
Databricks recommends using service principals with scheduled jobs and single user access mode for production workloads that need access to data managed by both DBFS and Unity Catalog.
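The following sketch assumes a cluster running in single user access mode, a hypothetical DBFS mount at dbfs:/mnt/legacy_data, and a hypothetical Unity Catalog table main.analytics.events; it shows how one workload can read from both sources.

# Read legacy data from a DBFS mount (hypothetical mount point).
legacy_df = spark.read.format("delta").load("dbfs:/mnt/legacy_data/events")

# Read governed data from Unity Catalog (hypothetical three-level table name).
uc_df = spark.read.table("main.analytics.events")

# Both sources are accessible on a single user access mode cluster,
# so they can be combined in the same job.
combined_df = legacy_df.unionByName(uc_df, allowMissingColumns=True)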
Use DBFS while launching Unity Catalog clusters with single user access mode
Databricks recommends using DBFS mounts for init scripts, configurations, and libraries stored in external storage. This behavior is not supported in shared access mode.
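As a minimal sketch, assuming a hypothetical mount at dbfs:/mnt/cluster-assets, you might stage an init script on mounted storage and then reference its dbfs:/ path in the cluster configuration:

# Stage a cluster init script on a DBFS mount (hypothetical paths and package).
dbutils.fs.put(
    "dbfs:/mnt/cluster-assets/init/install_deps.sh",
    "#!/bin/bash\npip install my-internal-package\n",
    overwrite=True,
)
# The script can then be referenced as a cluster-scoped init script at
# dbfs:/mnt/cluster-assets/init/install_deps.sh when launching a single user
# access mode cluster. Libraries staged the same way can also be installed on the cluster.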
Do not use DBFS with Unity Catalog external locations
Unity Catalog secures access to data in external locations by using full cloud URI paths to identify grants on managed object storage directories. DBFS mounts use an entirely different data access model that bypasses Unity Catalog entirely. Databricks recommends that you do not reuse the same cloud object storage containers for both DBFS mounts and Unity Catalog external volumes.
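To illustrate the two access models, the following sketch contrasts a DBFS mount path with a direct cloud URI governed by a Unity Catalog external location; the mount point, container, and account names are hypothetical.

# DBFS mount access: credentials come from the mount definition and
# bypass Unity Catalog grants entirely (hypothetical mount point).
mounted_df = spark.read.format("delta").load("dbfs:/mnt/raw-data/events")

# Unity Catalog external location access: the full cloud URI is checked
# against grants on the external location (hypothetical ADLS URI).
governed_df = spark.read.format("delta").load("abfss://raw@myaccount.dfs.core.windows.net/events")

# Avoid pointing a DBFS mount and a Unity Catalog external location at the
# same storage container, since the two access models do not interact.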
Secure your Unity Catalog-managed storage
Each Unity Catalog metastore has an object storage account configured by a Databricks account administrator. Unity Catalog uses this location to store all data and metadata for Unity Catalog-managed tables.
A storage account used for a Unity Catalog metastore should:
Be created new for Unity Catalog.
Have a custom identity policy defined for Unity Catalog.
Only be accessible with Unity Catalog.
Only be accessed using the identity access policies created for Unity Catalog.
Add existing data to external locations
It is possible to load existing storage accounts into Unity Catalog using external locations. For greatest security, Databricks recommends loading a storage account as an external location only after all other storage credentials and access patterns to that account have been revoked.
You should never load a storage account used as a DBFS root as an external location in Unity Catalog.
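As a minimal sketch of registering existing storage as an external location, assuming a storage credential named my_storage_credential already exists and using a hypothetical location name and container URI:

# Register existing cloud storage as a Unity Catalog external location
# (hypothetical names and URI; requires appropriate privileges).
spark.sql("""
  CREATE EXTERNAL LOCATION IF NOT EXISTS legacy_landing_zone
  URL 'abfss://landing@myaccount.dfs.core.windows.net/'
  WITH (STORAGE CREDENTIAL my_storage_credential)
""")

# Grant access on the external location rather than distributing storage keys.
spark.sql("GRANT READ FILES ON EXTERNAL LOCATION legacy_landing_zone TO `data-engineers`")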
Cluster configurations are ignored by Unity Catalog filesystem access
Unity Catalog does not respect cluster configurations for filesystem settings. This means that Hadoop filesystem settings for configuring custom behavior with cloud object storage do not work when accessing data using Unity Catalog.
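For example, a custom Hadoop filesystem setting applied to the Spark session has no effect on reads mediated by Unity Catalog; the setting name and path below are illustrative assumptions.

# Hypothetical custom Hadoop filesystem setting applied to the Spark session.
spark.conf.set("spark.hadoop.fs.azure.read.request.size", "4194304")

# When the path below is governed by a Unity Catalog external location, access
# is mediated by Unity Catalog and the custom setting above is not applied.
df = spark.read.format("delta").load("abfss://raw@myaccount.dfs.core.windows.net/events")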
Limitation around multiple path access
While you can generally use Unity Catalog and DBFS together, paths that are equal or share a parent/child relationship cannot be referenced in the same command or notebook cell using different access methods.
For example, if an external table foo is defined in the hive_metastore at location a/b/c and an external location is defined in Unity Catalog on a/b/, the following code would throw an error:
spark.read.table("foo").filter("id IS NOT NULL").write.mode("overwrite").save("a/b/c")
This error would not arise if this logic is broken into two cells:
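# Cell 1: read the Hive metastore table into a DataFrame.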
df = spark.read.table("foo").filter("id IS NOT NULL")
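# Cell 2: write to the overlapping path in a separate command.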
df.write.mode("overwrite").save("a/b/c")