Unity Catalog allows you to apply fine-grained security to tables and to securely access them from any language, all while interacting seamlessly with other machine-learning components in Databricks. This article shows how to use Python to train a machine-learning model using data in Unity Catalog.
Your Databricks account must be on the Premium plan.
You must have the ability to create a cluster, or access to a cluster running in a Unity Catalog-compliant access mode.
Follow these steps to create a Single-User Databricks Machine Learning cluster that can access data in Unity Catalog.
Click Create cluster.
Select either 11.1 ML (Scala 2.12.14, Spark 3.3.0) or higher, or 11.1 ML (GPU, Scala 2.12.14, Spark 3.3.0) or higher.
Click Access Mode and select Single User or Shared, depending on your use case.
Shared clusters can be shared by multiple users, but only SQL and Python workloads are supported.
To run workloads using Python, Scala, or R, set the access mode to single user. Single user clusters can also run SQL workloads. The cluster can be used exclusively by a single user (by default, the single user is the owner of the cluster) and other users can’t attach to the cluster.
For more information about the features available in each access mode, see What is cluster access mode?.
Click Create cluster.
Follow these steps to create a new catalog where your machine learning team can store their data assets.
In a workspace with the metastore assigned, log in as the metastore admin, or as a user with the `CREATE CATALOG` privilege on the metastore.
Create a notebook or open the Databricks SQL editor.
Run the following command to create the `ml` catalog:
CREATE CATALOG ml;
When you create a catalog, a schema named `default` is automatically created within it.
Grant the `ml_team` group access to the `ml` catalog and the `ml.default` schema, along with the ability to create tables and views in the schema. To include all account-level users, you could use the group `account users`.
GRANT USAGE ON CATALOG ml TO `ml_team`;
GRANT USAGE, CREATE ON SCHEMA ml.default TO `ml_team`;
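If you manage grants for several groups, the same statements can be scripted from a Python notebook cell via `spark.sql`. The sketch below is illustrative — the helper name and group list are assumptions, not part of the product API; inside a Databricks notebook each generated statement would be passed to `spark.sql(...)`:

```python
# Hypothetical helper: build the GRANT statements for a list of groups.
def grant_statements(groups, catalog="ml", schema="default"):
    stmts = []
    for g in groups:
        stmts.append(f"GRANT USAGE ON CATALOG {catalog} TO `{g}`")
        stmts.append(f"GRANT USAGE, CREATE ON SCHEMA {catalog}.{schema} TO `{g}`")
    return stmts

# Example: grant access to the ml_team group.
for stmt in grant_statements(["ml_team"]):
    print(stmt)
    # spark.sql(stmt)  # uncomment when running inside a Databricks notebook
```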
Now, any user in the `ml_team` group can run the following example notebook.
To get started, import the following notebook.
To import the notebook:
Next to the notebook, click Copy link for import.
In your workspace, click Workspace.
Next to a folder, click the menu icon, then click Import.
Click URL, then paste in the link you copied.
The imported notebook appears in the folder you selected. Double-click the notebook name to open it.
At the top of the notebook, select your Databricks Machine Learning cluster to attach the notebook to it.
The notebook is divided into several high-level sections:
Read data from CSV files and write it to Unity Catalog.
Load the data into pandas DataFrames and clean it up.
Train a basic classification model.
Tune hyperparameters and optimize the model.
Write the results to a new table and share it with other users.
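The middle steps of the notebook (load into pandas, clean, train, tune) can be sketched locally with scikit-learn on synthetic data. Everything below is illustrative — the column names, the model choice (logistic regression), and the hyperparameter grid are assumptions, not the notebook's actual code:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the CSV data the notebook reads.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
})
df["label"] = (df["feature_a"] + df["feature_b"] > 0).astype(int)

# Basic cleanup: drop rows with missing values.
df = df.dropna()

X_train, X_test, y_train, y_test = train_test_split(
    df[["feature_a", "feature_b"]], df["label"], random_state=0
)

# Train a basic classifier, then tune a hyperparameter with grid search.
search = GridSearchCV(LogisticRegression(), {"C": [0.1, 1.0, 10.0]}, cv=3)
search.fit(X_train, y_train)
accuracy = search.score(X_test, y_test)
print(f"best C={search.best_params_['C']}, test accuracy={accuracy:.2f}")
```

In the Databricks notebook, the final step would write results back to Unity Catalog, for example with `spark.createDataFrame(...).write.saveAsTable("ml.default.results")`.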
To run a cell, click Run. To run the entire notebook, click Run All.