Train a machine-learning model with Python from data in Unity Catalog

Preview

Unity Catalog is in Public Preview. To participate in the preview, contact your Databricks representative.

Unity Catalog allows you to apply fine-grained security to tables and to securely access them from any language, all while interacting seamlessly with other machine-learning components in Databricks. This article shows how to use Python to train a machine-learning model using data in Unity Catalog.

Requirements

  • Your Databricks account must be on the Premium plan.

  • You must be an account admin or the metastore admin for the metastore you use to train the model.

Create a Databricks Machine Learning cluster

Follow these steps to create a Single User Databricks Machine Learning cluster that can access data in Unity Catalog:

  1. Log in to the workspace as a workspace-level admin.

  2. In the Data Science & Engineering or Databricks Machine Learning persona, click Compute.

  3. Click Create cluster.

    1. Enter a name for the cluster.

    2. For Databricks runtime version:

      1. Click ML.

      2. Select either 10.3 ML (Scala 2.12, Spark 3.2.1) or higher, or 10.3 ML (GPU, Scala 2.12, Spark 3.2.1) or higher.

  4. Click Advanced Options. Set Security Mode to User Isolation or Single User. To run Python code, you must use Single User.

    User Isolation clusters can be shared by multiple users, but only SQL workloads are supported. Some advanced cluster features such as library installation, init scripts, and the DBFS Fuse mount are also disabled to ensure security isolation among cluster users.

    To use those advanced cluster features, or to run workloads in Python, Scala, or R, set the security mode to Single User. A Single User cluster can also run SQL workloads. The cluster can be used by only one user (by default, the owner of the cluster); other users cannot attach to it. Automated jobs should run in this mode, and the job's owner should be the cluster's owner. In this mode, view security cannot be enforced: a user who selects from a view runs the query with their own permissions.

    For more information about the features available in each security mode, see Cluster security mode.

  5. Click Create Cluster.
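
The same choices could also be written down as a cluster definition, for example to keep alongside job configuration. The sketch below only builds and prints such a definition. The field names (`spark_version`, `data_security_mode`, `single_user_name`) and the runtime key format are assumptions modeled on the Databricks Clusters REST API and should be checked against the API reference for your workspace. The cluster name, user, node type, and worker count are hypothetical placeholders.

```python
import json


def build_cluster_spec(name, single_user, node_type_id, gpu=False):
    """Return a cluster definition dict mirroring the UI choices above."""
    flavor = "gpu" if gpu else "cpu"
    return {
        "cluster_name": name,
        # 10.3 ML is the minimum runtime named in the steps; the key format is an assumption.
        "spark_version": f"10.3.x-{flavor}-ml-scala2.12",
        # Node types are cloud-specific, so the caller must supply one.
        "node_type_id": node_type_id,
        # Arbitrary worker count, for illustration only.
        "num_workers": 1,
        # Single User mode is the mode that supports Python workloads.
        "data_security_mode": "SINGLE_USER",
        "single_user_name": single_user,
    }


# Hypothetical values: replace them with ones valid in your workspace.
spec = build_cluster_spec("ml-unity-catalog", "someone@example.com", "example-node-type")
print(json.dumps(spec, indent=2))
```

Passing `gpu=True` switches the runtime key to the GPU variant of the same ML runtime.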

Create the catalog

Follow these steps to create a new catalog where your machine learning team can store their data assets.

  1. As an account admin or the metastore admin, log in to a workspace with the metastore assigned.

  2. Create a notebook or open the Databricks SQL editor.

  3. Run the following command to create the ml catalog:

    CREATE CATALOG ml;
    

    When you create a catalog, a schema named default is automatically created within it.

  4. Grant the ml_team group access to the ml catalog and the ml.default schema, along with the ability to create tables and views. To include all account-level users, you could instead use the group account users.

    GRANT USAGE ON CATALOG ml TO `ml_team`;
    GRANT USAGE, CREATE ON SCHEMA ml.default TO `ml_team`;
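
If you prefer to issue the grants from a Python notebook cell, the statements can be generated rather than typed by hand, which avoids mismatched group names between the two grants. This is a minimal sketch: `quote_identifier` and `build_grant_statements` are hypothetical helpers, not part of any Databricks library, and the final step assumes the `spark` session that Databricks notebooks predefine.

```python
def quote_identifier(name):
    """Wrap a name in backticks, doubling any backtick it already contains."""
    return "`" + name.replace("`", "``") + "`"


def build_grant_statements(catalog, schema, group):
    """Return the two GRANT statements from step 4 for the given names."""
    principal = quote_identifier(group)
    return [
        f"GRANT USAGE ON CATALOG {catalog} TO {principal}",
        f"GRANT USAGE, CREATE ON SCHEMA {catalog}.{schema} TO {principal}",
    ]


statements = build_grant_statements("ml", "default", "ml_team")
for statement in statements:
    print(statement)

# In a Databricks notebook, where `spark` is predefined, each statement
# could then be run with: spark.sql(statement)
```

Because the group name is quoted once and reused, both statements always target the same principal.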
    

Now, any user in the ml_team group can run the following example notebook.

Import the example notebook

To get started, import the following notebook.

Machine learning with Unity Catalog

To import the notebook:

  1. Next to the notebook, click Copy link for import.

  2. In your workspace, click Workspace.

  3. Next to a folder, click the down caret, then click Import.

  4. Click URL, then paste in the link you copied.

  5. The imported notebook appears in the folder you selected. Double-click the notebook name to open it.

  6. At the top of the notebook, select your Databricks Machine Learning cluster to attach the notebook to it.

The notebook is divided into several high-level sections:

  1. Setup.

  2. Read data from CSV files and write it to Unity Catalog.

  3. Load the data into pandas DataFrames and clean it up.

  4. Train a basic classification model.

  5. Tune hyperparameters and optimize the model.

  6. Write the results to a new table and share it with other users.
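
To give a sense of how the data-movement sections (2, 3, and 6) might look in PySpark, here is a minimal sketch. It is not the notebook's actual code. The helper names, the table name `example_data`, and the choice of `GRANT SELECT` for sharing are assumptions made for illustration. Each helper takes the notebook's `spark` session as an argument, so nothing touches a cluster until you call it.

```python
def qualified_name(catalog, schema, table):
    """Join the three Unity Catalog namespace levels into one table name."""
    return f"{catalog}.{schema}.{table}"


def csv_to_table(spark, csv_path, table_name):
    """Read a CSV file that has a header row and save it as a table."""
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    df.write.mode("overwrite").saveAsTable(table_name)


def table_to_pandas(spark, table_name):
    """Load a table into a pandas DataFrame for cleaning and training."""
    return spark.table(table_name).toPandas()


def share_table(spark, table_name, group):
    """Give a group read access to a results table."""
    spark.sql(f"GRANT SELECT ON TABLE {table_name} TO `{group}`")


# Hypothetical table in the ml catalog and default schema created earlier.
SOURCE_TABLE = qualified_name("ml", "default", "example_data")
print(SOURCE_TABLE)
```

In a notebook attached to the Single User cluster, you might call `csv_to_table(spark, <path to a CSV file>, SOURCE_TABLE)` followed by `table_to_pandas(spark, SOURCE_TABLE)`. Sharing a results table with `share_table` requires that you own the table or otherwise have permission to grant access to it.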

To run a cell, click Run. To run the entire notebook, click Run All.