Databricks Connect for Python

Note

This article covers Databricks Connect for Databricks Runtime 13.0 and above.

This article demonstrates how to quickly get started with Databricks Connect by using Python and PyCharm. For the Scala version of this article, see Databricks Connect for Scala.

Databricks Connect enables you to connect popular IDEs such as PyCharm, notebook servers, and other custom applications to Databricks clusters. See What is Databricks Connect?.

Tutorial

To skip this tutorial and use a different IDE instead, see Next steps.

Requirements

To complete this tutorial, you must meet the following requirements:

  • Your target Databricks workspace and cluster must meet the requirements for Cluster configuration for Databricks Connect.

  • You must have your cluster ID available. To get your cluster ID, in your workspace, click Compute on the sidebar. In your web browser’s address bar, copy the string of characters between clusters and configuration in the URL.

  • You have PyCharm installed.

  • You have Python 3 installed on your development machine, and the minor version of your client Python installation is the same as the minor Python version of your Databricks cluster. The following table shows the Python version installed with each Databricks Runtime.

    Databricks Runtime version          Python version
    ---------------------------------   --------------
    13.0 ML - 14.0 ML, 13.0 - 14.0      3.10
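Two of the requirements above can be checked programmatically. The following is a minimal sketch; the workspace URL and cluster ID are hypothetical examples, and the exact layout of the Compute page URL can vary by workspace, so treat the parsing here as illustrative rather than definitive.

```python
import sys

# Check that the local Python minor version matches the cluster's
# (Databricks Runtime 13.x ships Python 3.10, per the table above).
required = (3, 10)
local = sys.version_info[:2]
print(f"Local Python: {local[0]}.{local[1]}; cluster needs {required[0]}.{required[1]}")

# Extract the cluster ID from a Compute page URL: it is the string of
# characters between "clusters" and "configuration" in the URL.
# Hypothetical example URL:
url = ("https://dbc-a1b2345c-d6e7.cloud.databricks.com"
       "/#setting/clusters/0123-456789-abcdefgh/configuration")
cluster_id = url.split("clusters/")[1].split("/configuration")[0]
print(cluster_id)  # 0123-456789-abcdefgh
```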

Step 1: Create a personal access token

This tutorial uses Databricks personal access token authentication and a Databricks configuration profile for authenticating with your Databricks workspace.

If you already have a Databricks personal access token and a matching Databricks configuration profile, skip to Step 3. If you are not sure whether you already have a Databricks personal access token, you can follow this step without affecting any other Databricks personal access tokens in your user account.

Note

Databricks Connect supports OAuth authentication in addition to Databricks personal access token authentication. For OAuth authentication setup and configuration details, see Set up the client.

Databricks Connect also supports basic authentication. However, Databricks does not recommend basic authentication in production.

To create a personal access token:

  1. In your Databricks workspace, click your Databricks username in the top bar, and then select User Settings from the drop-down menu.

  2. Click Developer.

  3. Next to Access tokens, click Manage.

  4. Click Generate new token.

  5. (Optional) Enter a comment that helps you to identify this token in the future, and change the token’s default lifetime of 90 days. To create a token with no lifetime (not recommended), leave the Lifetime (days) box empty (blank).

  6. Click Generate.

  7. Copy the displayed token to a secure location, and then click Done.

Note

Be sure to save the copied token in a secure location, and do not share it with others. A lost token cannot be recovered; you must repeat this procedure to create a new one. If you lose the token, or you believe that it has been compromised, Databricks strongly recommends that you immediately delete that token from your workspace by clicking the trash can (Revoke) icon next to the token on the Access tokens page.

If you are not able to create or use tokens in your workspace, this might be because your workspace administrator has disabled tokens or has not given you permission to create or use them. Contact your workspace administrator.

Step 2: Create an authentication configuration profile

Create a Databricks authentication configuration profile to store necessary information about your personal access token on your local machine. Databricks developer tools and SDKs can use this configuration profile to quickly authenticate with your Databricks workspace.

To create a profile:

Note

The following procedure creates a Databricks configuration profile with the name DEFAULT. If you already have a DEFAULT configuration profile that you want to use, then skip this procedure. Otherwise, this procedure overwrites your existing DEFAULT configuration profile.

To check whether you already have a DEFAULT configuration profile, and to view this profile’s settings if it exists, use the Databricks CLI to run the command databricks auth env --profile DEFAULT.

To create a configuration profile with a name other than DEFAULT, replace the DEFAULT part of --profile DEFAULT in the following databricks configure command with a different name for the configuration profile.

  1. Install the Databricks CLI, if it is not already installed, as follows:

    Using Homebrew, run the following two commands:

    brew tap databricks/tap
    brew install databricks
    
    Using curl (for Bash on Linux, or WSL on Windows), do the following:

    1. Install curl and zip. For more information, see your operating system’s documentation.

    2. Install the Databricks CLI by running the following command:

      curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
      
  2. Confirm that the Databricks CLI is installed by running the following command, which displays the current version of the installed Databricks CLI:

    databricks -v
    
  3. Create a Databricks configuration profile named DEFAULT that uses Databricks personal access token authentication. To do this, use the Databricks CLI to run the following command:

    databricks configure --profile DEFAULT
    
  4. For the prompt Databricks Host, enter your Databricks workspace instance URL, for example https://dbc-a1b2345c-d6e7.cloud.databricks.com.

  5. For the prompt Personal Access Token, enter the Databricks personal access token for your workspace.
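After you complete these prompts, the CLI stores the profile in the .databrickscfg file in your home directory. A sketch of what the resulting file typically looks like is shown below; the host is a hypothetical example, and the token value is elided:

```ini
[DEFAULT]
host  = https://dbc-a1b2345c-d6e7.cloud.databricks.com
token = dapi...
```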

Step 3: Create the project

  1. Start PyCharm.

  2. On the main menu, click File > New Project.

  3. For Location, click the folder icon, and complete the on-screen directions to specify the path to your new Python project.

  4. Expand Python interpreter: New environment.

  5. Click the New environment using option.

  6. In the drop-down list, select Virtualenv.

  7. Leave Location with the suggested path to the venv folder.

  8. For Base interpreter, use the drop-down list or click the ellipsis to specify the path to the Python interpreter from the preceding requirements.

  9. Click Create.

Create the PyCharm project

Step 4: Add the Databricks Connect package

  1. On PyCharm’s main menu, click View > Tool Windows > Python Packages.

  2. In the search box, enter databricks-connect.

  3. In the PyPI repository list, click databricks-connect.

  4. In the result pane’s latest drop-down list, select the version that matches your cluster’s Databricks Runtime version. For example, if your cluster has Databricks Runtime 13.3 LTS installed, select 13.3.0.

  5. Click Install.

  6. After the package installs, you can close the Python Packages window.

Install the Databricks Connect package

Step 5: Add code

  1. In the Project tool window, right-click the project’s root folder, and click New > Python File.

  2. Enter main.py as the file name, and then double-click Python file.

  3. Enter the following code into the file and then save the file:

    from databricks.connect import DatabricksSession
    
    spark = DatabricksSession.builder.getOrCreate()
    
    df = spark.read.table("samples.nyctaxi.trips")
    df.show(5)
    

Step 6: Set the DATABRICKS_CLUSTER_ID environment variable

  1. In the Project tool window, right-click the main.py file, and click Modify Run Configuration.

  2. On the dialog’s Configuration tab, next to Environment variables, click the edit (Edit environment variables) icon.

  3. In User environment variables, click the plus (Add) icon.

  4. For Name, enter DATABRICKS_CLUSTER_ID.

  5. For Value, enter the cluster ID from this tutorial’s requirements.

  6. Click OK.

  7. Close the Edit Run Configuration dialog by clicking OK.

Set the DATABRICKS_CLUSTER_ID environment variable
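Databricks Connect reads the DATABRICKS_CLUSTER_ID environment variable when building the session, which is why the run configuration above sets it. As a minimal sketch, you can verify the variable is set before building the session; the cluster ID below is a hypothetical example that stands in for the value your PyCharm run configuration provides:

```python
import os

# Hypothetical fallback for illustration only; in PyCharm, the run
# configuration sets the real value before main.py starts.
os.environ.setdefault("DATABRICKS_CLUSTER_ID", "0123-456789-abcdefgh")

# Fail fast with a clear message if the variable is missing or empty.
cluster_id = os.environ.get("DATABRICKS_CLUSTER_ID")
if not cluster_id:
    raise RuntimeError("DATABRICKS_CLUSTER_ID is not set")
print(f"Using cluster: {cluster_id}")
```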

Step 7: Run the code

  1. Start the target cluster in your remote Databricks workspace.

  2. After the cluster has started, on the main menu, click Run > Run ‘main’.

  3. In the Run tool window (View > Tool Windows > Run), in the Run tab’s main pane, the first 5 rows of the samples.nyctaxi.trips table appear.

Step 8: Debug the code

  1. With the cluster still running, in the preceding code, click the gutter next to df.show(5) to set a breakpoint.

  2. On the main menu, click Run > Debug ‘main’.

  3. In the Debug tool window (View > Tool Windows > Debug), in the Debugger tab’s Variables pane, expand the df and spark variable nodes to browse information about the code’s df and spark variables.

  4. In the Debug tool window’s sidebar, click the green arrow (Resume Program) icon.

  5. In the Debugger tab’s Console pane, the first 5 rows of the samples.nyctaxi.trips table appear.

Debug the PyCharm project

Next steps

To learn more about Databricks Connect, see articles such as the following: