Debug code by using Databricks Connect for the Databricks extension for Visual Studio Code

This article describes how to debug code by using the Databricks Connect integration in the Databricks extension for Visual Studio Code. See What is the Databricks extension for Visual Studio Code?

This information assumes that you have already installed and set up the Databricks extension for Visual Studio Code. See Install the Databricks extension for Visual Studio Code.

Note

This feature is Experimental.

Databricks Connect integration within the Databricks extension for Visual Studio Code supports only a portion of the Databricks client unified authentication standard. For more information, see Authentication setup for the Databricks extension for Visual Studio Code.

The Databricks extension for Visual Studio Code includes Databricks Connect. You can use Databricks Connect from within the Databricks extension for Visual Studio Code to run and do step-through debugging of individual Python (.py) files and Python Jupyter notebooks (.ipynb). The Databricks extension for Visual Studio Code includes Databricks Connect for Databricks Runtime 13.0 and above. Earlier versions of Databricks Connect are not supported.

Requirements

Before you can use Databricks Connect from within the Databricks extension for Visual Studio Code, you must first meet the Databricks Connect requirements. These requirements include a workspace enabled with Unity Catalog, a cluster that runs Databricks Runtime 13.0 or above with a cluster access mode of Single User or Shared, and a local Python installation whose major and minor versions match those of the Python version installed on the cluster.
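
For example, to confirm your local Python version before you begin, you can run a short check like the following and compare the output against the Python version listed for your cluster’s Databricks Runtime:

    # Prints the major and minor version of your local Python installation,
    # for example 3.10, which is the version that Databricks Runtime 13.0 uses.
    import sys

    print(f"{sys.version_info.major}.{sys.version_info.minor}")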

Step 1: Create a Python virtual environment

Create and activate a Python virtual environment for your Python code project. Python virtual environments help to make sure that your code project is using compatible versions of Python and Python packages (in this case, the Databricks Connect package). The instructions and examples in this article use venv for Python virtual environments. To create a Python virtual environment using venv:

  1. From your Visual Studio Code terminal (View > Terminal), set to the root directory of your Python code project, run the following command. This command instructs venv to use Python for the virtual environment and creates the virtual environment’s supporting files in a hidden directory named .venv within the root directory of your Python code project:

    # Linux and macOS
    python3.10 -m venv ./.venv
    # Windows
    python3.10 -m venv .\.venv
    

    The preceding command uses Python 3.10, which matches the major and minor version of Python that Databricks Runtime 13.0 uses. Be sure to use the major and minor version of Python that matches your cluster’s installed version of Python.

  2. If Visual Studio Code displays the message “We noticed a new environment has been created. Do you want to select it for the workspace folder?”, click Yes.

  3. Use venv to activate the virtual environment. See the venv documentation for the correct command to use, based on your operating system and terminal type. For example, on macOS running zsh:

    source ./.venv/bin/activate
    

    You will know that your virtual environment is activated when the virtual environment’s name (for example, .venv) displays in parentheses just before your terminal prompt.

    To deactivate the virtual environment at any time, run the command deactivate.

    You will know that your virtual environment is deactivated when the virtual environment’s name no longer displays in parentheses just before your terminal prompt.

Step 2: Update your Python code to establish a debugging context

To establish a debugging context between Databricks Connect and your cluster, your Python code must initialize the DatabricksSession class by calling DatabricksSession.builder.getOrCreate().

Note that you do not need to specify settings such as your workspace’s instance name, an access token, or your cluster’s ID and port number when you initialize the DatabricksSession class. Databricks Connect gets this information from the configuration details that you already provided through the Databricks extension for Visual Studio Code earlier in this article.

For additional information about initializing the DatabricksSession class, see the Databricks Connect code examples.
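
For example, a minimal Python file might look like the following sketch. The table name samples.nyctaxi.trips refers to a Databricks sample dataset and is illustrative only; substitute a table that exists in your workspace:

    from databricks.connect import DatabricksSession

    # No workspace instance name, access token, or cluster ID is needed here.
    # Databricks Connect reads them from the extension’s configuration.
    spark = DatabricksSession.builder.getOrCreate()

    # Illustrative query against a sample table; replace with your own table.
    df = spark.read.table("samples.nyctaxi.trips")
    df.show(5)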

Important

If you use the Databricks extension for Visual Studio Code to set the authentication type to personal access tokens, then the extension sets a related SPARK_REMOTE environment variable with debugging context settings for use by Databricks Connect. These debugging context settings include the related workspace instance name, personal access token, and cluster ID.

In Databricks Connect, you can use the DatabricksSession or SparkSession class along with SPARK_REMOTE and personal access token authentication to quickly and easily establish the debugging context programmatically. For other supported Databricks authentication types, you can use only the DatabricksSession class to establish the debugging context.
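
For example, with personal access token authentication set through the extension, a minimal sketch that relies on the SPARK_REMOTE environment variable might look like this:

    from pyspark.sql import SparkSession

    # Because the extension has set SPARK_REMOTE, the standard builder
    # connects to the remote cluster instead of starting a local session.
    spark = SparkSession.builder.getOrCreate()

    print(spark.range(3).collect())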

For more information, see Set up the client in the Databricks Connect documentation.

Step 3: Enable Databricks Connect

With the extension opened and the Workspace section configured for your code project, do the following:

  1. In the Visual Studio Code status bar, click the red Databricks Connect disabled button.

  2. If the Cluster section is not already configured in the extension, the following message appears: “Please attach a cluster to use Databricks Connect.” Click Attach Cluster and select a cluster that meets the Databricks Connect requirements.

  3. If the Cluster section is configured but the cluster is not compatible with Databricks Connect, click the red Databricks Connect disabled button, click Attach Cluster, and select a compatible cluster.

  4. If the Databricks Connect package is not already installed, the following message appears: “For interactive debugging and autocompletion you need Databricks Connect. Would you like to install it in the environment <environment-name>?” Click Install.

  5. Confirm that the blue Databricks Connect enabled button appears in the Visual Studio Code status bar.

    If the red Databricks Connect disabled button still appears, click it, and complete the on-screen instructions to get the blue Databricks Connect enabled button to appear.

  6. After the blue Databricks Connect enabled button appears, you are ready to use Databricks Connect.

Note

You do not need to configure the extension’s Sync Destination section in order for your code project to use Databricks Connect.

Step 4: Run or debug your Python code

After you enable Databricks Connect for your code project, run or debug your Python file or notebook as follows.

To run or debug a Python (.py) file:

  1. In your code project, open the Python file that you want to run or debug.

  2. Set any debugging breakpoints within the Python file.

  3. In the file editor’s title bar, click the drop-down arrow next to the play (Run or Debug) icon. Then in the drop-down list, select Debug Python File. This choice supports step-through debugging, breakpoints, watch expressions, call stacks, and similar features. This choice uses Databricks Connect to run Python code locally, run PySpark code on the cluster in the remote workspace, and send remote responses back to Visual Studio Code for local debugging.

    Note

    Other choices, which do not support debugging, include:

    • Run Python File to use Databricks Connect to run the file, but without debugging support. This choice uses Databricks Connect to run the file’s Python code locally, run its PySpark code on the cluster in the remote workspace, and send the remote response to the Visual Studio Code Terminal.

    • Upload and Run File on Databricks to send the file to the remote workspace, run the file’s Python and PySpark code on the remote cluster in the workspace, and send the remote response to the Visual Studio Code Terminal. This choice does not use Databricks Connect.

    • Run File as Workflow on Databricks to send the file to the remote workspace, run the file’s Python and PySpark code on the cluster that is associated with an automated Databricks job, and send the results to an editor in Visual Studio Code. This choice does not use Databricks Connect.


    The Run Current File in Interactive Window option, if available, attempts to run the file locally in a special Visual Studio Code interactive editor. Databricks does not recommend this option.

To run or debug a Python Jupyter notebook (.ipynb):

  1. In your code project, open the Python Jupyter notebook that you want to run or debug. Make sure the Python file is in Jupyter notebook format and has the extension .ipynb.

    Tip

    You can create a new Python Jupyter notebook by running the >Create: New Jupyter Notebook command from within the Command Palette.

  2. Click Run All Cells to run all cells without debugging, Execute Cell to run an individual corresponding cell without debugging, or Run by Line to run an individual cell line-by-line with limited debugging, with variable values displayed in the Jupyter panel (View > Open View > Jupyter).

    For full debugging within an individual cell, set breakpoints, and then click Debug Cell in the menu next to the cell’s Run button.

    After you click any of these options, you might be prompted to install missing Python Jupyter notebook package dependencies. Click to install.

    For more information, see Jupyter Notebooks in VS Code.

Additional notebook features with Databricks Connect

Note

These features are Experimental.

These features are available in Databricks extension for Visual Studio Code 1.1.0 and above, unless otherwise noted.

The Databricks extension for Visual Studio Code supports the following features in Databricks notebooks that are run by the extension through its Databricks Connect integration. To enable these features, turn on the notebooks.dbconnect feature in Experiments: Opt Into in Visual Studio Code. See Settings for the Databricks extension for Visual Studio Code.

The following notebook globals are enabled:

  • spark, representing an instance of databricks.connect.DatabricksSession, is preconfigured to instantiate DatabricksSession by getting Databricks authentication credentials from the extension.

  • udf, preconfigured as an alias for pyspark.sql.functions.udf, which is the default for Python UDFs.

  • sql, preconfigured as an alias for spark.sql. spark, as described earlier, represents a preconfigured instance of databricks.connect.DatabricksSession.

  • dbutils, preconfigured as an instance of Databricks Utilities, which is imported from databricks-sdk and is instantiated by getting Databricks authentication credentials from the extension.

    Note

    Only a subset of Databricks Utilities is supported for notebooks with Databricks Connect.

  • display, preconfigured as an alias for the Jupyter builtin IPython.display.display.

  • displayHTML, preconfigured as an alias for dbruntime.display.displayHTML, which is an alias for IPython.display.HTML.
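
For example, a notebook cell that relies on these preconfigured globals might look like the following sketch; the query and path are illustrative:

    # No imports are needed; spark, sql, display, and dbutils are
    # injected into the notebook by the extension.
    df = sql("SELECT 1 AS id, 'hello' AS greeting")  # alias for spark.sql
    display(df)                                      # alias for IPython.display.display
    print(dbutils.fs.ls("/"))                        # list the DBFS root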

The following notebook magics are available:

  • %fs, which is the same as making dbutils.fs calls.

  • %sh, which runs a command by using the cell magic %%script on the local machine. This does not run the command in the remote Databricks workspace.

  • %md and %md-sandbox, which run the cell magic %%markdown.

  • %sql, which runs spark.sql.

  • %pip, which runs pip install on the local machine. This does not run pip install in the remote Databricks workspace.

  • %run, which runs another notebook. This notebook magic is available in Databricks extension for Visual Studio Code version 1.1.2 and above.

  • # MAGIC, which Databricks uses to mark magic commands within source-format notebook files. This notebook magic is available in Databricks extension for Visual Studio Code version 1.1.2 and above.
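
For example, each of the following illustrative snippets would go in its own notebook cell:

    %fs ls /

    %sql
    SELECT current_date() AS today

    %md
    This text is rendered as **Markdown**.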

Additional features include:

  • Spark DataFrames are converted to pandas DataFrames, which are displayed in Jupyter table format.

Limitations include:

  • The notebook magics %r and %scala are not supported and display an error if called.

  • The notebook magic %sql does not support some commands, such as SHOW TABLES.