Databricks extension for Visual Studio Code

Preview

This feature is in Public Preview.

The Databricks extension for Visual Studio Code enables you to connect to your remote Databricks workspaces from the Visual Studio Code integrated development environment (IDE) running on your local development machine. Through these connections, you can:

  • Synchronize local code that you develop in Visual Studio Code with code in your remote workspaces.

  • Run local Python code files from Visual Studio Code on Databricks clusters in your remote workspaces.

  • Run local Python code files (.py) and Python, R, Scala, and SQL notebooks (.py, .ipynb, .r, .scala, and .sql) from Visual Studio Code as automated Databricks jobs in your remote workspaces.

Note

The Databricks extension for Visual Studio Code supports running R, Scala, and SQL notebooks as automated jobs but does not provide any deeper support for these languages within Visual Studio Code.

Before you begin

Before you can use the Databricks extension for Visual Studio Code, your Databricks workspace and your local development machine must meet the following requirements. You must also have an access token to authenticate with Databricks.

Workspace requirements

You must have at least one Databricks workspace available, and the workspace must meet the following requirements:

  • The workspace must contain at least one Databricks cluster. If you do not have a cluster available, you can create a cluster now or after you install the Databricks extension for Visual Studio Code.

    Note

    Databricks recommends that you create a Personal Compute cluster. This enables you to start running workloads immediately, minimizing compute management overhead.

    Databricks SQL warehouses are not supported by this extension.

  • You must enable Files in Repos for the workspace.

  • The Databricks extension for Visual Studio Code relies on Databricks Repos in your workspace. Databricks recommends creating one repository for each combination of project and user. After you install the Databricks extension for Visual Studio Code, you can use it to create a local workspace repo; see Create a new repo.

    Note

    The Databricks extension for Visual Studio Code works only with repos that it creates. You cannot use existing repos in your workspace unless they were created earlier with the extension itself.

Access token

You must have a Databricks personal access token. If you do not have one available, you can generate a personal access token now.

Local development machine requirements

You must have the following on your local development machine:

  • Visual Studio Code version 1.69.1 or higher. To view your installed version, click Code > About Visual Studio Code from the main menu on macOS, or Help > About on Linux or Windows. To download, install, and configure Visual Studio Code, see Setting up Visual Studio Code.

  • Visual Studio Code must be configured for Python coding, including availability of a Python interpreter. For details, see Getting Started with Python in VS Code.

  • A Databricks configuration profile that references your Databricks personal access token. If you do not have one available, you can create a configuration profile after you install the Databricks extension for Visual Studio Code.

  • The Databricks extension for Visual Studio Code. For setup instructions, see the next section.

Getting started

Before you can use the Databricks extension for Visual Studio Code, you must download, install, open, and configure the extension, as follows.

Install and open the extension

  1. In Visual Studio Code, open the Extensions view (View > Extensions from the main menu).

  2. In Search Extensions in Marketplace, enter Databricks.

  3. Click the Databricks entry.

    Note

    There are several entries with Databricks in their titles. Be sure to click the one with only Databricks in its title and a blue check mark icon next to Databricks.

  4. Click Install.

  5. Restart Visual Studio Code.

  6. Open the extension: on the sidebar, click the Databricks icon.

Configure the extension

To use the extension, you must set the Databricks configuration profile for Databricks authentication. You must also set the cluster and repository.

Set up authentication

With the extension opened, do the following:

  1. Open your code project’s folder in Visual Studio Code (File > Open Folder). If you do not have a code project, use your terminal for Linux or macOS, or PowerShell or Command Prompt for Windows, to create a folder, switch to the new folder, and then open Visual Studio Code from that folder. For example:

    For Linux or macOS:

    mkdir databricks-demo
    cd databricks-demo
    code .

    For Windows:

    md databricks-demo
    cd databricks-demo
    code .


    Tip

    If you get the error command not found: code, see Launching from the command line in the Visual Studio Code documentation.

  2. In the Configuration pane, click Configure Databricks.

    Note

    If Configure Databricks is not visible, click the gear (Configure workspace) icon next to Configuration instead.

    Gear icon to configure workspace settings 1
  3. In the Command Palette, for Databricks Host, enter your workspace instance URL, for example https://dbc-a1b2345c-d6e7.cloud.databricks.com. Then press Enter.

  4. Do one of the following:

  • If the Databricks extension for Visual Studio Code detects an existing matching Databricks configuration profile for the URL, you can select it in the list.

  • Click Edit Databricks profiles to open your Databricks configuration profiles file and create a configuration profile manually, as sketched below.
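
A configuration profile is an entry in your .databrickscfg file, which is in your home directory by default. As a minimal sketch, assuming a profile named DEFAULT and placeholder values:

[DEFAULT]
host  = https://dbc-a1b2345c-d6e7.cloud.databricks.com
token = <your-personal-access-token>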

The extension creates a hidden folder in your project named .databricks if it does not already exist. The extension also creates in this folder a file named project.json if it does not already exist. This file contains the URL that you entered, along with some Databricks authentication details that the Databricks extension for Visual Studio Code needs to operate.
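
For example, with the cluster and repository also set (as described later in this article), project.json might look like the following sketch. The clusterId and workspacePath fields appear later in this article; the host field name and the exact layout are assumptions that can vary by extension version:

{
  "host": "https://dbc-a1b2345c-d6e7.cloud.databricks.com/",
  "clusterId": "1234-567890-abcd12e3",
  "workspacePath": "/Workspace/Repos/someone@example.com/my-repo.ide"
}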

Set the cluster

With the extension and your code project opened, and a Databricks configuration profile already set, select an existing Databricks cluster that you want to use, or create a new Databricks cluster and use it.

Use an existing cluster

If you have an existing Databricks cluster that you want to use, do one of the following:

  • In the Clusters pane, do the following:

    1. Next to the cluster that you want to use, click the plug (Attach cluster) icon.

      Attach cluster icon 1

      Tip

      If the cluster is not visible in the Clusters pane, click the filter (Filter clusters) icon to see All clusters, clusters that are Created by me, or Running clusters. Or, click the Refresh icon next to the filter icon.

      Filter clusters icon 1

    The extension adds the cluster’s ID to your code project’s .databricks/project.json file, for example "clusterId": "1234-567890-abcd12e3".

    This procedure is complete.

  • In the Configuration pane, do the following:

    1. Next to Cluster, click the gear (Configure cluster) icon.

      Configure cluster icon 1
    2. In the Command Palette, click the cluster that you want to use.

    The extension adds the cluster’s ID to your code project’s .databricks/project.json file, for example "clusterId": "1234-567890-abcd12e3".

    This procedure is complete.

Create a new cluster

If you do not have an existing Databricks cluster, or you want to create a new one and use it, do the following:

  1. In the Configuration pane, next to Cluster, click the gear (Configure cluster) icon.

    Configure cluster icon 2
  2. In the Command Palette, click Create New Cluster.

  3. When prompted to open the external website (your Databricks workspace), click Open.

  4. If prompted, sign in to your Databricks workspace.

  5. Follow the instructions to create a cluster.

    Note

    Databricks recommends that you create a Personal Compute cluster. This enables you to start running workloads immediately, minimizing compute management overhead.

  6. After the cluster is created and is running, go back to Visual Studio Code.

  7. Do one of the following:

    • In the Clusters pane, next to the cluster that you want to use, click the plug (Attach cluster) icon.

      Attach cluster icon 2

      Tip

      If the cluster is not visible, click the filter (Filter clusters) icon to see All clusters, clusters that are Created by me, or Running clusters. Or, click the Refresh icon.

      Filter clusters icon 2

      The extension adds the cluster’s ID to the code project’s .databricks/project.json file, for example "clusterId": "1234-567890-abcd12e3".

      This procedure is complete.

    • In the Configuration pane, next to Cluster, click the gear (Configure cluster) icon.

      Configure cluster icon 3

      In the Command Palette, click the cluster that you want to use.

      The extension adds the cluster’s ID to the code project’s .databricks/project.json file, for example "clusterId": "1234-567890-abcd12e3".

Set the repository

With the extension and your code project opened, and a Databricks configuration profile already set, use the Databricks extension for Visual Studio Code to create a new repository in Databricks Repos and use it, or select an existing repository in Databricks Repos that you created earlier with the extension and want to reuse instead.

Note

The Databricks extension for Visual Studio Code works only with repositories that it creates. You cannot use an existing repository in your workspace unless you used the Databricks extension for Visual Studio Code earlier to create that repository, and you now want to reuse that repository in your current Visual Studio Code project.

Create a new repo

To create a new repository, do the following:

  1. In the Configuration pane, next to Sync Destination, click the gear (Configure sync destination) icon.

    Configure sync destination icon 1
  2. In the Command Palette, click Create New Sync Destination.

  3. Type a name for the new repository in Databricks Repos, and then press Enter.

    The extension appends the characters .ide to the end of the repo’s name and then adds the repo’s workspace path to the code project’s .databricks/project.json file, for example "workspacePath": "/Workspace/Repos/someone@example.com/my-repo.ide".

    Note

    If the remote repo’s name does not match your local code project’s name, a warning icon appears with this message: The remote repo name does not match the current Visual Studio Code workspace name. You can ignore this warning if you intend to synchronize your local code project with a repo in your remote Databricks workspace and the names of your local code project and remote repo do not match.

  4. After you set the repository, begin synchronizing with the repository by clicking the Start synchronization icon next to Sync Destination.

    Start synchronization icon 1

Warning

After you set the repository and then begin synchronizing, any existing files in your remote workspace repo that have the same filenames in your local code project will have their contents forcibly overwritten. This is because the Databricks extension for Visual Studio Code treats the files in your local code project as the “single source of truth” for both your local code project and its connected remote repo within your Databricks workspace.

Important

The Databricks extension for Visual Studio Code only performs one-way, automatic synchronization of file changes from your local Visual Studio Code project to the related repository in your remote Databricks workspace. Note that the reverse is not true: file changes in your remote Databricks workspace are not automatically synchronized to your local Visual Studio Code project. Therefore, Databricks does not recommend that you initiate file changes in your Databricks workspace. If you absolutely cannot avoid making such workspace-initiated changes, then you must take special action to have those changes show in your local Visual Studio Code project. For instructions, see Initiate repo changes from the workspace instead of from Visual Studio Code.

Reuse an existing repo

If you have an existing repository in Databricks Repos that you created earlier with the Databricks extension for Visual Studio Code and want to reuse in your current Visual Studio Code project, then do the following:

  1. In the Configuration pane, next to Sync Destination, click the gear (Configure sync destination) icon.

    Configure sync destination icon 2
  2. In the Command Palette, select the repository’s name from the list.

    The extension adds the repo’s workspace path to the code project’s .databricks/project.json file, for example "workspacePath": "/Workspace/Repos/someone@example.com/my-repo.ide".

    Note

    If the remote repo’s name does not match your local code project’s name, a warning icon appears with this message: The remote repo name does not match the current Visual Studio Code workspace name. You can ignore this warning if you intend to synchronize your local code project with a repo in your remote Databricks workspace and the names of your local code project and remote repo do not match.

  3. After you set the repository, begin synchronizing with the repository by clicking the Start synchronization icon next to Sync Destination.

    Start synchronization icon 2

Warning

After you set the repository and then begin synchronizing, any existing files in your remote workspace repo that have the same filenames in your local code project will have their contents forcibly overwritten. This is because the Databricks extension for Visual Studio Code treats the files in your local code project as the “single source of truth” for both your local code project and its connected remote repo within your Databricks workspace.

Important

The Databricks extension for Visual Studio Code only performs one-way, automatic synchronization of file changes from your local Visual Studio Code project to the related repository in your remote Databricks workspace. Note that the reverse is not true: file changes in your remote Databricks workspace are not automatically synchronized to your local Visual Studio Code project. Therefore, Databricks does not recommend that you initiate file changes in your Databricks workspace. If you absolutely cannot avoid making such workspace-initiated changes, then you must take special action to have those changes show in your local Visual Studio Code project. For instructions, see Initiate repo changes from the workspace instead of from Visual Studio Code.

Initiate repo changes from the workspace instead of from Visual Studio Code

The Databricks extension for Visual Studio Code only performs one-way, automatic synchronization of file changes from your local Visual Studio Code project to the related repository in your remote Databricks workspace. Note that the reverse is not true: file changes in your remote Databricks workspace are not automatically synchronized to your local Visual Studio Code project. Therefore, Databricks does not recommend that you initiate file changes in your Databricks workspace. If you absolutely cannot avoid making such workspace-initiated changes, then you must also do the following to have those changes show in your local Visual Studio Code project:

  1. Create a new, empty repository with a supported Git provider. This new, empty repository must have no prior commits. To learn how to create this repository, see your Git provider’s documentation.

  2. Install Git on your local development machine, if you have not done so already.

  3. In Visual Studio Code, open the Command Palette (View > Command Palette), type clone, and then select Git: Clone.

  4. In the Command Palette, for Provide repository URL or pick a repository source, enter the repository’s clone URL as specified by your Git provider, and then press Enter.

  5. Select a parent folder on your local development machine in which to clone the repository’s contents (for example, the root folder of your local user’s home directory), and then click Select as Repository Destination.

  6. When prompted to open the cloned repository, click Open.

  7. From your new, empty code project folder that was just opened, use the Databricks extension for Visual Studio Code to Set up authentication and then Set the cluster.

  8. In the root of your new, empty code project folder, create a .gitignore file, add a .databricks/ entry to this file, and then save this file. This prevents the hidden .databricks/ folder and its contents that the Databricks extension for Visual Studio Code generates from accidentally being checked into source control.

  9. Use the Databricks extension for Visual Studio Code to Create a new repo in your remote Databricks workspace and then connect to it.

  10. Switch the existing connection for the new repo in your remote Databricks workspace to the new, empty repository with your Git provider, as follows:

    1. First, configure your Databricks workspace with your Git provider credentials, by following the instructions in Add Git credentials to Databricks.

    2. In the Databricks extension for Visual Studio Code in the Configuration pane, next to Sync Destination, click the linked chain (Open link externally) icon.

    3. When prompted to open the external website, click Open.

    4. If prompted, follow the on-screen instructions to sign in to your Databricks workspace.

    5. In the workspace’s Repos pane, click the drop-down arrow next to the new repo’s name, and then click Git.

    6. On the Settings tab, for Git repository URL, replace the existing value of https://github.com/databricks/databricks-empty-ide-project.git with the repository clone URL for the new, empty repository with your Git provider.

    7. For Git provider, select the name of your Git provider.

    8. Click Save.

  11. Create or copy over any files that you want to work on into your new, empty code project folder in Visual Studio Code. Do not create or copy any files yet into the repo in your Databricks workspace or into the repo with your Git provider.

  12. In the Databricks extension for Visual Studio Code, in the Configuration pane, next to Sync Destination, click the arrowed circle (Start synchronization) icon. The extension copies the files from your code project folder in Visual Studio Code into the new repo in your Databricks workspace.

  13. As you continue to make any changes to your code project folder in Visual Studio Code, these changes are automatically synchronized to the new repo in your remote Databricks workspace.

  14. If you make any changes to the new repo in your remote Databricks workspace, commit and push those changes to the connected repository with your Git provider.

  15. Pull the changes from the repository with your Git provider into your local Visual Studio Code project, as sketched below. For instructions, see Working with GitHub in VS Code or your Git provider’s documentation.
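
For example, from your local project’s root in the terminal, step 15 might look like the following sketch (the remote name origin and the branch name main are assumptions that depend on your repository):

git pull origin main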

Development tasks

After you configure the Databricks extension for Visual Studio Code, you can use the extension to run a local Python file on a cluster in a remote Databricks workspace, or run a local Python file or local Python, R, Scala, or SQL notebook as a job in a remote workspace, as follows.

If you do not have a local file or notebook available to test the Databricks extension for Visual Studio Code with, here is some basic code that you can add to your project. There is an example for a Python code file and for each supported notebook language.

For a Python code file (.py):

from pyspark.sql import SparkSession
from pyspark.sql.types import *

spark = SparkSession.builder.getOrCreate()

schema = StructType([
  StructField('CustomerID', IntegerType(), False),
  StructField('FirstName',  StringType(),  False),
  StructField('LastName',   StringType(),  False)
])

data = [
  [ 1000, 'Mathijs', 'Oosterhout-Rijntjes' ],
  [ 1001, 'Joost',   'van Brunswijk' ],
  [ 1002, 'Stan',    'Bokenkamp' ]
]

customers = spark.createDataFrame(data, schema)
customers.show()

# Output:
#
# +----------+---------+-------------------+
# |CustomerID|FirstName|           LastName|
# +----------+---------+-------------------+
# |      1000|  Mathijs|Oosterhout-Rijntjes|
# |      1001|    Joost|      van Brunswijk|
# |      1002|     Stan|          Bokenkamp|
# +----------+---------+-------------------+

For a Python notebook (.py):

# Databricks notebook source
from pyspark.sql.types import *

schema = StructType([
  StructField('CustomerID', IntegerType(), False),
  StructField('FirstName',  StringType(),  False),
  StructField('LastName',   StringType(),  False)
])

data = [
  [ 1000, 'Mathijs', 'Oosterhout-Rijntjes' ],
  [ 1001, 'Joost',   'van Brunswijk' ],
  [ 1002, 'Stan',    'Bokenkamp' ]
]

customers = spark.createDataFrame(data, schema)
customers.show()

# Output:
#
# +----------+---------+-------------------+
# |CustomerID|FirstName|           LastName|
# +----------+---------+-------------------+
# |      1000|  Mathijs|Oosterhout-Rijntjes|
# |      1001|    Joost|      van Brunswijk|
# |      1002|     Stan|          Bokenkamp|
# +----------+---------+-------------------+

For an R notebook (.r):

# Databricks notebook source
library(SparkR)

sparkR.session()

data <- list(
          list(1000L, "Mathijs", "Oosterhout-Rijntjes"),
          list(1001L, "Joost",   "van Brunswijk"),
          list(1002L, "Stan",    "Bokenkamp")
        )

schema <- structType(
            structField("CustomerID", "integer"),
            structField("FirstName",  "string"),
            structField("LastName",   "string")
          )

df <- createDataFrame(
        data   = data,
        schema = schema
      )

showDF(df)

# Output:
#
# +----------+---------+-------------------+
# |CustomerID|FirstName|           LastName|
# +----------+---------+-------------------+
# |      1000|  Mathijs|Oosterhout-Rijntjes|
# |      1001|    Joost|      van Brunswijk|
# |      1002|     Stan|          Bokenkamp|
# +----------+---------+-------------------+

For a Scala notebook (.scala):

// Databricks notebook source
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

val schema = StructType(Array(
  StructField("CustomerID", IntegerType, false),
  StructField("FirstName",  StringType, false),
  StructField("LastName",   StringType, false)
))

val data = List(
  Row(1000, "Mathijs", "Oosterhout-Rijntjes"),
  Row(1001, "Joost",   "van Brunswijk"),
  Row(1002, "Stan",    "Bokenkamp")
)

val rdd = spark.sparkContext.makeRDD(data)
val customers = spark.createDataFrame(rdd, schema)

display(customers)

// Output:
//
// +----------+---------+-------------------+
// |CustomerID|FirstName|           LastName|
// +----------+---------+-------------------+
// |      1000|  Mathijs|Oosterhout-Rijntjes|
// |      1001|    Joost|      van Brunswijk|
// |      1002|     Stan|          Bokenkamp|
// +----------+---------+-------------------+

For a SQL notebook (.sql):

-- Databricks notebook source
CREATE TABLE IF NOT EXISTS zzz_customers(
  CustomerID INT,
  FirstName  STRING,
  LastName   STRING
);

-- COMMAND ----------
INSERT INTO zzz_customers VALUES
  (1000, "Mathijs", "Oosterhout-Rijntjes"),
  (1001, "Joost",   "van Brunswijk"),
  (1002, "Stan",    "Bokenkamp");

-- COMMAND ----------
SELECT * FROM zzz_customers;

-- Output:
--
-- +----------+---------+-------------------+
-- |CustomerID|FirstName|           LastName|
-- +----------+---------+-------------------+
-- |      1000|  Mathijs|Oosterhout-Rijntjes|
-- |      1001|    Joost|      van Brunswijk|
-- |      1002|     Stan|          Bokenkamp|
-- +----------+---------+-------------------+

-- COMMAND ----------
DROP TABLE zzz_customers;

Enable PySpark and Databricks Utilities code completion

To enable IntelliSense (also known as code completion) in the Visual Studio Code code editor for PySpark, Databricks Utilities, and related globals such as spark and dbutils, do the following with your code project opened:

  1. On the Command Palette (View > Command Palette), type Databricks: Configure autocomplete for Databricks globals and press Enter.

  2. Follow the on-screen prompts to allow the Databricks extension for Visual Studio Code to install PySpark for your project, and to add or modify the __builtins__.pyi file for your project to enable Databricks Utilities.

You can now use globals such as spark and dbutils in your code without declaring any related import statements beforehand.
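
For example, after completing these steps, a snippet like the following gets completion for spark and dbutils without any import statements (a minimal sketch; it assumes the code runs on a Databricks cluster, where these globals are defined):

# spark and dbutils are provided by the Databricks runtime; no imports needed.
spark.range(5).show()

# List the DBFS root with Databricks Utilities.
for item in dbutils.fs.ls('/'):
  print(item.path)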

Run a Python file on a cluster

With the extension and your code project opened, and a Databricks configuration profile, cluster, and repo already set, do the following:

  1. In your code project, open the Python file that you want to run on the cluster.

  2. Do one of the following:

    • In Explorer view (View > Explorer), right-click the file, and then select Run File on Databricks from the context menu.

      Run File on Databricks context menu command
    • In the file editor’s title bar, click the drop-down arrow next to the play (Run or Debug) icon. Then in the drop-down list, click Run File on Databricks.

      Run File on Databricks editor command

The file runs on the cluster, and any output is printed to the Debug Console (View > Debug Console).

Run a Python file as a job

With the extension and your code project opened, and a Databricks configuration profile, cluster, and repo already set, do the following:

  1. In your code project, open the Python file that you want to run as a job.

  2. Do one of the following:

    • In Explorer view (View > Explorer), right-click the file, and then select Run File as Workflow on Databricks from the context menu.

      Run File as Workflow on Databricks context menu command 1
    • In the file editor’s title bar, click the drop-down arrow next to the play (Run or Debug) icon. Then in the drop-down list, click Run File as Workflow on Databricks.

      Run File as Workflow on Databricks editor command 1

A new editor tab appears, titled Databricks Job Run. The file runs as a job in the workspace, and any output is printed to the new editor tab’s Output area.

To view information about the job run, click the Task run ID link in the new Databricks Job Run editor tab. Your workspace opens and the job run’s details are displayed in the workspace.

Run a Python notebook as a job

With the extension and your code project opened, and a Databricks configuration profile, cluster, and repo already set, do the following:

  1. In your code project, open the Python notebook that you want to run as a job.

    Tip

    To create a Python notebook file in Visual Studio Code, begin by clicking File > New File, select Python File, and save the new file with a .py file extension.

    To turn the .py file into a Databricks notebook, add the special comment # Databricks notebook source to the beginning of the file, and add the special comment # COMMAND ---------- before each cell. For more information, see Import a file and convert it to a notebook.

    A Python code file formatted as a Databricks notebook1
  2. Do one of the following:

    • In Explorer view (View > Explorer), right-click the notebook file, and then select Run File as Workflow on Databricks from the context menu.

      Run File as Workflow on Databricks context menu command 1
    • In the notebook file editor’s title bar, click the drop-down arrow next to the play (Run or Debug) icon. Then in the drop-down list, click Run File as Workflow on Databricks.

      Run File as Workflow on Databricks editor command 2

A new editor tab appears, titled Databricks Job Run. The notebook runs as a job in the workspace, and the notebook and its output are displayed in the new editor tab’s Output area.

To view information about the job run, click the Task run ID link in the Databricks Job Run editor tab. Your workspace opens and the job run’s details are displayed in the workspace.

Run an R, Scala, or SQL notebook as a job

With the extension and your code project opened, and a Databricks configuration profile, cluster, and repo already set, do the following:

  1. In your code project, open the R, Scala, or SQL notebook that you want to run as a job.

    Tip

    To create an R, Scala, or SQL notebook file in Visual Studio Code, begin by clicking File > New File, select Python File, and save the new file with a .r, .scala, or .sql file extension, respectively.

    To turn the .r, .scala, or .sql file into a Databricks notebook, add the special comment Databricks notebook source to the beginning of the file and add the special comment COMMAND ---------- before each cell. Be sure to use the correct comment marker for each language (# for R, // for Scala, and -- for SQL). For more information, see Import a file and convert it to a notebook.

    This is similar to the pattern for Python notebooks:

    A Python code file formatted as a Databricks notebook 2
  2. In Run and Debug view (View > Run), select Run on Databricks as Workflow from the drop-down list, and then click the green play arrow (Start Debugging) icon.

    Run File as Workflow on Databricks editor command 3

    Note

    If Run on Databricks as Workflow is not available, see Create a custom run configuration.

A new editor tab appears, titled Databricks Job Run. The notebook runs as a job in the workspace. The notebook and its output are displayed in the new editor tab’s Output area.

To view information about the job run, click the Task run ID link in the Databricks Job Run editor tab. Your workspace opens and the job run’s details are displayed in the workspace.

Advanced tasks

You can use the Databricks extension for Visual Studio Code to perform the following advanced tasks.

Run tests with pytest

You can run pytest on local code that does not need a connection to a cluster in a remote Databricks workspace. For example, you might use pytest to test your functions that accept and return PySpark DataFrames in local memory. To get started with pytest and run it locally, see Get Started in the pytest documentation.
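
For example, the following local-only sketch builds a small DataFrame in memory. It assumes pyspark is installed on your development machine; double_column is a hypothetical function under test, and the file name (for example, local_test.py) should match pytest’s default discovery patterns:

from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import col
import pytest

@pytest.fixture(scope="session")
def spark() -> SparkSession:
  # Run Spark locally on the development machine; no Databricks cluster needed.
  return SparkSession.builder.master("local[*]").getOrCreate()

def double_column(df: DataFrame) -> DataFrame:
  # Hypothetical function under test: adds a doubled copy of column "n".
  return df.withColumn("n2", col("n") * 2)

def test_double_column(spark):
  df = spark.createDataFrame([(1,), (2,)], ["n"])
  assert [row["n2"] for row in double_column(df).collect()] == [2, 4]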

To run pytest on code in a remote Databricks workspace, do the following in your Visual Studio Code project:

Step 1: Create the tests

Add a Python file with the following code, which contains your tests to run. This example assumes that the file is named spark_test.py and is at the root of your Visual Studio Code project. The file contains a pytest fixture, which makes the cluster’s SparkSession (the entry point to Spark functionality on the cluster) available to the tests, and a single test that checks whether the specified cell in the table contains the specified value. You can add your own tests to this file as needed.

from pyspark.sql import SparkSession
import pytest

@pytest.fixture
def spark() -> SparkSession:
  # Create a SparkSession (the entry point to Spark functionality) on
  # the cluster in the remote Databricks workspace. Unit tests do not
  # have access to this SparkSession by default.
  return SparkSession.builder.getOrCreate()

# Now add your unit tests.

# For example, here is a unit test that must be run on the
# cluster in the remote Databricks workspace.
# This example determines whether the specified cell in the
# specified table contains the specified value. For example,
# the third column in the first row should contain the word "Ideal":
#
# +----+-------+-------+-------+---------+-------+-------+-------+------+-------+------+
# |_c0 | carat | cut   | color | clarity | depth | table | price | x    | y     | z    |
# +----+-------+-------+-------+---------+-------+-------+-------+------+-------+------+
# | 1  | 0.23  | Ideal | E     | SI2     | 61.5  | 55    | 326   | 3.95 | 3.98  | 2.43 |
# +----+-------+-------+-------+---------+-------+-------+-------+------+-------+------+
# ...
#
def test_spark(spark):
  spark.sql('USE default')
  data = spark.sql('SELECT * FROM diamonds')
  assert data.collect()[0][2] == 'Ideal'

Step 2: Create the pytest runner

Add a Python file with the following code, which instructs pytest to run your tests from the previous step. This example assumes that the file is named pytest_databricks.py and is at the root of your Visual Studio Code project.

import pytest
import os
import sys

# Run all tests in the connected repository in the remote Databricks workspace.
# By default, pytest searches for tests in files whose names begin with
# "test_" or end with "_test.py". Within each of these files, pytest runs
# each function whose name begins with "test_".

# Get the path to the repository for this file in the workspace.
repo_root = os.path.dirname(os.path.realpath(__file__))
# Switch to the repository's root directory.
os.chdir(repo_root)

# Skip writing .pyc files to the bytecode cache on the cluster.
sys.dont_write_bytecode = True

# Now run pytest from the repository's root directory, using the
# arguments that are supplied by your custom run configuration in
# your Visual Studio Code project. In this case, the custom run
# configuration JSON must contain these unique "program" and
# "args" objects:
#
# ...
# {
#   ...
#   "program": "${workspaceFolder}/path/to/this/file/in/workspace",
#   "args": ["/path/to/_test.py-files"]
# }
# ...
#
retcode = pytest.main(sys.argv[1:])

Step 3: Create a custom run configuration

To instruct pytest to run your tests, you must create a custom run configuration. Use the existing Databricks cluster-based run configuration to create your own custom run configuration, as follows:

  1. On the main menu, click Run > Add Configuration.

  2. In the Command Palette, select Databricks.

    Visual Studio Code adds a .vscode/launch.json file to your project, if this file does not already exist.

  3. Change the starter run configuration as follows, and then save the file:

    • Change this run configuration’s name from Run on Databricks to some unique display name for this configuration, in this example Unit Tests (on Databricks).

    • Change program from ${file} to the path in the project that contains the test runner, in this example ${workspaceFolder}/pytest_databricks.py.

    • Change args from [] to the path in the project that contains the files with your tests, in this example ["."].

    Your launch.json file should look like this:

    {
      // Use IntelliSense to learn about possible attributes.
      // Hover to view descriptions of existing attributes.
      // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
      "version": "0.2.0",
      "configurations": [
        {
          "type": "databricks",
          "request": "launch",
          "name": "Unit Tests (on Databricks)",
          "program": "${workspaceFolder}/pytest_databricks.py",
          "args": ["."],
          "env": {}
        }
      ]
    }
    

Step 4: Run the tests

Make sure that pytest is installed on the cluster first. For example, with the cluster’s settings page open in your Databricks workspace, do the following:

  1. On the Libraries tab, if pytest is visible, then pytest is already installed. If pytest is not visible, click Install new.

  2. For Library Source, click PyPI.

  3. For Package, enter pytest.

  4. Click Install.

  5. Wait until Status changes from Pending to Installed.

To run the tests, do the following from your Visual Studio Code project:

  1. On the main menu, click View > Run.

  2. In the Run and Debug list, click Unit Tests (on Databricks), if it is not already selected.

  3. Click the green arrow (Start Debugging) icon.

The pytest results display in the Debug Console (View > Debug Console on the main menu). For example, these results show that at least one test was found in the spark_test.py file, and a dot (.) means that a single test was found and passed. (A failing test would show an F.)

<date>, <time> - Creating execution context on cluster <cluster-id> ...
<date>, <time> - Synchronizing code to /Repos/<someone@example.com>/<your-repository-name> ...
<date>, <time> - Running /pytest_databricks.py ...
============================= test session starts ==============================
platform linux -- Python <version>, pytest-<version>, pluggy-<version>
rootdir: /Workspace/Repos/<someone@example.com>/<your-repository-name>
collected 1 item

spark_test.py .                                                          [100%]

============================== 1 passed in 3.25s ===============================
<date>, <time> - Done (took 10818ms)

Change authentication settings

You can change the Databricks authentication settings that the extension uses, as follows.

With the extension and your code project opened, do the following:

  1. In the Configuration pane, click the gear (Configure workspace) icon.

    Gear icon to configure workspace settings 2
  2. Follow the steps in Set up authentication.

Change the cluster

You can change the cluster that the extension uses, as follows.

With the extension and your code project opened, and a Databricks configuration profile and cluster already set, select an existing Databricks cluster that you want to change to, or create a new Databricks cluster and change to it.

Change to an existing cluster

If you have an existing Databricks cluster that you want to change to, do one of the following:

  • In the Clusters pane, next to the cluster that you want to change to, click the plug (Attach cluster) icon.

    Attach cluster icon 3

    Tip

    If the cluster is not visible in the Clusters pane, click the filter (Filter clusters) icon to see All clusters, clusters that are Created by me, or Running clusters. Or, click the Refresh icon next to the filter icon.

    Filter clusters icon 3

    The extension replaces the cluster’s ID in your code project’s .databricks/project.json file, for example "clusterId": "1234-567890-abcd12e3".

    This procedure is complete.

  • In the Configuration pane, do the following:

    1. Next to Cluster, click the gear (Configure cluster) icon.

      Gear icon to configure workspace settings 3
    2. In the Command Palette, click the cluster that you want to change to.

    The extension replaces the cluster’s ID in your code project’s .databricks/project.json file, for example "clusterId": "1234-567890-abcd12e3".

    This procedure is complete.

Create a new cluster and change to it

If you want to create a new Databricks cluster and change to it, do the following:

  1. In the Configuration pane, next to Cluster, click the gear (Configure cluster) icon.

    Configure cluster icon 4
  2. In the Command Palette, click Create New Cluster.

  3. When prompted to open the external website (your Databricks workspace), click Open.

  4. If prompted, sign in to your Databricks workspace.

  5. Follow the instructions to create a cluster.

    Note

    Databricks recommends that you create a Personal Compute cluster. This enables you to start running workloads immediately, minimizing compute management overhead.

  6. After the cluster is created and is running, go back to Visual Studio Code.

  7. Do one of the following:

    • In the Clusters pane, next to the cluster that you want to change to, click the plug (Attach cluster) icon.

      Attach cluster icon 4

      Tip

      If the cluster is not visible, click the filter (Filter clusters) icon to see All clusters, clusters that are Created by me, or Running clusters. Or, click the Refresh icon.

      Filter clusters icon 4

      The extension replaces the cluster’s ID in your code project’s .databricks/project.json file, for example "clusterId": "1234-567890-abcd12e3".

      This procedure is complete.

    • In the Configuration pane, next to Cluster, click the gear (Configure cluster) icon.

      Configure cluster icon 5

      In the Command Palette, click the cluster that you want to change to.

      The extension replaces the cluster’s ID in your code project’s .databricks/project.json file, for example "clusterId": "1234-567890-abcd12e3".

Change the repository

You can change the repository that the extension uses, as follows.

With the extension and your code project opened, and a Databricks configuration profile already set, select an existing repository in Databricks Repos that you want to change to, or create a new repository in Databricks Repos and change to it.

Note

The Databricks extension for Visual Studio Code works only with repositories that it creates. You cannot use an existing repository in your workspace unless you used the Databricks extension for Visual Studio Code earlier to create that repository, and you now want to reuse that repository in your current Visual Studio Code project.

Change to an existing repo

If you have an existing repository in Databricks Repos that you want to change to, then do the following:

  1. In the Configuration pane, next to Sync Destination, click the gear (Configure sync destination) icon.

    Configure sync destination icon 3
  2. In the Command Palette, select the repository’s name from the list.

    The extension replaces the repository’s workspace path in your code project’s .databricks/project.json file, for example "workspacePath": "/Workspace/Repos/someone@example.com/my-repo.ide".

    Note

    If the remote repo’s name does not match your local code project’s name, a warning icon appears with this message: The remote repo name does not match the current Visual Studio Code workspace name. You can ignore this warning if you intend to synchronize your local code project with a repo in your remote Databricks workspace and the names of your local code project and remote repo do not match.

  3. After you set the repository, begin synchronizing with the repository by clicking the Start synchronization icon next to Sync Destination.

    Start synchronization icon 3

Warning

After you set the repository and then begin synchronizing, any existing files in your remote workspace repo that have the same filenames in your local code project will have their contents forcibly overwritten. This is because the Databricks extension for Visual Studio Code treats the files in your local code project as the “single source of truth” for both your local code project and its connected remote repo within your Databricks workspace.

Important

The Databricks extension for Visual Studio Code only performs one-way, automatic synchronization of file changes from your local Visual Studio Code project to the related repository in your remote Databricks workspace. Note that the reverse is not true: file changes in your remote Databricks workspace are not automatically synchronized to your local Visual Studio Code project. Therefore, Databricks does not recommend that you initiate file changes in your Databricks workspace. If you absolutely cannot avoid making such workspace-initiated changes, then you must take special action to have those changes show in your local Visual Studio Code project. For instructions, see Initiate repo changes from the workspace instead of from Visual Studio Code.

Create a new repo and change to it

If you want to create a new repository in Databricks Repos and change to it, do the following:

  1. In the Configuration pane, next to Sync Destination, click the gear (Configure sync destination) icon.

    Configure sync destination icon 4
  2. In the Command Palette, click Create New Sync Destination.

  3. Type a name for the new repository in Databricks Repos, and then press Enter.

    The extension appends the characters .ide to the end of the repo’s name and then replaces the repository’s workspace path in your code project’s .databricks/project.json file, for example "workspacePath": "/Workspace/Repos/someone@example.com/my-repo.ide".

    Note

    If the remote repo’s name does not match your local code project’s name, a warning icon appears with this message: The remote repo name does not match the current Visual Studio Code workspace name. You can ignore this warning if you intend to synchronize your local code project with a repo in your remote Databricks workspace and the names of your local code project and remote repo do not match.

  4. After you set the repository, begin synchronizing with the repository by clicking the Start synchronization icon next to Sync Destination.

    Start synchronization icon 4

Warning

After you set the repository and then begin synchronizing, any existing files in your remote workspace repo that have the same filenames in your local code project will have their contents forcibly overwritten. This is because the Databricks extension for Visual Studio Code treats the files in your local code project as the “single source of truth” for both your local code project and its connected remote repo within your Databricks workspace.

Important

The Databricks extension for Visual Studio Code only performs one-way, automatic synchronization of file changes from your local Visual Studio Code project to the related repository in your remote Databricks workspace. Note that the reverse is not true: file changes in your remote Databricks workspace are not automatically synchronized to your local Visual Studio Code project. Therefore, Databricks does not recommend that you initiate file changes in your Databricks workspace. If you absolutely cannot avoid making such workspace-initiated changes, then you must take special action to have those changes show in your local Visual Studio Code project. For instructions, see Initiate repo changes from the workspace instead of from Visual Studio Code.

Start or stop the cluster

You can start or stop the cluster that the extension uses, as follows.

With the extension and your code project opened, and a Databricks configuration profile and cluster already set, do one of the following:

  • To start the cluster, in the Configuration pane, next to Cluster, click the play (Start Cluster) icon.

    Start cluster icon
  • To stop the cluster, in the Configuration pane, next to Cluster, click the stop (Stop Cluster) icon.

    Stop cluster icon

Stop synchronizing the repository

You can stop synchronizing with the repository that the extension uses, as follows.

With the extension and your code project opened, and a Databricks configuration profile already set, in the Configuration pane (next to Sync Destination), click the Stop synchronization icon.

Stop synchronization icon

To restart a stopped synchronization, click the Start synchronization icon.

Start synchronization icon 5

Warning

After you set the repository and then begin synchronizing, any existing files in your remote workspace repo that have the same filenames in your local code project will have their contents forcibly overwritten. This is because the Databricks extension for Visual Studio Code treats the files in your local code project as the “single source of truth” for both your local code project and its connected remote repo within your Databricks workspace.

Important

The Databricks extension for Visual Studio Code only performs one-way, automatic synchronization of file changes from your local Visual Studio Code project to the related repository in your remote Databricks workspace. Note that the reverse is not true: file changes in your remote Databricks workspace are not automatically synchronized to your local Visual Studio Code project. Therefore, Databricks does not recommend that you initiate file changes in your Databricks workspace. If you absolutely cannot avoid making such workspace-initiated changes, then you must take special action to have those changes show in your local Visual Studio Code project. For instructions, see Initiate repo changes from the workspace instead of from Visual Studio Code.

Create a custom run configuration

You can create custom run configurations in Visual Studio Code to do things such as passing custom arguments to a job or a notebook, or creating different run settings for different files. For example, the following custom run configuration passes the --prod argument to the job:

{
  "version": "0.2.0",
  "configurations": [
    {
      "type": "databricks-workflow",
      "request": "launch",
      "name": "Run on Databricks as Workflow",
      "program": "${file}",
      "parameters": {},
      "args": ["--prod"],
      "preLaunchTask": "databricks: sync"
    }
  ]
}

To create a custom run configuration, click Run > Add Configuration from the main menu in Visual Studio Code. Then select either Databricks for a cluster-based run configuration or Databricks: Workflow for a job-based run configuration.

By using custom run configurations, you can also pass in command-line arguments and run your code just by pressing F5. For more information, see Launch configurations in the Visual Studio Code documentation.
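
In the launched Python file, anything in the configuration’s args array arrives as ordinary command-line arguments. A minimal sketch (the --prod flag is the illustrative argument from the configuration above, not a built-in extension option):

import sys

# Arguments from the run configuration's "args" array arrive in sys.argv.
is_prod = "--prod" in sys.argv[1:]
print("Running in", "production" if is_prod else "development", "mode")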

Uninstall the extension

You can uninstall the Databricks extension for Visual Studio Code if needed, as follows:

  1. In Visual Studio Code, click View > Extensions from the main menu.

  2. In the list of extensions, select the Databricks for Visual Studio Code entry.

  3. Click Uninstall.

  4. Click Reload required, or restart Visual Studio Code.

Troubleshooting

Error when synchronizing through a proxy

Issue: When you try to run the Databricks extension for Visual Studio Code to synchronize your local code project through a proxy, an error message similar to the following appears, and the synchronization operation is unsuccessful: Get "https://<workspace-instance>/api/2.0/preview/scim/v2/Me": EOF.

Possible cause: Visual Studio Code does not know how to find the proxy.

Recommended solution: Restart Visual Studio Code from your terminal by running the following command, and then try synchronizing again:

env HTTPS_PROXY=<proxy-url>:<port> code

In the preceding command:

  • Replace <proxy-url> with the full URL to your proxy.

  • Replace <port> with the correct port on your proxy.
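
For example, with a hypothetical proxy at proxy.example.com listening on port 8080:

env HTTPS_PROXY=https://proxy.example.com:8080 code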

Error: “spawn unknown system error -86” when you try to synchronize local code

Issue: When you try to synchronize local code in a project to a remote Databricks workspace, the Terminal shows that synchronization has started but displays only the error message spawn unknown system error -86. Also, the Sync Destination section of the Configuration pane remains in a pending state.

Possible cause: The wrong version of the Databricks extension for Visual Studio Code is installed for your development machine’s operating system.

Recommended solution: Uninstall the extension, and then follow the steps in Install and open the extension from the beginning to install the version that matches your development machine’s operating system.

Send usage logs to Databricks

If you have issues synchronizing local code to a remote Databricks workspace, you can send usage logs and related information to Databricks Support by doing the following:

  1. Turn on verbose mode for the bricks command-line interface (CLI) by checking the Bricks: Verbose Mode setting, or setting databricks.bricks.verboseMode to true, as described in Settings.

  2. Also turn on logging by checking the Logs: Enabled setting, or setting databricks.logs.enabled to true, as described in Settings. Be sure to restart Visual Studio Code after you turn on logging.

  3. Attempt to reproduce your issue.

  4. From the Command Palette (View > Command Palette from the main menu), run the Databricks: Open full logs command.

  5. Send the bricks-logs.json and sdk-and-extension-logs.json files that appear to Databricks Support.

  6. Also copy the contents of the Terminal (View > Terminal) in the context of the issue, and send this content to Databricks Support.

To send error logs that are not about code synchronization issues to Databricks Support:

  1. From the Command Palette (View > Command Palette), run the Databricks: Open full logs command.

  2. Send only the sdk-and-extension-logs.json file that appears to Databricks Support.

The Output view (View > Output, Databricks Logs) shows truncated information if Logs: Enabled is checked or databricks.logs.enabled is set to true. To show more information, change the following settings, as described in Settings:

  • Logs: Max Array Length or databricks.logs.maxArrayLength

  • Logs: Max Field Length or databricks.logs.maxFieldLength

  • Logs: Truncation Depth or databricks.logs.truncationDepth

Command Palette

The Databricks extension for Visual Studio Code adds the following commands to the Visual Studio Code Command Palette. See also Command Palette in the Visual Studio Code documentation.

  • Databricks: Configure autocomplete for Databricks globals
    Enables IntelliSense in the Visual Studio Code code editor for PySpark, Databricks Utilities, and related globals such as spark and dbutils. See Enable PySpark and Databricks Utilities code completion.

  • Databricks: Configure cluster
    Moves focus to the Command Palette to create, select, or change the Databricks cluster to use for the current project. See Set the cluster and Change the cluster.

  • Databricks: Configure sync destination
    Moves focus to the Command Palette to create, select, or change the repository in Databricks Repos to use for the current project. See Set the repository and Change the repository.

  • Databricks: Configure workspace
    Moves focus to the Command Palette to create, select, or change Databricks authentication details to use for the current project. See Set up authentication.

  • Databricks: Detach cluster
    Removes the reference to the Databricks cluster from the current project.

  • Databricks: Detach sync destination
    Removes the reference to the repository in Databricks Repos from the current project.

  • Databricks: Focus on Clusters View
    Moves focus in the Databricks view to the Clusters pane.

  • Databricks: Focus on Configuration View
    Moves focus in the Databricks view to the Configuration pane.

  • Databricks: Logout
    Resets the Databricks view to show the Configure Databricks and Show Quickstart buttons in the Configuration pane. Any content in the current project’s .databricks/project.json file is also reset. See Configure the extension.

  • Databricks: Open Databricks configuration file
    Opens the Databricks configuration profiles file, from the default location, for the current project. See Set up authentication.

  • Databricks: Open full logs
    Opens the folder that contains the application log files that the Databricks extension for Visual Studio Code writes to your development machine.

  • Databricks: Show Quickstart
    Shows the Quickstart file in the editor.

  • Databricks: Start synchronization
    Starts synchronizing the current project’s code to the Databricks workspace. This command performs an incremental synchronization.

  • Databricks: Start synchronization (full sync)
    Starts synchronizing the current project’s code to the Databricks workspace. This command performs a full synchronization, even if an incremental sync is possible.

  • Databricks: Stop synchronization
    Stops synchronizing the current project’s code to the Databricks workspace. See Stop synchronizing the repository.

Settings

The Databricks extension for Visual Studio Code adds the following settings to Visual Studio Code. See also User and Workspace Settings in the Visual Studio Code documentation.

Each entry shows the setting as it appears in the UI (Extensions > Databricks), followed by its settings.json key.

  • Bricks: Verbose Mode (databricks.bricks.verboseMode)
    Checked or set to true to enable verbose logging for the bricks command-line interface (CLI) when it synchronizes local code with code in your remote workspace. The default is unchecked or false (do not enable verbose logging for the bricks CLI).

  • Clusters: Only Show Accessible Clusters (databricks.clusters.onlyShowAccessibleClusters)
    Checked or set to true to enable filtering for only those clusters that you can run code on. The default is unchecked or false (do not enable filtering for those clusters).

  • Logs: Enabled (databricks.logs.enabled)
    Checked or set to true (default) to enable logging. Reload your window for any change to take effect.

  • Logs: Max Array Length (databricks.logs.maxArrayLength)
    The maximum number of items to show for array fields. The default is 2.

  • Logs: Max Field Length (databricks.logs.maxFieldLength)
    The maximum length of each field displayed in the logs output panel. The default is 40.

  • Logs: Truncation Depth (databricks.logs.truncationDepth)
    The maximum depth of logs to show without truncation. The default is 2.

  • Override Databricks Config File (databricks.overrideDatabricksConfigFile)
    An alternate location for the .databrickscfg file that the extension uses for authentication.

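For example, to set several of these options in a settings.json file (the values shown are illustrative):

{
  "databricks.bricks.verboseMode": true,
  "databricks.logs.enabled": true,
  "databricks.logs.maxArrayLength": 4,
  "databricks.logs.maxFieldLength": 80,
  "databricks.logs.truncationDepth": 3
}
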
Frequently asked questions (FAQs)

Do you have support for, or a timeline for support for, any of the following capabilities?

  • Step-through debugging

  • Other languages, such as Scala or SQL

  • Delta Live Tables

  • Databricks SQL warehouses

  • Other IDEs, such as PyCharm

  • Additional libraries

  • Full CI/CD integration

  • Authentication schemes in addition to Databricks personal access tokens

Databricks is aware of these requests and is prioritizing work to enable simple scenarios for local development and remote running of code. Please forward additional requests and scenarios to your Databricks representative. Databricks will incorporate your input into future planning.

How does the Databricks Terraform provider relate to the Databricks extension for Visual Studio Code?

Databricks continues to recommend the Databricks Terraform provider for managing your CI/CD pipelines in a predictable way. Please let your Databricks representative know how you might use an IDE to manage your deployments in the future. Databricks will incorporate your input into future planning.

How does Databricks Connect relate to the Databricks extension for Visual Studio Code?

Databricks Connect users will likely keep using it for the foreseeable future. Databricks is thinking about how the Databricks extension for Visual Studio Code might provide functionality in the future that is similar to Databricks Connect.

How does dbx by Databricks Labs relate to the Databricks extension for Visual Studio Code?

The main features of dbx by Databricks Labs include:

  • Project scaffolding.

  • Limited local development through the dbx execute command.

  • CI/CD for Databricks jobs.

The Databricks extension for Visual Studio Code enables local development and remotely running Python code files on Databricks clusters, and remotely running Python code files and notebooks in Databricks jobs. dbx can continue to be used for project scaffolding and CI/CD for Databricks jobs.

What happens if I already have an existing Databricks configuration profile that I created through the Databricks CLI?

You can select your existing configuration profile when you configure the Databricks extension for Visual Studio Code. With the extension and your code project opened, do the following:

  1. In the Configuration pane, click the gear (Configure workspace) icon.

    Gear icon to configure workspace settings 4
  2. Enter your workspace instance URL, for example https://dbc-a1b2345c-d6e7.cloud.databricks.com.

  3. In the Command Palette, select your existing configuration profile.

Which permissions do I need for a Databricks workspace to use the Databricks extension for Visual Studio Code?

You must have execute permissions for a Databricks cluster for running code, as well as permissions to create a repository in Databricks Repos.

Which settings must be enabled for a Databricks workspace to use the Databricks extension for Visual Studio Code?

The workspace must have the Files in Repos setting turned on. For instructions, see Configure support for Files in Repos. If you cannot turn on this setting yourself, contact your Databricks workspace administrator.