Databricks extension for Visual Studio Code
Preview
This feature is in Public Preview.
The Databricks extension for Visual Studio Code enables you to connect to your remote Databricks workspaces from the Visual Studio Code integrated development environment (IDE) running on your local development machine. Through these connections, you can:
Synchronize local code that you develop in Visual Studio Code with code in your remote workspaces.
Run local Python code files from Visual Studio Code on Databricks clusters in your remote workspaces.
Run local Python code files (.py) and Python, R, Scala, and SQL notebooks (.py, .ipynb, .r, .scala, and .sql) from Visual Studio Code as automated Databricks jobs in your remote workspaces.
Note
The Databricks extension for Visual Studio Code supports running R, Scala, and SQL notebooks as automated jobs but does not provide any deeper support for these languages within Visual Studio Code.
Before you begin
Before you can use the Databricks extension for Visual Studio Code, your Databricks workspace and your local development machine must meet the following requirements. You must also have an access token to authenticate with Databricks.
Workspace requirements
You must have at least one Databricks workspace available, and the workspace must meet the following requirements:
The workspace must contain at least one Databricks cluster. If you do not have a cluster available, you can create a cluster now or after you install the Databricks extension for Visual Studio Code.
Note
Databricks recommends that you create a Personal Compute cluster. This enables you to start running workloads immediately, minimizing compute management overhead.
Databricks SQL warehouses are not supported by this extension.
You must enable Files in Repos for the workspace.
The Databricks extension for Visual Studio Code relies on Databricks Repos in your workspace. Databricks recommends creating one repository for each combination of project and user. After you install the Databricks extension for Visual Studio Code, you can use it to create a local workspace repo; see Create a new repo.
Note
The Databricks extension for Visual Studio Code works only with repos that it creates. You cannot use existing repos in your workspace unless they were created earlier with the extension itself.
Access token
You must have a Databricks personal access token. If you do not have one available, you can generate a personal access token now.
Local development machine requirements
You must have the following on your local development machine:
Visual Studio Code version 1.69.1 or higher. To view your installed version, click Code > About Visual Studio Code from the main menu on macOS, or Help > About on Linux or Windows. To download, install, and configure Visual Studio Code, see Setting up Visual Studio Code.
Visual Studio Code must be configured for Python coding, including availability of a Python interpreter. For details, see Getting Started with Python in VS Code.
A Databricks configuration profile that references your Databricks personal access token. If you do not have one available, you can create a configuration profile after you install the Databricks extension for Visual Studio Code.
The Databricks extension for Visual Studio Code. For setup instructions, see the next section.
Getting started
Before you can use the Databricks extension for Visual Studio Code, you must download, install, open, and configure the extension, as follows.
Install and open the extension
In Visual Studio Code, open the Extensions view (View > Extensions from the main menu).
In Search Extensions in Marketplace, enter Databricks.
Click the Databricks entry.
Note
There are several entries with Databricks in their titles. Be sure to click the one with only Databricks in its title and a blue check mark icon next to Databricks.
Click Install.
Restart Visual Studio Code.
Open the extension: on the sidebar, click the Databricks icon.
Configure the extension
To use the extension, you must set a Databricks configuration profile for Databricks authentication. You must also set the cluster and the repository.
Set up authentication
With the extension opened, do the following:
Open your code project’s folder in Visual Studio Code (File > Open Folder). If you do not have a code project then use PowerShell, your terminal for Linux or macOS, or Command Prompt for Windows, to create a folder, switch to the new folder, and then open Visual Studio Code from that folder. For example:
For Linux or macOS:
mkdir databricks-demo
cd databricks-demo
code .
For Windows:
md databricks-demo
cd databricks-demo
code .
Tip
If you get the error command not found: code, see Launching from the command line in the Visual Studio Code documentation.
In the Configuration pane, click Configure Databricks.
Note
If Configure Databricks is not visible, click the gear (Configure workspace) icon next to Configuration instead.
In the Command Palette, for Databricks Host, enter your workspace instance URL, for example https://dbc-a1b2345c-d6e7.cloud.databricks.com. Then press Enter.
Do one of the following:
If the Databricks extension for Visual Studio Code detects an existing matching Databricks configuration profile for the URL, you can select it in the list.
Click Edit Databricks profiles to open your Databricks configuration profiles file and create a configuration profile manually.
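If you create a configuration profile manually, a minimal entry in your Databricks configuration profiles file (by default, .databrickscfg in your home directory) looks roughly like the following sketch; the profile name and the token value are placeholders:
[DEFAULT]
host  = https://dbc-a1b2345c-d6e7.cloud.databricks.com
token = <your-personal-access-token>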
If it does not already exist, the extension creates a hidden folder in your project named .databricks. In this folder, the extension also creates a file named project.json, if it does not already exist. This file contains the URL that you entered, along with some Databricks authentication details that the Databricks extension for Visual Studio Code needs to operate.
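For reference, after you also set a cluster and a sync destination in the later steps, the project.json file might look roughly like the following sketch. The clusterId and workspacePath entries are the ones described later in this article; the other field names and all values shown here are illustrative placeholders:
{
  "host": "https://dbc-a1b2345c-d6e7.cloud.databricks.com/",
  "clusterId": "1234-567890-abcd12e3",
  "workspacePath": "/Workspace/Repos/someone@example.com/my-repo.ide"
}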
Set the cluster
With the extension and your code project opened, and a Databricks configuration profile already set, select an existing Databricks cluster that you want to use, or create a new Databricks cluster and use it.
Use an existing cluster
If you have an existing Databricks cluster that you want to use, do one of the following:
In the Clusters pane, do the following:
Next to the cluster that you want to use, click the plug (Attach cluster) icon.
Tip
If the cluster is not visible in the Clusters pane, click the filter (Filter clusters) icon to see All clusters, clusters that are Created by me, or Running clusters. Or, click the Refresh icon next to the filter icon.
The extension adds the cluster’s ID to your code project’s .databricks/project.json file, for example "clusterId": "1234-567890-abcd12e3".
This procedure is complete.
In the Configuration pane, do the following:
Next to Cluster, click the gear (Configure cluster) icon.
In the Command Palette, click the cluster that you want to use.
The extension adds the cluster’s ID to your code project’s .databricks/project.json file, for example "clusterId": "1234-567890-abcd12e3".
This procedure is complete.
Create a new cluster
If you do not have an existing Databricks cluster, or you want to create a new one and use it, do the following:
In the Configuration pane, next to Cluster, click the gear (Configure cluster) icon.
In the Command Palette, click Create New Cluster.
When prompted to open the external website (your Databricks workspace), click Open.
If prompted, sign in to your Databricks workspace.
Follow the instructions to create a cluster.
Note
Databricks recommends that you create a Personal Compute cluster. This enables you to start running workloads immediately, minimizing compute management overhead.
After the cluster is created and is running, go back to Visual Studio Code.
Do one of the following:
In the Clusters pane, next to the cluster that you want to use, click the plug (Attach cluster) icon.
Tip
If the cluster is not visible, click the filter (Filter clusters) icon to see All clusters, clusters that are Created by me, or Running clusters. Or, click the Refresh icon.
The extension adds the cluster’s ID to the code project’s .databricks/project.json file, for example "clusterId": "1234-567890-abcd12e3".
This procedure is complete.
In the Configuration pane, next to Cluster, click the gear (Configure cluster) icon.
In the Command Palette, click the cluster that you want to use.
The extension adds the cluster’s ID to the code project’s .databricks/project.json file, for example "clusterId": "1234-567890-abcd12e3".
Set the repository
With the extension and your code project opened, and a Databricks configuration profile already set, use the extension to create a new repository in Databricks Repos and use it, or select an existing repository in Databricks Repos that you created earlier with the extension and want to reuse instead.
Note
The Databricks extension for Visual Studio Code works only with repositories that it creates. You cannot use an existing repository in your workspace unless you used the Databricks extension for Visual Studio Code earlier to create that repository, and you now want to reuse that repository in your current Visual Studio Code project.
Create a new repo
To create a new repository, do the following:
In the Configuration pane, next to Sync Destination, click the gear (Configure sync destination) icon.
In the Command Palette, click Create New Sync Destination.
Type a name for the new repository in Databricks Repos, and then press Enter.
The extension appends the characters .ide to the end of the repo’s name and then adds the repo’s workspace path to the code project’s .databricks/project.json file, for example "workspacePath": "/Workspace/Repos/someone@example.com/my-repo.ide".
Note
If the remote repo’s name does not match your local code project’s name, a warning icon appears with this message: The remote repo name does not match the current Visual Studio Code workspace name. You can ignore this warning if you intend to synchronize your local code project with a repo in your remote Databricks workspace and the names of your local code project and remote repo do not match.
After you set the repository, begin synchronizing with the repository by clicking the Start synchronization icon next to Sync Destination.
Warning
After you set the repository and then begin synchronizing, any existing files in your remote workspace repo that have the same filenames in your local code project will have their contents forcibly overwritten. This is because the Databricks extension for Visual Studio Code treats the files in your local code project as the “single source of truth” for both your local code project and its connected remote repo within your Databricks workspace.
Important
The Databricks extension for Visual Studio Code only performs one-way, automatic synchronization of file changes from your local Visual Studio Code project to the related repository in your remote Databricks workspace. Note that the reverse is not true: file changes in your remote Databricks workspace are not automatically synchronized to your local Visual Studio Code project. Therefore, Databricks does not recommend that you initiate file changes in your Databricks workspace. If you absolutely cannot avoid making such workspace-initiated changes, then you must take special action to have those changes show in your local Visual Studio Code project. For instructions, see Initiate repo changes from the workspace instead of from Visual Studio Code.
Reuse an existing repo
If you have an existing repository in Databricks Repos that you created earlier with the Databricks extension for Visual Studio Code and want to reuse in your current Visual Studio Code project, then do the following:
In the Configuration pane, next to Sync Destination, click the gear (Configure sync destination) icon.
In the Command Palette, select the repository’s name from the list.
The extension adds the repo’s workspace path to the code project’s .databricks/project.json file, for example "workspacePath": "/Workspace/Repos/someone@example.com/my-repo.ide".
Note
If the remote repo’s name does not match your local code project’s name, a warning icon appears with this message: The remote repo name does not match the current Visual Studio Code workspace name. You can ignore this warning if you intend to synchronize your local code project with a repo in your remote Databricks workspace and the names of your local code project and remote repo do not match.
After you set the repository, begin synchronizing with the repository by clicking the Start synchronization icon next to Sync Destination.
Warning
After you set the repository and then begin synchronizing, any existing files in your remote workspace repo that have the same filenames in your local code project will have their contents forcibly overwritten. This is because the Databricks extension for Visual Studio Code treats the files in your local code project as the “single source of truth” for both your local code project and its connected remote repo within your Databricks workspace.
Important
The Databricks extension for Visual Studio Code only performs one-way, automatic synchronization of file changes from your local Visual Studio Code project to the related repository in your remote Databricks workspace. Note that the reverse is not true: file changes in your remote Databricks workspace are not automatically synchronized to your local Visual Studio Code project. Therefore, Databricks does not recommend that you initiate file changes in your Databricks workspace. If you absolutely cannot avoid making such workspace-initiated changes, then you must take special action to have those changes show in your local Visual Studio Code project. For instructions, see Initiate repo changes from the workspace instead of from Visual Studio Code.
Initiate repo changes from the workspace instead of from Visual Studio Code
The Databricks extension for Visual Studio Code only performs one-way, automatic synchronization of file changes from your local Visual Studio Code project to the related repository in your remote Databricks workspace. Note that the reverse is not true: file changes in your remote Databricks workspace are not automatically synchronized to your local Visual Studio Code project. Therefore, Databricks does not recommend that you initiate file changes in your Databricks workspace. If you absolutely cannot avoid making such workspace-initiated changes, then you must also do the following to have those changes show in your local Visual Studio Code project:
Create a new, empty repository with a supported Git provider. This new, empty repository must have no prior commits. To learn how to create this repository, see your Git provider’s documentation.
Install Git on your local development machine, if you have not done so already.
In Visual Studio Code, open the Command Palette (View > Command Palette), type clone, and then select Git: Clone.
In the Command Palette, for Provide repository URL or pick a repository source, enter the repository’s clone URL as specified by your Git provider, and then press Enter.
Select a parent folder on your local development machine in which to clone the repository’s contents (for example, the root folder of your local user’s home directory), and then click Select as Repository Destination.
When prompted to open the cloned repository, click Open.
From your new, empty code project folder that was just opened, use the Databricks extension for Visual Studio Code to Set up authentication and then Set the cluster.
In the root of your new, empty code project folder, create a .gitignore file, add a .databricks/ entry to this file, and then save the file. This prevents the hidden .databricks/ folder and its contents, which the Databricks extension for Visual Studio Code generates, from accidentally being checked into source control.
Use the Databricks extension for Visual Studio Code to Create a new repo in your remote Databricks workspace and then connect to it.
Switch the existing connection for the new repo in your remote Databricks workspace to the new, empty repository with your Git provider, as follows:
First, configure your Databricks workspace with your Git provider credentials, by following the instructions in Add Git credentials to Databricks.
In the Databricks extension for Visual Studio Code in the Configuration pane, next to Sync Destination, click the linked chain (Open link externally) icon.
When prompted to open the external website, click Open.
If prompted, follow the on-screen instructions to sign in to your Databricks workspace.
In the workspace’s Repos pane, click the drop-down arrow next to the new repo’s name, and then click Git.
On the Settings tab, for Git repository URL, replace the existing value of https://github.com/databricks/databricks-empty-ide-project.git with the clone URL for the new, empty repository with your Git provider.
For Git provider, select the name of your Git provider.
Click Save.
Create or copy over any files that you want to work on into your new, empty code project folder in Visual Studio Code. Do not create or copy any files yet into the repo in your Databricks workspace nor into the repo with your Git provider.
In the Databricks extension for Visual Studio Code, in the Configuration pane, next to Sync Destination, click the arrowed circle (Start synchronization) icon. The extension copies the files from your code project folder in Visual Studio Code into the new repo in your Databricks workspace.
As you continue to make any changes to your code project folder in Visual Studio Code, these changes are automatically synchronized to the new repo in your remote Databricks workspace.
If you make any changes to the new repo in your remote Databricks workspace, commit and push those changes to the connected repository with your Git provider.
Pull the changes from the repository with your Git provider into your local Visual Studio Code project. For instructions, see Working with GitHub in VS Code or your Git provider’s documentation.
Development tasks
After you configure the Databricks extension for Visual Studio Code, you can use the extension to run a local Python file on a cluster in a remote Databricks workspace, or run a local Python file or local Python, R, Scala, or SQL notebook as a job in a remote workspace, as follows.
If you do not have a local file or notebook available to test the Databricks extension for Visual Studio Code with, here is some basic code that you can add to your project. The following listings show, in order, a Python code file, a Python notebook, an R notebook, a Scala notebook, and a SQL notebook:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
spark = SparkSession.builder.getOrCreate()
schema = StructType([
StructField('CustomerID', IntegerType(), False),
StructField('FirstName', StringType(), False),
StructField('LastName', StringType(), False)
])
data = [
[ 1000, 'Mathijs', 'Oosterhout-Rijntjes' ],
[ 1001, 'Joost', 'van Brunswijk' ],
[ 1002, 'Stan', 'Bokenkamp' ]
]
customers = spark.createDataFrame(data, schema)
customers.show()
# Output:
#
# +----------+---------+-------------------+
# |CustomerID|FirstName| LastName|
# +----------+---------+-------------------+
# | 1000| Mathijs|Oosterhout-Rijntjes|
# | 1001| Joost| van Brunswijk|
# | 1002| Stan| Bokenkamp|
# +----------+---------+-------------------+
# Databricks notebook source
from pyspark.sql.types import *
schema = StructType([
StructField('CustomerID', IntegerType(), False),
StructField('FirstName', StringType(), False),
StructField('LastName', StringType(), False)
])
data = [
[ 1000, 'Mathijs', 'Oosterhout-Rijntjes' ],
[ 1001, 'Joost', 'van Brunswijk' ],
[ 1002, 'Stan', 'Bokenkamp' ]
]
customers = spark.createDataFrame(data, schema)
customers.show()
# Output:
#
# +----------+---------+-------------------+
# |CustomerID|FirstName| LastName|
# +----------+---------+-------------------+
# | 1000| Mathijs|Oosterhout-Rijntjes|
# | 1001| Joost| van Brunswijk|
# | 1002| Stan| Bokenkamp|
# +----------+---------+-------------------+
# Databricks notebook source
library(SparkR)
sparkR.session()
data <- list(
list(1000L, "Mathijs", "Oosterhout-Rijntjes"),
list(1001L, "Joost", "van Brunswijk"),
list(1002L, "Stan", "Bokenkamp")
)
schema <- structType(
structField("CustomerID", "integer"),
structField("FirstName", "string"),
structField("LastName", "string")
)
df <- createDataFrame(
data = data,
schema = schema
)
showDF(df)
# Output:
#
# +----------+---------+-------------------+
# |CustomerID|FirstName| LastName|
# +----------+---------+-------------------+
# | 1000| Mathijs|Oosterhout-Rijntjes|
# | 1001| Joost| van Brunswijk|
# | 1002| Stan| Bokenkamp|
# +----------+---------+-------------------+
// Databricks notebook source
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val schema = StructType(Array(
StructField("CustomerID", IntegerType, false),
StructField("FirstName", StringType, false),
StructField("LastName", StringType, false)
))
val data = List(
Row(1000, "Mathijs", "Oosterhout-Rijntjes"),
Row(1001, "Joost", "van Brunswijk"),
Row(1002, "Stan", "Bokenkamp"),
)
val rdd = spark.sparkContext.makeRDD(data)
val customers = spark.createDataFrame(rdd, schema)
display(customers)
// Output:
//
// +----------+---------+-------------------+
// |CustomerID|FirstName| LastName|
// +----------+---------+-------------------+
// | 1000| Mathijs|Oosterhout-Rijntjes|
// | 1001| Joost| van Brunswijk|
// | 1002| Stan| Bokenkamp|
// +----------+---------+-------------------+
-- Databricks notebook source
CREATE TABLE IF NOT EXISTS zzz_customers(
CustomerID INT,
FirstName STRING,
LastName STRING
);
-- COMMAND ----------
INSERT INTO zzz_customers VALUES
(1000, "Mathijs", "Oosterhout-Rijntjes"),
(1001, "Joost", "van Brunswijk"),
(1002, "Stan", "Bokenkamp");
-- COMMAND ----------
SELECT * FROM zzz_customers;
-- Output:
--
-- +----------+---------+-------------------+
-- |CustomerID|FirstName| LastName|
-- +----------+---------+-------------------+
-- | 1000| Mathijs|Oosterhout-Rijntjes|
-- | 1001| Joost| van Brunswijk|
-- | 1002| Stan| Bokenkamp|
-- +----------+---------+-------------------+
-- COMMAND ----------
DROP TABLE zzz_customers;
Enable PySpark and Databricks Utilities code completion
To enable IntelliSense (also known as code completion) in the Visual Studio Code code editor for PySpark, Databricks Utilities, and related globals such as spark and dbutils, do the following with your code project opened:
On the Command Palette (View > Command Palette), type Databricks: Configure autocomplete for Databricks globals and press Enter.
Follow the on-screen prompts to allow the Databricks extension for Visual Studio Code to install PySpark for your project, and to add or modify the __builtins__.pyi file for your project to enable Databricks Utilities.
You can now use globals such as spark and dbutils in your code without declaring any related import statements beforehand.
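For example, a file that you run on Databricks with the extension can reference these globals directly; a minimal sketch:
# 'spark' and 'dbutils' are provided by Databricks at run time; no imports are needed.
df = spark.range(5)   # a small DataFrame with a single 'id' column
df.show()

# List the contents of the DBFS root.
print(dbutils.fs.ls("/"))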
Run a Python file on a cluster
With the extension and your code project opened, and a Databricks configuration profile, cluster, and repo already set, do the following:
In your code project, open the Python file that you want to run on the cluster.
Do one of the following:
In Explorer view (View > Explorer), right-click the file, and then select Run File on Databricks from the context menu.
In the file editor’s title bar, click the drop-down arrow next to the play (Run or Debug) icon. Then in the drop-down list, click Run File on Databricks.
The file runs on the cluster, and any output is printed to the Debug Console (View > Debug Console).
Run a Python file as a job
With the extension and your code project opened, and a Databricks configuration profile, cluster, and repo already set, do the following:
In your code project, open the Python file that you want to run as a job.
Do one of the following:
In Explorer view (View > Explorer), right-click the file, and then select Run File as Workflow on Databricks from the context menu.
In the file editor’s title bar, click the drop-down arrow next to the play (Run or Debug) icon. Then in the drop-down list, click Run File as Workflow on Databricks.
A new editor tab appears, titled Databricks Job Run. The file runs as a job in the workspace, and any output is printed to the new editor tab’s Output area.
To view information about the job run, click the Task run ID link in the new Databricks Job Run editor tab. Your workspace opens and the job run’s details are displayed in the workspace.
Run a Python notebook as a job
With the extension and your code project opened, and a Databricks configuration profile, cluster, and repo already set, do the following:
In your code project, open the Python notebook that you want to run as a job.
Tip
To create a Python notebook file in Visual Studio Code, begin by clicking File > New File, select Python File, and save the new file with a .py file extension.
To turn the .py file into a Databricks notebook, add the special comment # Databricks notebook source to the beginning of the file, and add the special comment # COMMAND ---------- before each cell. For more information, see Import a file and convert it to a notebook.
Do one of the following:
In Explorer view (View > Explorer), right-click the notebook file, and then select Run File as Workflow on Databricks from the context menu.
In the notebook file editor’s title bar, click the drop-down arrow next to the play (Run or Debug) icon. Then in the drop-down list, click Run File as Workflow on Databricks.
A new editor tab appears, titled Databricks Job Run. The notebook runs as a job in the workspace, and the notebook and its output are displayed in the new editor tab’s Output area.
To view information about the job run, click the Task run ID link in the Databricks Job Run editor tab. Your workspace opens and the job run’s details are displayed in the workspace.
Run an R, Scala, or SQL notebook as a job
With the extension and your code project opened, and a Databricks configuration profile, cluster, and repo already set, do the following:
In your code project, open the R, Scala, or SQL notebook that you want to run as a job.
Tip
To create an R, Scala, or SQL notebook file in Visual Studio Code, begin by clicking File > New File, select Python File, and save the new file with a .r, .scala, or .sql file extension, respectively.
To turn the .r, .scala, or .sql file into a Databricks notebook, add the special comment Databricks notebook source to the beginning of the file, and add the special comment COMMAND ---------- before each cell. Be sure to use the correct comment marker for each language (# for R, // for Scala, and -- for SQL). For more information, see Import a file and convert it to a notebook. This pattern is similar to the one for Python notebooks.
In Run and Debug view (View > Run), select Run on Databricks as Workflow from the drop-down list, and then click the green play arrow (Start Debugging) icon.
Note
If Run on Databricks as Workflow is not available, see Create a custom run configuration.
A new editor tab appears, titled Databricks Job Run. The notebook runs as a job in the workspace. The notebook and its output are displayed in the new editor tab’s Output area.
To view information about the job run, click the Task run ID link in the Databricks Job Run editor tab. Your workspace opens and the job run’s details are displayed in the workspace.
Advanced tasks
You can use the Databricks extension for Visual Studio Code to perform the following advanced tasks.
Run tests with pytest
You can run pytest on local code that does not need a connection to a cluster in a remote Databricks workspace. For example, you might use pytest to test your functions that accept and return PySpark DataFrames in local memory. To get started with pytest and run it locally, see Get Started in the pytest documentation.
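For example, the following sketch runs entirely on your local machine with pytest, assuming that PySpark is installed locally; the file and function names are hypothetical:
# test_filtering_local.py -- runs locally with pytest; no Databricks cluster is involved.
from pyspark.sql import DataFrame, SparkSession

def keep_adults(df: DataFrame) -> DataFrame:
    # Hypothetical function under test: keep rows where age is 18 or older.
    return df.filter(df.age >= 18)

def test_keep_adults():
    # Create a local SparkSession instead of connecting to a cluster.
    spark = (SparkSession.builder
             .master("local[1]")
             .appName("local-tests")
             .getOrCreate())
    df = spark.createDataFrame([(1, 17), (2, 18), (3, 42)], ["id", "age"])
    result = keep_adults(df).collect()
    assert [row.id for row in result] == [2, 3]
    spark.stop()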
To run pytest on code in a remote Databricks workspace, do the following in your Visual Studio Code project:
Step 1: Create the tests
Add a Python file with the following code, which contains your tests to run. This example assumes that this file is named spark_test.py and is at the root of your Visual Studio Code project. This file contains a pytest fixture, which makes the cluster’s SparkSession (the entry point to Spark functionality on the cluster) available to the tests. This file contains a single test that checks whether the specified cell in the table contains the specified value. You can add your own tests to this file as needed.
from pyspark.sql import SparkSession
import pytest
@pytest.fixture
def spark() -> SparkSession:
# Create a SparkSession (the entry point to Spark functionality) on
# the cluster in the remote Databricks workspace. Unit tests do not
# have access to this SparkSession by default.
return SparkSession.builder.getOrCreate()
# Now add your unit tests.
# For example, here is a unit test that must be run on the
# cluster in the remote Databricks workspace.
# This example determines whether the specified cell in the
# specified table contains the specified value. For example,
# the third column in the first row should contain the word "Ideal":
#
# +----+-------+-------+-------+---------+-------+-------+-------+------+-------+------+
# |_c0 | carat | cut | color | clarity | depth | table | price | x | y | z |
# +----+-------+-------+-------+---------+-------+-------+-------+------+-------+------+
# | 1 | 0.23 | Ideal | E | SI2 | 61.5 | 55 | 326 | 3.95 | 3.98 | 2.43 |
# +----+-------+-------+-------+---------+-------+-------+-------+------+-------+------+
# ...
#
def test_spark(spark):
spark.sql('USE default')
data = spark.sql('SELECT * FROM diamonds')
assert data.collect()[0][2] == 'Ideal'
Step 2: Create the pytest runner
Add a Python file with the following code, which instructs pytest to run your tests from the previous step. This example assumes that the file is named pytest_databricks.py and is at the root of your Visual Studio Code project.
import pytest
import os
import sys
# Run all tests in the connected repository in the remote Databricks workspace.
# By default, pytest searches through all files with filenames ending with
# "_test.py" for tests. Within each of these files, pytest runs each function
# with a function name beginning with "test_".
# Get the path to the repository for this file in the workspace.
repo_root = os.path.dirname(os.path.realpath(__file__))
# Switch to the repository's root directory.
os.chdir(repo_root)
# Skip writing .pyc files to the bytecode cache on the cluster.
sys.dont_write_bytecode = True
# Now run pytest from the repository's root directory, using the
# arguments that are supplied by your custom run configuration in
# your Visual Studio Code project. In this case, the custom run
# configuration JSON must contain these unique "program" and
# "args" objects:
#
# ...
# {
# ...
# "program": "${workspaceFolder}/path/to/this/file/in/workspace",
# "args": ["/path/to/_test.py-files"]
# }
# ...
#
retcode = pytest.main(sys.argv[1:])
Step 3: Create a custom run configuration
To instruct pytest to run your tests, you must create a custom run configuration. Use the existing Databricks cluster-based run configuration to create your own custom run configuration, as follows:
On the main menu, click Run > Add configuration.
In the Command Palette, select Databricks.
Visual Studio Code adds a .vscode/launch.json file to your project, if this file does not already exist.
Change the starter run configuration as follows, and then save the file:
Change this run configuration’s name from Run on Databricks to some unique display name for this configuration, in this example Unit Tests (on Databricks).
Change program from ${file} to the path in the project that contains the test runner, in this example ${workspaceFolder}/pytest_databricks.py.
Change args from [] to the path in the project that contains the files with your tests, in this example ["."].
Your launch.json file should look like this:
{
  // Use IntelliSense to learn about possible attributes.
  // Hover to view descriptions of existing attributes.
  // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
  "version": "0.2.0",
  "configurations": [
    {
      "type": "databricks",
      "request": "launch",
      "name": "Unit Tests (on Databricks)",
      "program": "${workspaceFolder}/pytest_databricks.py",
      "args": ["."],
      "env": {}
    }
  ]
}
Step 4: Run the tests
Make sure that pytest is already installed on the cluster first. For example, with the cluster’s settings page open in your Databricks workspace, do the following:
On the Libraries tab, if pytest is visible, then pytest is already installed. If pytest is not visible, click Install new.
For Library Source, click PyPI.
For Package, enter pytest.
Click Install.
Wait until Status changes from Pending to Installed.
To run the tests, do the following from your Visual Studio Code project:
On the main menu, click View > Run.
In the Run and Debug list, click Unit Tests (on Databricks), if it is not already selected.
Click the green arrow (Start Debugging) icon.
The pytest results display in the Debug Console (View > Debug Console on the main menu). For example, these results show that at least one test was found in the spark_test.py file, and a dot (.) means that a single test was found and passed. (A failing test would show an F.)
<date>, <time> - Creating execution context on cluster <cluster-id> ...
<date>, <time> - Synchronizing code to /Repos/<someone@example.com>/<your-repository-name> ...
<date>, <time> - Running /pytest_databricks.py ...
============================= test session starts ==============================
platform linux -- Python <version>, pytest-<version>, pluggy-<version>
rootdir: /Workspace/Repos/<someone@example.com>/<your-repository-name>
collected 1 item
spark_test.py . [100%]
============================== 1 passed in 3.25s ===============================
<date>, <time> - Done (took 10818ms)
Change authentication settings
You can change the Databricks authentication settings that the extension is set to, as follows.
With the extension and your code project opened, do the following:
In the Configuration pane, click the gear (Configure workspace) icon.
Follow the steps in Set up authentication.
Change the cluster
You can change the cluster that the extension is set to, as follows.
With the extension and your code project opened, and a Databricks configuration profile and cluster already set, select an existing Databricks cluster that you want to change to, or create a new Databricks cluster and change to it.
Change to an existing cluster
If you have an existing Databricks cluster that you want to change to, do one of the following:
In the Clusters pane, next to the cluster that you want to change to, click the plug (Attach cluster) icon.
Tip
If the cluster is not visible in the Clusters pane, click the filter (Filter clusters) icon to see All clusters, clusters that are Created by me, or Running clusters. Or, click the Refresh icon next to the filter icon.
The extension replaces the cluster’s ID in your code project’s .databricks/project.json file, for example "clusterId": "1234-567890-abcd12e3".
This procedure is complete.
In the Configuration pane, do the following:
Next to Cluster, click the gear (Configure cluster) icon.
In the Command Palette, click the cluster that you want to change to.
The extension replaces the cluster’s ID in your code project’s .databricks/project.json file, for example "clusterId": "1234-567890-abcd12e3".
This procedure is complete.
Create a new cluster and change to it
If you want to create a new Databricks cluster and change to it, do the following:
In the Configuration pane, next to Cluster, click the gear (Configure cluster) icon.
In the Command Palette, click Create New Cluster.
When prompted to open the external website (your Databricks workspace), click Open.
If prompted, sign in to your Databricks workspace.
Follow the instructions to create a cluster.
Note
Databricks recommends that you create a Personal Compute cluster. This enables you to start running workloads immediately, minimizing compute management overhead.
After the cluster is created and is running, go back to Visual Studio Code.
Do one of the following:
In the Clusters pane, next to the cluster that you want to change to, click the plug (Attach cluster) icon.
Tip
If the cluster is not visible, click the filter (Filter clusters) icon to see All clusters, clusters that are Created by me, or Running clusters. Or, click the Refresh icon.
The extension replaces the cluster’s ID in your code project’s .databricks/project.json file, for example "clusterId": "1234-567890-abcd12e3".
This procedure is complete.
In the Configuration pane, next to Cluster, click the gear (Configure cluster) icon.
In the Command Palette, click the cluster that you want to change to.
The extension replaces the cluster’s ID in your code project’s .databricks/project.json file, for example "clusterId": "1234-567890-abcd12e3".
Change the repository
You can change the repository that the extension is set to, as follows.
With the extension and your code project opened, and a Databricks configuration profile already set, select an existing repository in Databricks Repos that you want to change to, or create a new repository in Databricks Repos and change to it.
Note
The Databricks extension for Visual Studio Code works only with repositories that it creates. You cannot use an existing repository in your workspace unless you used the Databricks extension for Visual Studio Code earlier to create that repository, and you now want to reuse that repository in your current Visual Studio Code project.
Change to an existing repo
If you have an existing repository in Databricks Repos that you want to change to, then do the following:
In the Configuration pane, next to Sync Destination, click the gear (Configure sync destination) icon.
In the Command Palette, select the repository’s name from the list.
The extension replaces the repository’s workspace path in your code project’s .databricks/project.json file, for example "workspacePath": "/Workspace/Repos/someone@example.com/my-repo.ide".
Note
If the remote repo’s name does not match your local code project’s name, a warning icon appears with this message: The remote repo name does not match the current Visual Studio Code workspace name. You can ignore this warning if you intend to synchronize your local code project with a repo in your remote Databricks workspace and the names of your local code project and remote repo do not match.
After you set the repository, begin synchronizing with the repository by clicking the Start synchronization icon next to Sync Destination.
Warning
After you set the repository and then begin synchronizing, any existing files in your remote workspace repo that have the same filenames in your local code project will have their contents forcibly overwritten. This is because the Databricks extension for Visual Studio Code treats the files in your local code project as the “single source of truth” for both your local code project and its connected remote repo within your Databricks workspace.
Important
The Databricks extension for Visual Studio Code only performs one-way, automatic synchronization of file changes from your local Visual Studio Code project to the related repository in your remote Databricks workspace. Note that the reverse is not true: file changes in your remote Databricks workspace are not automatically synchronized to your local Visual Studio Code project. Therefore, Databricks does not recommend that you initiate file changes in your Databricks workspace. If you absolutely cannot avoid making such workspace-initiated changes, then you must take special action to have those changes show in your local Visual Studio Code project. For instructions, see Initiate repo changes from the workspace instead of from Visual Studio Code.
Create a new repo and change to it
If you want to create a new repository in Databricks Repos and change to it, do the following:
In the Configuration pane, next to Sync Destination, click the gear (Configure sync destination) icon.
In the Command Palette, click Create New Sync Destination.
Type a name for the new repository in Databricks Repos, and then press Enter.
The extension appends the characters .ide to the end of the repo’s name and then replaces the repository’s workspace path in your code project’s .databricks/project.json file, for example "workspacePath": "/Workspace/Repos/someone@example.com/my-repo.ide".
Note
If the remote repo’s name does not match your local code project’s name, a warning icon appears with this message: The remote repo name does not match the current Visual Studio Code workspace name. You can ignore this warning if you intend to synchronize your local code project with a repo in your remote Databricks workspace and the names of your local code project and remote repo do not match.
After you set the repository, begin synchronizing with the repository by clicking the Start synchronization icon next to Sync Destination.
Warning
After you set the repository and then begin synchronizing, any existing files in your remote workspace repo that have the same filenames in your local code project will have their contents forcibly overwritten. This is because the Databricks extension for Visual Studio Code treats the files in your local code project as the “single source of truth” for both your local code project and its connected remote repo within your Databricks workspace.
Important
The Databricks extension for Visual Studio Code only performs one-way, automatic synchronization of file changes from your local Visual Studio Code project to the related repository in your remote Databricks workspace. Note that the reverse is not true: file changes in your remote Databricks workspace are not automatically synchronized to your local Visual Studio Code project. Therefore, Databricks does not recommend that you initiate file changes in your Databricks workspace. If you absolutely cannot avoid making such workspace-initiated changes, then you must take special action to have those changes show in your local Visual Studio Code project. For instructions, see Initiate repo changes from the workspace instead of from Visual Studio Code.
Start or stop the cluster
You can start or stop the cluster that the extension is set to, as follows.
With the extension and your code project opened, and a Databricks configuration profile and cluster already set, do one of the following:
To start the cluster, in the Configuration pane, next to Cluster, click the play (Start Cluster) icon.
To stop the cluster, in the Configuration pane, next to Cluster, click the stop (Stop Cluster) icon.
Stop synchronizing the repository
You can stop synchronizing with the repository that the extension is set to, as follows.
With the extension and your code project opened, and a Databricks configuration profile already set, in the Configuration pane (next to Sync Destination), click the Stop synchronization icon.

To restart a stopped synchronization, click the Start synchronization icon.

Warning
After you set the repository and then begin synchronizing, any existing files in your remote workspace repo that have the same filenames in your local code project will have their contents forcibly overwritten. This is because the Databricks extension for Visual Studio Code treats the files in your local code project as the “single source of truth” for both your local code project and its connected remote repo within your Databricks workspace.
Important
The Databricks extension for Visual Studio Code only performs one-way, automatic synchronization of file changes from your local Visual Studio Code project to the related repository in your remote Databricks workspace. Note that the reverse is not true: file changes in your remote Databricks workspace are not automatically synchronized to your local Visual Studio Code project. Therefore, Databricks does not recommend that you initiate file changes in your Databricks workspace. If you absolutely cannot avoid making such workspace-initiated changes, then you must take special action to have those changes show in your local Visual Studio Code project. For instructions, see Initiate repo changes from the workspace instead of from Visual Studio Code.
Create a custom run configuration
You can create custom run configurations in Visual Studio Code to do things such as passing custom arguments to a job or a notebook, or creating different run settings for different files. For example, the following custom run configuration passes the --prod argument to the job:
{
"version": "0.2.0",
"configurations": [
{
"type": "databricks-workflow",
"request": "launch",
"name": "Run on Databricks as Workflow",
"program": "${file}",
"parameters": {},
"args": ["--prod"],
"preLaunchTask": "databricks: sync"
}
]
}
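A Python file run with this configuration could then read the flag, for example with argparse; a minimal sketch (the flag name matches the configuration above, everything else is illustrative):
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--prod", action="store_true", help="Run against production settings.")
args, _ = parser.parse_known_args()  # parse_known_args ignores any extra arguments

if args.prod:
    print("Running with production settings")
else:
    print("Running with development settings")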
To create a custom run configuration, click Run > Add Configuration from the main menu in Visual Studio Code. Then select either Databricks for a cluster-based run configuration or Databricks: Workflow for a job-based run configuration.
By using custom run configurations, you can also pass in command-line arguments and run your code just by pressing F5. For more information, see Launch configurations in the Visual Studio Code documentation.
Uninstall the extension
You can uninstall the Databricks extension for Visual Studio Code if needed, as follows:
In Visual Studio Code, click View > Extensions from the main menu.
In the list of extensions, select the Databricks for Visual Studio Code entry.
Click Uninstall.
Click Reload required, or restart Visual Studio Code.
Troubleshooting
Error when synchronizing through a proxy
Issue: When you try to run the Databricks extension for Visual Studio Code to synchronize your local code project through a proxy, an error message similar to the following appears, and the synchronization operation is unsuccessful: Get "https://<workspace-instance>/api/2.0/preview/scim/v2/Me": EOF.
Possible cause: Visual Studio Code does not know how to find the proxy.
Recommended solution: Restart Visual Studio Code from your terminal by running the following command, and then try synchronizing again:
env HTTPS_PROXY=<proxy-url>:<port> code
In the preceding command:
Replace <proxy-url> with the full URL to your proxy.
Replace <port> with the correct port on your proxy.
Error: “spawn unknown system error -86” when you try to synchronize local code
Issue: When you try to synchronize local code in a project to a remote Databricks workspace, the Terminal shows that synchronization has started but displays only the error message spawn unknown system error -86. Also, the Sync Destination section of the Configuration pane remains in a pending state.
Possible cause: The wrong version of the Databricks extension for Visual Studio Code is installed for your development machine’s operating system.
Recommended solution: Uninstall the extension, and then Install and open the extension for your development machine’s operating system from the beginning.
Send usage logs to Databricks
If you have issues synchronizing local code to a remote Databricks workspace, you can send usage logs and related information to Databricks Support by doing the following:
Turn on verbose mode for the bricks command-line interface (CLI) by checking the Bricks: Verbose Mode setting, or by setting databricks.bricks.verboseMode to true, as described in Settings.
Also turn on logging by checking the Logs: Enabled setting, or by setting databricks.logs.enabled to true, as described in Settings. Be sure to restart Visual Studio Code after you turn on logging.
Attempt to reproduce your issue.
From the Command Palette (View > Command Palette from the main menu), run the Databricks: Open full logs command.
Send the bricks-logs.json and sdk-and-extension-logs.json files that appear to Databricks Support.
Also copy the contents of the Terminal (View > Terminal) in the context of the issue, and send this content to Databricks Support.
To send error logs that are not about code synchronization issues to Databricks Support:
From the Command Palette (View > Command Palette), run the Databricks: Open full logs command.
Send only the sdk-and-extension-logs.json file that appears to Databricks Support.
The Output view (View > Output, Databricks Logs) shows truncated information if Logs: Enabled is checked or databricks.logs.enabled is set to true. To show more information, change the following settings, as described in Settings:
Logs: Max Array Length or databricks.logs.maxArrayLength
Logs: Max Field Length or databricks.logs.maxFieldLength
Logs: Truncation Depth or databricks.logs.truncationDepth
Command Palette
The Databricks extension for Visual Studio Code adds the following commands to the Visual Studio Code Command Palette. See also Command Palette in the Visual Studio Code documentation.
Command | Description
---|---
Databricks: Configure autocomplete for Databricks globals | Enables IntelliSense in the Visual Studio Code code editor for PySpark, Databricks Utilities, and related globals such as spark and dbutils.
Databricks: Configure cluster | Moves focus to the Command Palette to create, select, or change the Databricks cluster to use for the current project. See Set the cluster and Change the cluster.
Databricks: Configure sync destination | Moves focus to the Command Palette to create, select, or change the repository in Databricks Repos to use for the current project. See Set the repository and Change the repository.
Databricks: Configure workspace | Moves focus to the Command Palette to create, select, or change Databricks authentication details to use for the current project. See Set up authentication.
Databricks: Detach cluster | Removes the reference to the Databricks cluster from the current project.
Databricks: Detach sync destination | Removes the reference to the repository in Databricks Repos from the current project.
Databricks: Focus on Clusters View | Moves focus in the Databricks view to the Clusters pane.
Databricks: Focus on Configuration View | Moves focus in the Databricks view to the Configuration pane.
Databricks: Logout | Resets the Databricks view to show the Configure Databricks and Show Quickstart buttons in the Configuration pane. Any content in the current project’s .databricks/project.json file is also reset.
Databricks: Open Databricks configuration file | Opens the Databricks configuration profiles file, from the default location, for the current project. See Set up authentication.
Databricks: Open full logs | Opens the folder that contains the application log files that the Databricks extension for Visual Studio Code writes to your development machine.
Databricks: Show Quickstart | Shows the Quickstart file in the editor.
Databricks: Start synchronization | Starts synchronizing the current project’s code to the Databricks workspace. This command performs an incremental synchronization.
Databricks: Start synchronization (full sync) | Starts synchronizing the current project’s code to the Databricks workspace. This command performs a full synchronization, even if an incremental sync is possible.
Databricks: Stop synchronization | Stops synchronizing the current project’s code to the Databricks workspace. See Stop synchronizing the repository.
Settings
The Databricks extension for Visual Studio Code adds the following settings to Visual Studio Code. See also User and Workspace Settings in the Visual Studio Code documentation.
Setting (UI, Extensions > Databricks) | Setting (JSON, settings.json) | Description
---|---|---
Bricks: Verbose Mode | databricks.bricks.verboseMode | Checked or set to true to turn on verbose mode for the bricks command-line interface (CLI).
Clusters: Only Show Accessible Clusters | | Checked or set to true to show only the clusters that you can access.
Logs: Enabled | databricks.logs.enabled | Checked or set to true to enable logging.
Logs: Max Array Length | databricks.logs.maxArrayLength | The maximum number of items to show for array fields in the logs output panel.
Logs: Max Field Length | databricks.logs.maxFieldLength | The maximum length of each field displayed in the logs output panel.
Logs: Truncation Depth | databricks.logs.truncationDepth | The maximum depth of logs to show without truncation.
Override Databricks Config File | | An alternate location for the Databricks configuration profiles file that the extension uses.
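For example, to turn on verbose synchronization output and expand the logs output, you could add entries such as the following to your settings.json; only setting IDs named in this article are shown, and the numeric values are illustrative:
{
  "databricks.bricks.verboseMode": true,
  "databricks.logs.enabled": true,
  "databricks.logs.maxArrayLength": 10,
  "databricks.logs.maxFieldLength": 200,
  "databricks.logs.truncationDepth": 5
}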
Frequently asked questions (FAQs)
Do you have support for, or a timeline for support for, any of the following capabilities?
Step-through debugging
Other languages, such as Scala or SQL
Delta Live Tables
Databricks SQL warehouses
Other IDEs, such as PyCharm
Additional libraries
Full CI/CD integration
Authentication schemes in addition to Databricks personal access tokens
Databricks is aware of these requests and is prioritizing work to enable simple scenarios for local development and remote running of code. Please forward additional requests and scenarios to your Databricks representative. Databricks will incorporate your input into future planning.
How does the Databricks Terraform provider relate to the Databricks extension for Visual Studio Code?
Databricks continues to recommend the Databricks Terraform provider for managing your CI/CD pipelines in a predictable way. Please let your Databricks representative know how you might use an IDE to manage your deployments in the future. Databricks will incorporate your input into future planning.
How does Databricks Connect relate to the Databricks extension for Visual Studio Code?
Databricks Connect users will likely keep using it for the foreseeable future. Databricks is thinking about how the Databricks extension for Visual Studio Code might provide functionality in the future that is similar to Databricks Connect.
How does dbx by Databricks Labs relate to the Databricks extension for Visual Studio Code?
The main features of dbx by Databricks Labs include:
Project scaffolding.
Limited local development through the dbx execute command.
CI/CD for Databricks jobs.
The Databricks extension for Visual Studio Code enables local development and remotely running Python code files on Databricks clusters, and remotely running Python code files and notebooks in Databricks jobs. dbx can continue to be used for project scaffolding and CI/CD for Databricks jobs.
What happens if I already have an existing Databricks configuration profile that I created through the Databricks CLI?
You can select your existing configuration profile when you configure the Databricks extension for Visual Studio Code. With the extension and your code project opened, do the following:
In the Configuration pane, click the gear (Configure workspace) icon.
Enter your workspace instance URL, for example https://dbc-a1b2345c-d6e7.cloud.databricks.com.
In the Command Palette, select your existing configuration profile.
Which permissions do I need for a Databricks workspace to use the Databricks extension for Visual Studio Code?
You must have execute permissions for a Databricks cluster for running code, as well as permissions to create a repository in Databricks Repos.
Which settings must be enabled for a Databricks workspace to use the Databricks extension for Visual Studio Code?
The workspace must have the Files in Repos setting turned on. For instructions, see Configure support for Files in Repos. If you cannot turn on this setting yourself, contact your Databricks workspace administrator.
Can I use the Databricks extension for Visual Studio Code with a proxy?
Yes. See the recommended solution in Error when synchronizing through a proxy.
Can I use the Databricks extension for Visual Studio Code with an existing repository stored with a remote Git provider?
No. For a possible workaround, see the approach in Initiate repo changes from the workspace instead of from Visual Studio Code.