VSCode extension for Databricks tutorial: Run Python on a cluster
This tutorial demonstrates how to quickly get started with the Databricks extension for Visual Studio Code by running a basic Python code file on a Databricks cluster in your remote workspace.
What does the Databricks extension do?
The Databricks extension for Visual Studio Code enables you to connect to your remote Databricks workspaces from the Visual Studio Code integrated development environment (IDE) running on your local development machine. Through these connections, you can:
Synchronize local code that you develop in Visual Studio Code with code in your remote workspaces.
Run local Python code files from Visual Studio Code on Databricks clusters in your remote workspaces.
Run local Python code files (.py) and Python, R, Scala, and SQL notebooks (.py, .ipynb, .r, .scala, and .sql) from Visual Studio Code as automated Databricks jobs in your remote workspaces.
Note
The Databricks extension for Visual Studio Code supports running R, Scala, and SQL notebooks as automated jobs but does not provide any deeper support for these languages within Visual Studio Code.
Requirements
This hands-on tutorial assumes the following:
You already have Visual Studio Code 1.69.1 or higher installed and configured for Python coding. See Setting up Visual Studio Code and Getting Started with Python in VS Code.
Visual Studio Code is already running and has a local project opened.
You have already generated a Databricks personal access token for your target Databricks workspace. See Databricks personal access token authentication.
You have already added your Databricks personal access token as the token field, along with your workspace instance URL (for example https://dbc-a1b2345c-d6e7.cloud.databricks.com) as the host field, to the DEFAULT configuration profile in your local .databrickscfg file. See Databricks configuration profiles.
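A DEFAULT profile in .databrickscfg follows INI syntax, as in the sketch below. The host URL and the token value shown are placeholders; substitute your own workspace URL and personal access token.

```ini
; ~/.databrickscfg (placeholder values - replace with your own)
[DEFAULT]
host  = https://dbc-a1b2345c-d6e7.cloud.databricks.com
token = <your-personal-access-token>
```

On Linux and macOS this file lives in your home directory; on Windows, in %USERPROFILE%.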
Configure the VSCode extension
Follow these steps:
Install the extension: on the Databricks extension for Visual Studio Code page in the Visual Studio Code Marketplace, click Install. To complete the installation, follow the on-screen instructions.
Open the extension: On the sidebar, click the Databricks logo.
Start configuring the extension: In the Configuration pane, click Configure Databricks.
Set the Databricks workspace: In the Command Palette, for Databricks Host, enter your workspace instance URL, for example https://dbc-a1b2345c-d6e7.cloud.databricks.com. Then press Enter.
Click the entry DEFAULT: Authenticate using the DEFAULT profile.
Set the Databricks cluster: In the Configuration pane, click Cluster, and then click the gear (Configure cluster) icon.
Click the entry for the cluster that you want to use.
Start the cluster, if it is not already started: In the Configuration pane, next to Cluster, click the play (Start Cluster) icon.
Set the sync destination: In the Configuration pane, click Sync Destination, and then click the gear (Configure sync destination) icon.
In the Command Palette, click the sync destination name that is randomly generated by the extension.
Create and run Python code
Create a basic, local Python code file to sync and run: On the sidebar, click the Explorer logo.
On the main menu, click File > New File. Name the file demo.py and save it to the project root.
Add the following code to the file and then save it. This code creates and displays the contents of a basic PySpark DataFrame:
from pyspark.sql import SparkSession
from pyspark.sql.types import *

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField('CustomerID', IntegerType(), False),
    StructField('FirstName', StringType(), False),
    StructField('LastName', StringType(), False)
])

data = [
    [ 1000, 'Mathijs', 'Oosterhout-Rijntjes' ],
    [ 1001, 'Joost', 'van Brunswijk' ],
    [ 1002, 'Stan', 'Bokenkamp' ]
]

customers = spark.createDataFrame(data, schema)
customers.show()

# Output:
#
# +----------+---------+-------------------+
# |CustomerID|FirstName|           LastName|
# +----------+---------+-------------------+
# |      1000|  Mathijs|Oosterhout-Rijntjes|
# |      1001|    Joost|      van Brunswijk|
# |      1002|     Stan|          Bokenkamp|
# +----------+---------+-------------------+
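If you want to sanity-check the sample data locally before syncing it to the cluster, a minimal sketch using only the Python standard library (no Spark required) is shown below. It simply verifies that each row matches the three-column (int, str, str) shape that the schema declares; this check is an illustration, not part of the tutorial's demo.py.

```python
# Optional local sanity check for the sample rows (no Spark needed).
# Each row must have exactly three values typed (int, str, str),
# matching the CustomerID, FirstName, LastName schema above.
data = [
    [1000, 'Mathijs', 'Oosterhout-Rijntjes'],
    [1001, 'Joost', 'van Brunswijk'],
    [1002, 'Stan', 'Bokenkamp'],
]

expected_types = (int, str, str)  # CustomerID, FirstName, LastName

for row in data:
    assert len(row) == len(expected_types), f"wrong column count: {row}"
    for value, expected in zip(row, expected_types):
        assert isinstance(value, expected), f"unexpected type in row: {row}"

print("All rows match the declared schema.")
```

Running such a check locally catches shape and type mistakes before they surface as a Spark error on the remote cluster.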
In the Configuration pane, next to Sync Destination, click the circled arrows (Start synchronization) icon.
In the Explorer view, right-click the demo.py file, and then click Upload and Run File on Databricks. The output appears in the Debug Console pane.
Next steps
Now that you have successfully used the Databricks extension for Visual Studio Code to run a basic Python file, learn more about how to use the extension:
Learn about additional ways to set up authentication for the extension, beyond Databricks personal access token authentication. See Authentication setup for the Databricks extension for Visual Studio Code.
Learn how to enable PySpark and Databricks Utilities code completion, run or debug Python code with Databricks Connect, run a file or a notebook as a Databricks job, run tests with pytest, use environment variable definitions files, create custom run configurations, and more. See Development tasks for the Databricks extension for Visual Studio Code.
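As a small illustration of the pytest workflow mentioned above, the sketch below shows the shape of a test file the extension could run. The helper function and file name are assumptions for illustration only; they are not part of the extension or of this tutorial's demo.py.

```python
# test_demo.py - a minimal, hypothetical pytest example.
# pytest discovers functions named test_* in files named test_*.py.

def full_name(first: str, last: str) -> str:
    """Join a customer's first and last name (illustrative helper)."""
    return f"{first} {last}"

def test_full_name():
    assert full_name("Mathijs", "Oosterhout-Rijntjes") == "Mathijs Oosterhout-Rijntjes"

def test_full_name_other_customer():
    assert full_name("Stan", "Bokenkamp") == "Stan Bokenkamp"
```

Locally, you would run this with pytest test_demo.py from the project root.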