VSCode extension for Databricks tutorial: Run Python on a cluster and as a job

This tutorial demonstrates how to get started with the Databricks extension for Visual Studio Code by running a basic Python code file on a Databricks cluster and as a Databricks job run in your remote workspace. See What is the Databricks extension for Visual Studio Code?.

What will you do in this tutorial?

In this hands-on tutorial, you do the following:

  • Create a Databricks cluster to run your local Python code on.

  • Install Visual Studio Code and the Databricks extension for Visual Studio Code.

  • Set up Databricks authentication and configure the Databricks extension for Visual Studio Code with this information.

  • Configure the Databricks extension for Visual Studio Code with information about your remote cluster, and have the extension start the cluster.

  • Configure the Databricks extension for Visual Studio Code with the location in your remote Databricks workspace to upload your local Python code to, and have the extension start listening for code upload events.

  • Write and save some Python code, which triggers a code upload event.

  • Use the Databricks extension for Visual Studio Code to run the uploaded code on your remote cluster and then to run it with your cluster as a remote job run.

This tutorial covers only how to run a Python code file, and only how to set up OAuth user-to-machine (U2M) authentication. To learn how to debug Python code files, run and debug notebooks, and set up other authentication types, see Next steps.

Step 1: Create a cluster

If you already have a remote Databricks cluster that you want to use, make a note of the cluster’s name, and skip ahead to the next step. To view your available clusters, in your workspace’s sidebar, click Compute.

Databricks recommends that you create a Personal Compute cluster to get started quickly. To create this cluster, do the following:

  1. In your Databricks workspace, on the sidebar, click Compute.

  2. Click Create with Personal Compute.

  3. Click Create compute.

  4. Make a note of your cluster’s name, as you will need it later in Step 5.

Step 2: Install Visual Studio Code

To install Visual Studio Code, follow the instructions for macOS, Linux, or Windows.

If you already have Visual Studio Code installed, check whether it is version 1.69.1 or above. To do this, in Visual Studio Code, on the main menu, click Code > About Visual Studio Code for macOS or Help > About for Linux or Windows.

To update Visual Studio Code, on the main menu, click Code > Check for Updates for macOS or Help > Check for Updates for Linux or Windows.

Step 3: Install the Databricks extension

  1. In the Visual Studio Code sidebar, click the Extensions icon.

  2. In Search Extensions in Marketplace, enter Databricks.

  3. In the entry labeled Databricks with the subtitle IDE support for Databricks by Databricks, click Install.

Step 4: Set up Databricks authentication

In this step, you enable authentication between the Databricks extension for Visual Studio Code and your remote Databricks workspace, as follows:

  1. From Visual Studio Code, open the folder that will contain the Python code that you add in Step 7. On the main menu, click File > Open Folder and follow the on-screen directions.

  2. On the Visual Studio Code sidebar, click the Databricks logo icon.

  3. In the Configuration pane, click Configure Databricks.

  4. In the Command Palette, for Databricks Host, enter your workspace instance URL, for example https://dbc-a1b2345c-d6e7.cloud.databricks.com. Then press Enter.

  5. Select OAuth (user to machine).

  6. Complete the on-screen instructions in your web browser to finish authenticating with your Databricks account. If prompted, allow all-apis access.
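
When authentication succeeds, the extension records a matching connection profile in your local `.databrickscfg` file. As a rough sketch of what such a profile can look like (the host URL is the placeholder from step 4, and the exact field names may vary by extension version):

```
[DEFAULT]
host      = https://dbc-a1b2345c-d6e7.cloud.databricks.com
auth_type = databricks-cli
```

You normally do not need to edit this file by hand; the extension manages it for you.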

Step 5: Add cluster information to the Databricks extension and start the cluster

  1. With the Configuration pane already open from the previous step, next to Cluster, click the gear (Configure cluster) icon.

  2. In the Command Palette, select the name of the cluster that you created in Step 1.

  3. If the cluster is not already running, start it: next to Cluster, click the play (Start Cluster) icon if it is visible.


Step 6: Add the code upload location to the Databricks extension and start the upload listener

  1. With the Configuration pane already open from the previous step, next to Sync Destination, click the gear (Configure sync destination) icon.

  2. In the Command Palette, select Create New Sync Destination.

  3. Press Enter to confirm the generated remote upload directory name.

  4. If the upload listener is not already running, start it: next to Sync Destination, click the arrowed circle (Start synchronization) icon if it is visible.


Step 7: Create and run Python code

  1. Create a local Python code file: on the sidebar, click the folder (Explorer) icon.

  2. On the main menu, click File > New File. Name the file demo.py and save it to the project’s root.

  3. Add the following code to the file and then save it. This code creates and displays the contents of a basic PySpark DataFrame:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType
    
    spark = SparkSession.builder.getOrCreate()
    
    schema = StructType([
        StructField('CustomerID', IntegerType(), False),
        StructField('FirstName',  StringType(),  False),
        StructField('LastName',   StringType(),  False)
    ])
    
    data = [
        [ 1000, 'Mathijs', 'Oosterhout-Rijntjes' ],
        [ 1001, 'Joost',   'van Brunswijk' ],
        [ 1002, 'Stan',    'Bokenkamp' ]
    ]
    
    customers = spark.createDataFrame(data, schema)
    customers.show()
    
    # Output:
    #
    # +----------+---------+-------------------+
    # |CustomerID|FirstName|           LastName|
    # +----------+---------+-------------------+
    # |      1000|  Mathijs|Oosterhout-Rijntjes|
    # |      1001|    Joost|      van Brunswijk|
    # |      1002|     Stan|          Bokenkamp|
    # +----------+---------+-------------------+
  4. In the Explorer view, right-click the demo.py file, and then click Upload and Run File on Databricks. The output appears in the Debug Console pane.
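
If you want to sanity-check the rows locally before uploading, you can build the same table with plain pandas (pandas is an assumption here, not something the tutorial requires; the cluster run above uses PySpark only):

```python
import pandas as pd

# Same rows and column names as demo.py, inspected locally
data = [
    [1000, 'Mathijs', 'Oosterhout-Rijntjes'],
    [1001, 'Joost',   'van Brunswijk'],
    [1002, 'Stan',    'Bokenkamp'],
]
columns = ['CustomerID', 'FirstName', 'LastName']

df = pd.DataFrame(data, columns=columns)
print(df.to_string(index=False))
```

This is purely a local check; the remote run in step 4 above does not depend on it.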


Step 8: Run the code as a job

In the previous step, you ran your Python code directly on the remote cluster. In this step, you initiate a workflow that uses the cluster to run the code as a Databricks job instead. See What is Databricks Jobs?.

To run this code as a job, in the Explorer view, right-click the demo.py file, and then click Run File as Workflow on Databricks. The output appears in a separate editor tab next to the demo.py file editor.
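
Under the hood, a one-off job run of a Python file is described by a job run spec containing a spark_python_task. A minimal sketch of such a spec (the cluster ID and workspace path are illustrative placeholders, not what the extension literally generates):

```
{
  "run_name": "demo.py workflow run",
  "tasks": [
    {
      "task_key": "demo",
      "existing_cluster_id": "<your-cluster-id>",
      "spark_python_task": {
        "python_file": "/Workspace/<your-sync-destination>/demo.py"
      }
    }
  ]
}
```

The extension assembles and submits the equivalent of this spec for you, so you never have to write it by hand for this tutorial.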


Next steps

Now that you have successfully used the Databricks extension for Visual Studio Code to upload a local Python file and run it remotely, learn more about how to use the extension: