This tutorial demonstrates how to quickly get started with the Databricks extension for Visual Studio Code by running a basic Python code file on a Databricks cluster in your remote workspace.
The Databricks extension for Visual Studio Code enables you to connect to your remote Databricks workspaces from the Visual Studio Code integrated development environment (IDE) running on your local development machine. Through these connections, you can:
Synchronize local code that you develop in Visual Studio Code with code in your remote workspaces.
Run local Python code files from Visual Studio Code on Databricks clusters in your remote workspaces.
Run local Python code files (.py) and Python, R, Scala, and SQL (.sql) notebooks from Visual Studio Code as automated Databricks jobs in your remote workspaces.
The Databricks extension for Visual Studio Code supports running R, Scala, and SQL notebooks as automated jobs but does not provide any deeper support for these languages within Visual Studio Code.
The following hands-on tutorial assumes:
Visual Studio Code is already running and has a local project opened.
You have already generated a Databricks personal access token for your target Databricks workspace. See Databricks personal access token authentication.
You have already added your Databricks personal access token as a token field, along with your workspace instance URL (for example, https://dbc-a1b2345c-d6e7.cloud.databricks.com) as a host field, to the DEFAULT configuration profile in your local .databrickscfg file. See Databricks configuration profiles.
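For reference, a DEFAULT profile that satisfies this prerequisite looks like the following sketch. The host value is the example workspace URL from this tutorial; replace the token placeholder with your own personal access token.

```ini
[DEFAULT]
host  = https://dbc-a1b2345c-d6e7.cloud.databricks.com
token = <your-personal-access-token>
```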
Follow these steps:
Install the extension: on the Databricks extension for Visual Studio Code page in the Visual Studio Code Marketplace, click Install. To complete the installation, follow the on-screen instructions.
Open the extension: On the sidebar, click the Databricks logo.
Start configuring the extension: In the Configuration pane, click Configure Databricks.
Set the Databricks workspace: In the Command Palette, for Databricks Host, enter your workspace instance URL, for example
https://dbc-a1b2345c-d6e7.cloud.databricks.com. Then press Enter.
Click the entry DEFAULT: Authenticate using the DEFAULT profile.
Set the Databricks cluster: In the Configuration pane, click Cluster, and then click the gear (Configure cluster) icon.
Click the entry for the cluster that you want to use.
Start the cluster, if it is not already started: In the Configuration pane, next to Cluster, click the play (Start Cluster) icon.
Set the sync destination: In the Configuration pane, click Sync Destination, and then click the gear (Configure sync destination) icon.
In the Command Palette, click the sync destination name that is randomly generated by the extension.
Create a basic, local Python code file to sync and run: On the sidebar, click the Explorer logo.
On the main menu, click File > New File. Name the file demo.py and save it to the project root.
Add the following code to the file and then save it. This code creates and displays the contents of a basic PySpark DataFrame:
from pyspark.sql import SparkSession
from pyspark.sql.types import *

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField('CustomerID', IntegerType(), False),
    StructField('FirstName', StringType(), False),
    StructField('LastName', StringType(), False)
])

data = [
    [1000, 'Mathijs', 'Oosterhout-Rijntjes'],
    [1001, 'Joost', 'van Brunswijk'],
    [1002, 'Stan', 'Bokenkamp']
]

customers = spark.createDataFrame(data, schema)
customers.show()

# Output:
#
# +----------+---------+-------------------+
# |CustomerID|FirstName|           LastName|
# +----------+---------+-------------------+
# |      1000|  Mathijs|Oosterhout-Rijntjes|
# |      1001|    Joost|      van Brunswijk|
# |      1002|     Stan|          Bokenkamp|
# +----------+---------+-------------------+
In the Configuration pane, next to Sync Destination, click the circled arrows (Start synchronization) icon.
In the Explorer view, right-click the demo.py file, and then click Upload and Run File on Databricks. The output appears in the Debug Console pane.
Now that you have successfully used the Databricks extension for Visual Studio Code to run a basic Python file, learn more about how to use the extension:
Learn about additional ways to set up authentication for the extension, beyond Databricks personal access token authentication. See Authentication setup for the Databricks extension for Visual Studio Code.
Learn how to enable PySpark and Databricks Utilities code completion, run or debug Python code with Databricks Connect, run a file or a notebook as a Databricks job, run tests with
pytest, use environment variable definitions files, create custom run configurations, and more. See Development tasks for the Databricks extension for Visual Studio Code.
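As a taste of the pytest workflow mentioned above, here is a minimal, hedged sketch of a hypothetical test_demo.py. It assumes you factor the row data from demo.py into a helper function so the logic can be checked without a live Spark session; the helper name is illustrative, not part of the extension or Databricks APIs.

```python
# Hypothetical test_demo.py: a minimal pytest-style sketch.
# Assumes the row data from demo.py is factored into a helper
# so it can be tested locally without a Spark session.

def make_customer_rows():
    # Same rows as the data list in demo.py, as plain Python lists.
    return [
        [1000, 'Mathijs', 'Oosterhout-Rijntjes'],
        [1001, 'Joost', 'van Brunswijk'],
        [1002, 'Stan', 'Bokenkamp'],
    ]

def test_customer_rows_shape():
    rows = make_customer_rows()
    assert len(rows) == 3
    # Every row carries a CustomerID, FirstName, and LastName.
    assert all(len(row) == 3 for row in rows)

def test_customer_ids_are_unique():
    ids = [row[0] for row in make_customer_rows()]
    assert len(ids) == len(set(ids))
```

Running pytest from the project root would discover and execute both tests; because they avoid Spark entirely, they run quickly on your local machine.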