Databricks Connect for Scala

Note

This article covers Databricks Connect for Databricks Runtime 13.3 LTS and above.

Databricks Connect for Scala is in Public Preview.

This article demonstrates how to quickly get started with Databricks Connect by using Scala with IntelliJ IDEA and the Scala plugin. For the Python version of this article, see Databricks Connect for Python.

Databricks Connect enables you to connect popular IDEs such as IntelliJ IDEA, notebook servers, and other custom applications to Databricks clusters. See What is Databricks Connect?.

Tutorial

To skip this tutorial and use a different IDE instead, see Next steps.

Requirements

To complete this tutorial, you must meet the following requirements:

  • Your target Databricks workspace and cluster must meet the requirements for Cluster configuration for Databricks Connect.

  • You must have your cluster ID available. To get your cluster ID, in your workspace, click Compute on the sidebar. In your web browser’s address bar, copy the string of characters between clusters and configuration in the URL. For example, if the URL ends with clusters/1234-567890-abcde123/configuration, the cluster ID is 1234-567890-abcde123.

  • You have the Java Development Kit (JDK) installed on your development machine. Databricks recommends that the version of your JDK installation match the JDK version on your Databricks cluster. The following table shows the JDK version for each supported Databricks Runtime.

    Databricks Runtime version    JDK version
    14.0 ML, 14.0                 JDK 8
    13.3 ML LTS, 13.3 LTS         JDK 8

    Note

    If you do not have a JDK installed, or if you have multiple JDKs installed on your development machine, you can install or choose a specific JDK later in Step 3. Choosing a JDK installation whose version is lower or higher than the JDK version on your cluster might produce unexpected results, or your code might not run at all.

  • You have IntelliJ IDEA installed.

  • You have the Scala plugin for IntelliJ IDEA installed.

Step 1: Set up Databricks authentication

This tutorial uses Databricks personal access token authentication and a Databricks configuration profile for authenticating with your Databricks workspace.

If you already have a Databricks personal access token and a matching Databricks configuration profile, skip ahead to Step 3. If you are not sure whether you already have a Databricks personal access token, you can complete this step without affecting any other Databricks personal access tokens in your user account.

Note

Databricks Connect supports OAuth authentication in addition to Databricks personal access token authentication. For OAuth authentication setup and configuration details, see Set up the client.

Databricks Connect also supports basic authentication. However, Databricks does not recommend basic authentication in production.

To create a Databricks personal access token, do the following:

  1. In your Databricks workspace, click your Databricks username in the top bar, and then select User Settings from the drop-down menu.

  2. Click Developer.

  3. Next to Access tokens, click Manage.

  4. Click Generate new token.

  5. (Optional) Enter a comment that helps you to identify this token in the future, and change the token’s default lifetime of 90 days. To create a token with no lifetime (not recommended), leave the Lifetime (days) box empty (blank).

  6. Click Generate.

  7. Copy the displayed token to a secure location, and then click Done.

Note

Be sure to save the copied token in a secure location, and do not share it with others. If you lose the copied token, you cannot regenerate that exact same token; you must repeat this procedure to create a new one. If you lose the token, or you believe that the token has been compromised, Databricks strongly recommends that you immediately delete it from your workspace by clicking the trash can (Revoke) icon next to the token on the Access tokens page.

If you are not able to create or use tokens in your workspace, this might be because your workspace administrator has disabled tokens or has not given you permission to create or use them. See your workspace administrator for assistance.

Step 2: Create an authentication configuration profile

Create a Databricks authentication configuration profile to store necessary information about your personal access token on your local machine. Databricks developer tools and SDKs can use this configuration profile to quickly authenticate with your Databricks workspace.

To create a profile, do the following:

Note

The following procedure uses the Databricks CLI to create a Databricks configuration profile with the name DEFAULT. If you already have a DEFAULT configuration profile, this procedure overwrites your existing DEFAULT configuration profile.

To check whether you already have a DEFAULT configuration profile, and to view this profile’s settings if it exists, use the Databricks CLI to run the command databricks auth env --profile DEFAULT.

To create a configuration profile with a name other than DEFAULT, replace DEFAULT in the --profile option of the databricks configure command in the following step with the name that you want to use.

  1. Use the Databricks CLI to create a Databricks configuration profile named DEFAULT that uses Databricks personal access token authentication. To do this, run the following command:

    databricks configure --configure-cluster --profile DEFAULT
    
  2. For the prompt Databricks Host, enter your Databricks workspace instance URL, for example https://dbc-a1b2345c-d6e7.cloud.databricks.com.

  3. For the prompt Personal Access Token, enter the Databricks personal access token for your workspace.

  4. In the list of available clusters that appears, use your up arrow and down arrow keys to select the target Databricks cluster in your workspace, and then press Enter. You can also type any part of the cluster’s display name to filter the list of available clusters.
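
When the command finishes, the Databricks CLI writes the profile to the .databrickscfg file in your home directory. The resulting entry looks similar to the following sketch; the host, token, and cluster ID values shown here are placeholders, not real values:

    [DEFAULT]
    host       = https://dbc-a1b2345c-d6e7.cloud.databricks.com
    token      = dapi...
    cluster_id = 1234-567890-abcde123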

Step 3: Create the project

  1. Start IntelliJ IDEA.

  2. On the main menu, click File > New > Project.

  3. For Name, enter a meaningful name for your project.

  4. For Location, click the folder icon, and complete the on-screen directions to specify the path to your new Scala project.

  5. For Language, click Scala.

  6. For Build system, click sbt.

  7. In the JDK drop-down list, select an existing installation of the JDK on your development machine that matches the JDK version on your cluster, or select Download JDK and follow the on-screen instructions to download a JDK that matches the JDK version on your cluster.

    Note

    Choosing a JDK installation whose version is lower or higher than the JDK version on your cluster might produce unexpected results, or your code might not run at all.

  8. In the sbt drop-down list, select the latest version.

  9. In the Scala drop-down list, select the version of Scala that matches the Scala version on your cluster. The following table shows the Scala version for each supported Databricks Runtime:

    Databricks Runtime version    Scala version
    14.0 ML, 14.0                 2.12.15
    13.3 ML LTS, 13.3 LTS         2.12.15

    Note

    Choosing a Scala version that is lower or higher than the Scala version on your cluster might produce unexpected results, or your code might not run at all.

  10. Make sure the Download sources box next to Scala is checked.

  11. For Package prefix, enter a package prefix value for your project’s sources, for example org.example.application.

  12. Make sure the Add sample code box is checked.

  13. Click Create.

Step 4: Add the Databricks Connect package

  1. With your new Scala project open, in your Project tool window (View > Tool Windows > Project), open the file named build.sbt in your project’s root directory (project-name > build.sbt).

  2. Add the following code to the end of the build.sbt file, which declares your project’s dependency on a specific version of the Databricks Connect library for Scala:

    libraryDependencies += "com.databricks" % "databricks-connect" % "13.3.0"
    

    Replace 13.3.0 with the version of the Databricks Connect library that matches the Databricks Runtime version on your cluster. You can find the Databricks Connect library version numbers in the Maven Central Repository. (See the minimal build.sbt sketch after this list.)

  3. Click the Load sbt changes notification icon to update your Scala project with the new library location and dependency.

  4. Wait until the sbt progress indicator at the bottom of the IDE disappears. The sbt load process might take a few minutes to complete.
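
After the load completes, the relevant parts of your build.sbt should look similar to the following minimal sketch. The Scala version and Databricks Connect version shown here are illustrative; yours should match your own cluster:

    scalaVersion := "2.12.15"

    libraryDependencies += "com.databricks" % "databricks-connect" % "13.3.0"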

Step 5: Add code

  1. In your Project tool window, open the file named Main.scala, in project-name > src > main > scala.

  2. Replace any existing code in the file with the following code and then save the file:

    package org.example.application
    
    import com.databricks.connect.DatabricksSession
    import org.apache.spark.sql.SparkSession
    
    object Main {
      def main(args: Array[String]): Unit = {
        // Connect to the remote Databricks cluster by using the DEFAULT
        // configuration profile that you created in Step 2.
        val spark: SparkSession = DatabricksSession.builder().remote().getOrCreate()
    
        // Read a sample table on the cluster and show its first 5 rows.
        val df = spark.read.table("samples.nyctaxi.trips")
        df.limit(5).show()
      }
    }
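
Because DatabricksSession returns a standard SparkSession, the usual Spark DataFrame API works unchanged, and operations run on the remote cluster. As an illustrative extension (assuming that the samples.nyctaxi.trips table contains fare_amount and pickup_zip columns), the following sketch computes the five pickup ZIP codes with the highest average fare:

    package org.example.application
    
    import com.databricks.connect.DatabricksSession
    import org.apache.spark.sql.functions.{avg, desc}
    
    object TripStats {
      def main(args: Array[String]): Unit = {
        val spark = DatabricksSession.builder().remote().getOrCreate()
    
        // The aggregation runs on the cluster; only the 5 result rows
        // are returned to your development machine.
        spark.read.table("samples.nyctaxi.trips")
          .groupBy("pickup_zip")
          .agg(avg("fare_amount").alias("avg_fare"))
          .orderBy(desc("avg_fare"))
          .limit(5)
          .show()
      }
    }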
    

Step 6: Run the code

  1. Start the target cluster in your remote Databricks workspace.

  2. After the cluster has started, on the main menu, click Run > Run ‘Main’.

  3. In the Run tool window (View > Tool Windows > Run), on the Main tab, the first 5 rows of the samples.nyctaxi.trips table appear.

Step 7: Debug the code

  1. With the target cluster still running, in the preceding code, click the gutter next to df.limit(5).show() to set a breakpoint.

  2. On the main menu, click Run > Debug ‘Main’.

  3. In the Debug tool window (View > Tool Windows > Debug), on the Console tab, click the calculator (Evaluate Expression) icon.

  4. Enter the expression df.schema and click Evaluate to show the DataFrame’s schema.

  5. In the Debug tool window’s sidebar, click the green arrow (Resume Program) icon.

  6. In the Console pane, the first 5 rows of the samples.nyctaxi.trips table appear.
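
If you want to inspect the schema without attaching the debugger, the standard Spark printSchema() method prints the same information to the console. A minimal sketch, assuming the same session setup and sample table as in Step 5:

    package org.example.application
    
    import com.databricks.connect.DatabricksSession
    
    object SchemaCheck {
      def main(args: Array[String]): Unit = {
        val spark = DatabricksSession.builder().remote().getOrCreate()
    
        // Prints each column's name, data type, and nullability,
        // equivalent to evaluating df.schema in the debugger.
        spark.read.table("samples.nyctaxi.trips").printSchema()
      }
    }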

Next steps

To learn more about Databricks Connect, see articles such as What is Databricks Connect? and Databricks Connect for Python.