Databricks Connect for Scala tutorial
Preview
This feature is in Public Preview.
This article demonstrates how to quickly get started with Databricks Connect by using Scala with IntelliJ IDEA and the Scala plugin. For the Python version of this tutorial, see the Databricks Connect for Python tutorial.
Databricks Connect enables you to connect popular IDEs such as IntelliJ IDEA, notebook servers, and other custom applications to Databricks clusters.
Note
This article covers Databricks Connect for Databricks Runtime 13.0 and above.
For information beyond this tutorial about Databricks Connect for Databricks Runtime 13.0 and above, see the Databricks Connect reference.
For information about Databricks Connect for prior Databricks Runtime versions, see Databricks Connect for Databricks Runtime 12.2 LTS and below.
Requirements
You have access to a Databricks workspace and its corresponding account that are enabled for Unity Catalog. See Get started using Unity Catalog and Enable a workspace for Unity Catalog.
You have a Databricks cluster in the workspace. The cluster has Databricks Runtime 13.3 LTS or above installed. The cluster must also use a cluster access mode of Single User or Shared. See Access modes.
Note
Scala is not supported on Databricks Connect for Databricks Runtime 13.2 and below.
You have the Java Development Kit (JDK) installed on your development machine. Databricks recommends that the JDK version you use match the JDK version on your Databricks cluster. The following table shows the JDK version for each supported Databricks Runtime:
Databricks Runtime version | JDK version
14.0 ML, 14.0              | JDK 8
13.3 ML LTS, 13.3 LTS      | JDK 8
Note
If you do not have a JDK installed, or if you have multiple JDK installs on your development machine, you can install or choose a specific JDK later in Step 1. Choosing a JDK install that is below or above the JDK version on your cluster might produce unexpected results, or your code might not run at all.
You have IntelliJ IDEA installed.
You have the Scala plugin for IntelliJ IDEA installed.
To complete this tutorial, follow these steps:
Step 1: Set up Databricks authentication
This tutorial uses Databricks personal access token authentication for authenticating with your Databricks workspace. If you already have a Databricks personal access token, skip ahead to Step 2.
To create a Databricks personal access token, do the following:
In your Databricks workspace, click your Databricks username in the top bar, and then select User Settings from the drop-down menu.
Click Developer.
Next to Access tokens, click Manage.
Click Generate new token.
(Optional) Enter a comment that helps you to identify this token in the future, and change the token’s default lifetime of 90 days. To create a token with no lifetime (not recommended), leave the Lifetime (days) box empty (blank).
Click Generate.
Copy the displayed token to a secure location, and then click Done.
Be sure to save the copied token in a secure location. Do not share your copied token with others. If you lose the copied token, you cannot regenerate that exact same token. Instead, you must repeat this procedure to create a new token. If you lose the copied token, or you believe that the token has been compromised, Databricks strongly recommends that you immediately delete that token from your workspace by clicking the X next to the token on the Access tokens page.
Note
If you are not able to create or use tokens in your workspace, this might be because your workspace administrator has disabled tokens or has not given you permission to create or use tokens. Check with your workspace administrator.
Step 2: Create the project
Start IntelliJ IDEA.
Click File > New > Project.
For Name, enter a meaningful name for your project.
For Location, click the folder icon, and complete the on-screen directions to specify the path to your new Scala project.
For Language, click Scala.
For Build system, click sbt.
In the JDK drop-down list, select an existing installation of the JDK on your development machine that matches the JDK version on your cluster, or select Download JDK and follow the on-screen instructions to download a JDK that matches the JDK version on your cluster.
Note
Choosing a JDK install that is above or below the JDK version on your cluster might produce unexpected results, or your code might not run at all.
In the sbt drop-down list, select the latest version.
In the Scala drop-down list, select the version of Scala that matches the Scala version on your cluster. The following table shows the Scala version for each supported Databricks Runtime:
Databricks Runtime version | Scala version
14.0 ML, 14.0              | 2.12.15
13.3 ML LTS, 13.3 LTS      | 2.12.15
Note
Choosing a Scala version that is below or above the Scala version on your cluster might produce unexpected results, or your code might not run at all.
Make sure the Download sources box next to Scala is checked.
For Package prefix, enter a package prefix value for your project's sources, for example org.example.application.
Make sure the Add sample code box is checked.
Click Create.
Step 3: Add the Databricks Connect package
With your new Scala project open, in your Project tool window (View > Tool Windows > Project), open the file named build.sbt in project-name.
Add the following code to the end of the build.sbt file, which declares your project's dependency on a specific version of the Databricks Connect library for Scala:

libraryDependencies += "com.databricks" % "databricks-connect" % "14.0.0"
Replace 14.0.0 with the version of the Databricks Connect library that matches the Databricks Runtime version on your cluster. You can find the Databricks Connect library version numbers in the Maven Central Repository.
Click the Load sbt changes notification icon to update your Scala project with the new library location and dependency.
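For reference, here is a minimal sketch of what the finished build.sbt might look like, assuming Scala 2.12.15 and a cluster on Databricks Runtime 13.3 LTS; take the exact library version from the Maven Central Repository rather than from this example:

```scala
// build.sbt — minimal sketch; versions shown here are examples.
scalaVersion := "2.12.15"

// Match the databricks-connect version to your cluster's Databricks Runtime.
libraryDependencies += "com.databricks" % "databricks-connect" % "13.3.0"
```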
Wait until the sbt progress indicator at the bottom of the IDE disappears. The sbt load process might take a few minutes to complete.
Step 4: Add code
In your Project tool window, open the file named Main.scala, in project-name > src > main > scala.
Replace any existing code in the file with the following code and then save the file:

package org.example.application

import com.databricks.connect.DatabricksSession
import org.apache.spark.sql.SparkSession

object Main {
  def main(args: Array[String]): Unit = {
    val spark = DatabricksSession.builder().remote().getOrCreate()
    val df = spark.read.table("samples.nyctaxi.trips")
    df.limit(5).show()
  }
}
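The remote() builder call above picks up connection details from the environment variables that you set in Step 5. Some versions of the library also expose explicit setter methods on the builder; a hedged sketch of that alternative (with a hypothetical host and cluster ID, and the token still read from the environment to avoid hardcoding secrets) might look like the following. If these methods are not available in your library version, use the environment-variable approach instead:

```scala
package org.example.application

import com.databricks.connect.DatabricksSession

object MainExplicit {
  def main(args: Array[String]): Unit = {
    // Hypothetical workspace and cluster values — substitute your own.
    val spark = DatabricksSession.builder()
      .host("https://dbc-a1b2345c-d6e7.cloud.databricks.com")
      .token(sys.env("DATABRICKS_TOKEN")) // never hardcode tokens in source
      .clusterId("1234-567890-abcde123")
      .getOrCreate()

    spark.read.table("samples.nyctaxi.trips").limit(5).show()
  }
}
```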
Step 5: Add environment variables
In the Project tool window, right-click the Main.scala file and click Modify Run Configuration.
For Environment variables, enter the following string:

DATABRICKS_HOST=<workspace-instance-name>;DATABRICKS_TOKEN=<personal-access-token>;DATABRICKS_CLUSTER_ID=<cluster-id>
In the preceding string, replace the following placeholders:
Replace <workspace-instance-name> with your workspace instance name, for example dbc-a1b2345c-d6e7.cloud.databricks.com.
Replace <personal-access-token> with the value of the Databricks personal access token for your Databricks workspace user.
Replace <cluster-id> with the value of your cluster's ID. To get a cluster's ID, see Cluster URL and ID.
Click OK.
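If the connection fails later, a common cause is that these variables are not visible to the JVM that IntelliJ launches. A standalone sketch (not part of the tutorial code) for checking this from Scala:

```scala
// Prints whether each expected variable is set in the running JVM's environment.
object EnvCheck {
  def main(args: Array[String]): Unit = {
    Seq("DATABRICKS_HOST", "DATABRICKS_TOKEN", "DATABRICKS_CLUSTER_ID").foreach { name =>
      val status = if (sys.env.contains(name)) "set" else "MISSING"
      println(s"$name: $status")
    }
  }
}
```

Run it with the same run configuration you modified above; any line that prints MISSING means that variable was not passed to the process.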
Step 6: Run the code
Start the target cluster in your remote Databricks workspace.
After the cluster has started, on the main menu, click Run > Run ‘Main’.
In the Run tool window (View > Tool Windows > Run), on the Main tab, the first 5 rows of the samples.nyctaxi.trips table appear.
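Once the basic query works, you can run richer DataFrame operations against the same table inside main. The following sketch assumes the samples.nyctaxi.trips table has trip_distance and fare_amount columns; adjust the names if your copy of the sample data differs:

```scala
import org.apache.spark.sql.functions.{avg, count}

// Count trips longer than 10 miles and compute their average fare.
val trips = spark.read.table("samples.nyctaxi.trips")
trips
  .filter(trips("trip_distance") > 10)
  .agg(count("*").as("long_trips"), avg("fare_amount").as("avg_fare"))
  .show()
```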
Step 7: Debug the code
With the target cluster still running, in the preceding code, click the gutter next to df.limit(5).show() to set a breakpoint.
Click Run > Debug ‘Main’.
In the Debug tool window (View > Tool Windows > Debug), on the Console tab, click the calculator (Evaluate Expression) icon.
Enter the expression df.schema and click Evaluate to show the DataFrame’s schema.
In the Debug tool window’s sidebar, click the green arrow (Resume Program) icon.
In the Console pane, the first 5 rows of the samples.nyctaxi.trips table appear.
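The schema you inspected in the debugger is also available programmatically, which can be handy outside a debugging session. A short sketch you could add temporarily to main after df is defined:

```scala
// Tree-formatted schema, equivalent to evaluating df.schema in the debugger.
df.printSchema()

// Just the column names, as a comma-separated line.
println(df.schema.fieldNames.mkString(", "))
```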
Next steps
To learn more about Databricks Connect, see the Databricks Connect reference. This reference article includes guidance on the following topics:
Supported Databricks authentication types in addition to Databricks personal access token authentication.
How to use IDEs other than IntelliJ IDEA, such as Visual Studio Code.
How to migrate from Databricks Connect for Databricks Runtime 12.2 LTS and below to Databricks Connect for Databricks Runtime 13.0 and above.
How to use Databricks Connect to access Databricks Utilities.
Troubleshooting information.
Limitations of Databricks Connect.