This article provides a guide to developing notebooks and jobs in Databricks using the Scala language. The first section provides links to tutorials for common workflows and tasks. The second section provides links to APIs, libraries, and key tools.
A basic workflow for getting started is:
Import code and run it using an interactive Databricks notebook: Either import your own code from files or Git repos or try a tutorial listed below.
Run your code on a cluster: Either create a cluster of your own or ensure that you have permissions to use a shared cluster. Attach your notebook to the cluster and run the notebook.
Beyond this, you can branch out into more specific topics:
Work with larger data sets using Apache Spark
Automate your workload as a job
The tutorials below provide example code and notebooks to learn about common workflows. See Import a notebook for instructions on importing notebook examples into your workspace.
Tutorial: Delta Lake provides Scala examples.
Quickstart Java and Scala helps you learn the basics of tracking machine learning training runs using MLflow in Scala.
Use XGBoost on Databricks provides a Scala example.
The subsections below list key features and tips to help you begin developing in Databricks with Scala.
These links provide an introduction to and reference for the Apache Spark Scala API.
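As a small illustration of the Spark Scala DataFrame API, the following sketch builds a DataFrame from in-memory rows, filters it, and aggregates it. The column names and sample values are invented for illustration. The snippet assumes it runs in a Databricks Scala notebook cell attached to a cluster, where `SparkSession.builder().getOrCreate()` returns the session that the notebook already provides.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col}

// Assumption: running in a notebook cell, so getOrCreate() returns the existing session.
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Hypothetical sample data: (region, fare).
val trips = Seq(
  ("north", 12.5),
  ("south", 7.0),
  ("north", 9.5)
).toDF("region", "fare")

// Keep fares above 8.0, then compute the average fare per region.
val avgFare = trips
  .filter(col("fare") > 8.0)
  .groupBy("region")
  .agg(avg("fare").as("avg_fare"))
  .orderBy("region")

avgFare.show()
```

The same transformations apply unchanged to DataFrames read from tables or files; only the source of `trips` would differ.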
Databricks notebooks support Scala. These notebooks provide functionality similar to that of Jupyter, but with additions such as built-in visualizations using big data, Apache Spark integrations for debugging and performance monitoring, and MLflow integrations for tracking machine learning experiments. Get started by importing a notebook. Once you have access to a cluster, you can attach a notebook to the cluster and run the notebook.
To completely reset the state of your notebook, it can be useful to restart the kernel. For Jupyter users, the “restart kernel” option in Jupyter corresponds to detaching and re-attaching a notebook in Databricks. To restart the kernel in a notebook, click the compute selector in the notebook toolbar and hover over the attached cluster or SQL warehouse in the list to display a side menu. Select Detach & re-attach. This detaches the notebook from your cluster and reattaches it, which restarts the process.
Databricks Repos allows users to synchronize notebooks and other files with Git repositories. Databricks Repos helps with code versioning and collaboration, and it can simplify importing a full repository of code into Databricks, viewing past notebook versions, and integrating with IDE development. Get started by cloning a remote Git repository. You can then open or create notebooks in the repository clone, attach the notebook to a cluster, and run the notebook.
Databricks Compute provides compute management for clusters of any size: from single node clusters up to large clusters. You can customize cluster hardware and libraries according to your needs. Data scientists generally begin work either by creating a cluster or using an existing shared cluster. Once you have access to a cluster, you can attach a notebook to the cluster or run a job on the cluster.
For small workloads that require only a single node, data scientists can use Single node compute for cost savings.
For detailed tips, see Best practices: Cluster configuration.
Administrators can set up cluster policies to simplify and guide cluster creation.
Databricks clusters use a Databricks Runtime, which provides many popular libraries out-of-the-box, including Apache Spark, Delta Lake, and more. You can also install additional third-party or custom libraries to use with notebooks and jobs.
Start with the default libraries in the Databricks Runtime. For full lists of pre-installed libraries, see Databricks Runtime release notes versions and compatibility.
You can also install Scala libraries in a cluster.
For more details, see Libraries.
Databricks Scala notebooks have built-in support for many types of visualizations. You can also use legacy visualizations.
This section describes features that support interoperability between Scala and SQL.
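One common interoperability pattern, sketched here under the assumption that the code runs in a Databricks Scala notebook cell, is to register a DataFrame as a temporary view and then query it with SQL from Scala. The view name `orders_view`, the column names, and the sample rows are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: running in a notebook cell, so getOrCreate() returns the existing session.
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Hypothetical sample data: (order_id, item, quantity).
val orders = Seq(
  (1, "widget", 3),
  (2, "gadget", 1),
  (3, "widget", 2)
).toDF("order_id", "item", "quantity")

// Expose the DataFrame to SQL under a session-scoped name.
orders.createOrReplaceTempView("orders_view")

// Run SQL from Scala; the result is an ordinary DataFrame.
val totals = spark.sql(
  """SELECT item, SUM(quantity) AS total_quantity
    |FROM orders_view
    |GROUP BY item
    |ORDER BY item""".stripMargin)

totals.show()
```

Because the view is scoped to the Spark session, a `%sql` cell in the same notebook can query `orders_view` as well.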
You can automate Scala workloads as scheduled or triggered jobs in Databricks. Jobs can run notebooks and JARs.
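For a JAR job, the entry point is typically an object with a `main` method. The sketch below is a minimal, hypothetical example: the object name `RowCountJob` and the single table-name parameter are assumptions made for illustration, not part of any Databricks API. It obtains the Spark session with `SparkSession.builder().getOrCreate()` so that it reuses the session the cluster provides.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical entry point for a JAR job; the name and argument layout are illustrative.
object RowCountJob {

  // Reads the table name from the job parameters.
  // Kept free of Spark so that it can be checked without a cluster.
  def parseTableName(args: Array[String]): String =
    args.headOption.getOrElse(
      throw new IllegalArgumentException("usage: RowCountJob <table_name>"))

  def main(args: Array[String]): Unit = {
    val tableName = parseTableName(args)

    // Reuse the session provided by the cluster instead of creating a new context.
    val spark = SparkSession.builder().getOrCreate()

    val rowCount = spark.read.table(tableName).count()
    println(s"$tableName contains $rowCount rows")

    // The session is deliberately not stopped here, because it is shared with the cluster.
  }
}
```

After packaging this object into a JAR, the job task would name `RowCountJob` as its main class and pass the table name as a parameter.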
In addition to developing Scala code within Databricks notebooks, you can develop externally using integrated development environments (IDEs) such as IntelliJ IDEA. To synchronize work between external development environments and Databricks, there are several options:
Code: You can synchronize code using Git. See Git integration with Databricks Repos.
Libraries and jobs: You can create libraries externally and upload them to Databricks. Those libraries may be imported within Databricks notebooks, or they can be used to create jobs. See Libraries and Create and run Databricks Jobs.
Remote machine execution: You can run code from your local IDE for interactive development and testing. The IDE can communicate with Databricks to execute large computations on Databricks clusters. For example, you can use IntelliJ IDEA with Databricks Connect.
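The remote execution option can be sketched as follows. This example assumes that the Databricks Connect library for Scala is on the project classpath, that authentication to a workspace and cluster is already configured in the local environment, and that the `samples.nyctaxi.trips` sample table is available in the workspace. The object name `ConnectSketch` is hypothetical. Check the builder calls against the Databricks Connect version you use.

```scala
import com.databricks.connect.DatabricksSession

// Hypothetical local application that runs Spark operations on a remote Databricks cluster.
object ConnectSketch {

  // Number of rows to display; an arbitrary choice for this sketch.
  val RowLimit = 5

  def main(args: Array[String]): Unit = {
    // Builds a Spark session backed by the remote cluster,
    // using connection settings taken from the local configuration.
    val spark = DatabricksSession.builder().remote().getOrCreate()

    // The read and the limit execute on the cluster; only the displayed rows return locally.
    val trips = spark.read.table("samples.nyctaxi.trips")
    trips.limit(RowLimit).show()
  }
}
```

Running `ConnectSketch` from IntelliJ IDEA then behaves like a local program, while the DataFrame operations are executed on the Databricks cluster.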
Databricks provides a set of SDKs that support automation and integration with external tooling. You can use the Databricks SDKs to manage resources like clusters and libraries, code and other workspace objects, workloads and jobs, and more. See the Databricks SDKs.
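From Scala, one way to use an SDK is to call the Databricks SDK for Java, since Scala runs on the JVM. The following sketch lists the clusters in a workspace. It assumes that the Databricks SDK for Java is on the classpath and that authentication is already configured in the local environment. The object name `ClusterListSketch` and the helper `formatCluster` are hypothetical, and the SDK calls should be checked against the SDK version you use.

```scala
import com.databricks.sdk.WorkspaceClient
import com.databricks.sdk.service.compute.ListClustersRequest

// Hypothetical example that prints the name and ID of each cluster in the workspace.
object ClusterListSketch {

  // Formats one line of output; kept separate so that it can be checked without a workspace.
  def formatCluster(name: String, id: String): String = s"$name ($id)"

  def main(args: Array[String]): Unit = {
    // Picks up credentials from the environment or a local configuration profile.
    val workspace = new WorkspaceClient()

    // Iterates over the clusters returned by the workspace.
    workspace.clusters().list(new ListClustersRequest()).forEach { cluster =>
      println(formatCluster(cluster.getClusterName, cluster.getClusterId))
    }
  }
}
```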
For more information on IDEs, developer tools, and SDKs, see Developer tools and guidance.
The Databricks Academy offers self-paced and instructor-led courses on many topics.