Databricks for Scala developers

This article provides a guide to developing notebooks and jobs in Databricks using the Scala language. The first section provides links to tutorials for common workflows and tasks. The second section provides links to APIs, libraries, and key tools.

A basic workflow for getting started is:

Beyond this, you can branch out into more specific topics:

Tutorials

The tutorials below provide example code and notebooks to learn about common workflows. See Import a notebook for instructions on importing notebook examples into your workspace.

Reference

The subsections below list key features and tips to help you begin developing in Databricks with Scala.

Manage code with notebooks and Databricks Git folders

Databricks notebooks support Scala. These notebooks provide functionality similar to that of Jupyter, but with additions such as built-in visualizations using big data, Apache Spark integrations for debugging and performance monitoring, and MLflow integrations for tracking machine learning experiments. Get started by importing a notebook. Once you have access to a cluster, you can attach a notebook to the cluster and run the notebook.

Tip

To reset the state of your notebook, restart the kernel. For Jupyter users, the “restart kernel” option in Jupyter corresponds to detaching and reattaching a notebook in Databricks. To restart the kernel in a notebook, click the compute selector in the notebook toolbar and hover over the attached cluster or SQL warehouse in the list to display a side menu. Select Detach & re-attach. This detaches the notebook from your cluster and reattaches it, which restarts the process.

Databricks Git folders allow users to synchronize notebooks and other files with Git repositories. Git folders help with code versioning and collaboration, and they can simplify importing a full repository of code into Databricks, viewing past notebook versions, and integrating with IDE development. Get started by cloning a remote Git repository. You can then open or create notebooks in the repository clone, attach the notebook to a cluster, and run the notebook.

Clusters and libraries

Databricks compute provides management for clusters of any size, from single-node clusters up to large clusters. You can customize cluster hardware and libraries according to your needs. Data scientists generally begin work either by creating a cluster or by using an existing shared cluster. Once you have access to a cluster, you can attach a notebook to the cluster or run a job on the cluster.

Databricks clusters use a Databricks Runtime, which provides many popular libraries out of the box, including Apache Spark and Delta Lake. You can also install additional third-party or custom libraries to use with notebooks and jobs.
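As an illustration of building a custom library, the following is a minimal build.sbt sketch. The project name and all version numbers are assumptions and should be matched to the Scala and Spark versions of the Databricks Runtime on your cluster. Spark is marked as provided because the runtime already supplies it, so it does not need to be bundled into the JAR.

```scala
// build.sbt (hypothetical project; versions are assumptions, align them
// with the Scala and Spark versions of your Databricks Runtime)
name := "my-custom-library"
version := "0.1.0"
scalaVersion := "2.12.18"

// Compile against Spark, but do not package it: the cluster provides it.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.0" % "provided"
```

Running sbt package with a definition like this produces a JAR that can then be uploaded and installed as a library.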

Visualizations

Databricks Scala notebooks have built-in support for many types of visualizations. You can also use legacy visualizations:

Interoperability

This section describes features that support interoperability between Scala and SQL.
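As a small illustration of moving between Scala and SQL, the sketch below registers a DataFrame as a temporary view and queries it with spark.sql. The object name, the view name orders, and the sample rows are all hypothetical. In a Databricks notebook a session named spark already exists; the local master setting is only an assumption that lets the sketch run outside Databricks.

```scala
import org.apache.spark.sql.SparkSession

object SqlInterop {
  // In a Databricks notebook, getOrCreate() returns the existing session.
  // The local master is only a fallback for running this sketch elsewhere.
  def session(): SparkSession =
    SparkSession.builder().master("local[1]").appName("scala-sql-interop").getOrCreate()

  // Builds a small DataFrame in Scala, exposes it to SQL as a temporary
  // view, and reads the aggregated result back into Scala values.
  def customerTotals(spark: SparkSession): List[(String, Double)] = {
    import spark.implicits._

    // Hypothetical sample data.
    val orders = Seq(("alice", 30.0), ("bob", 12.5), ("alice", 7.5))
      .toDF("customer", "amount")

    // After this call, SQL statements can refer to the data as `orders`.
    orders.createOrReplaceTempView("orders")

    spark.sql(
      """SELECT customer, SUM(amount) AS total
        |FROM orders
        |GROUP BY customer
        |ORDER BY customer""".stripMargin
    ).collect().map(row => (row.getString(0), row.getDouble(1))).toList
  }
}
```

In a notebook you could call SqlInterop.customerTotals(spark) with the predefined session. Because the view is registered on that session, a SQL cell in the same notebook should also be able to query orders directly.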

Jobs

You can automate Scala workloads as scheduled or triggered jobs in Databricks. Jobs can run notebooks and JARs.
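For a JAR job, the entry point is a main class whose main method receives the job parameters as an array of strings. The following is a minimal sketch of such an entry point. The object name, the JobConfig fields, and the argument layout are hypothetical, and the Spark work itself is left as a comment so the parameter handling stays in focus.

```scala
object NightlyReportJob {
  // Hypothetical settings for this illustrative job.
  final case class JobConfig(inputTable: String, outputTable: String, dryRun: Boolean)

  // Expects two positional arguments plus an optional --dry-run flag.
  def parseArgs(args: Array[String]): Either[String, JobConfig] = {
    val (flags, positional) = args.toList.partition(_.startsWith("--"))
    positional match {
      case input :: output :: Nil =>
        flags.filterNot(_ == "--dry-run") match {
          case Nil => Right(JobConfig(input, output, flags.contains("--dry-run")))
          case unknown =>
            val names = unknown.mkString(", ")
            Left(s"Unknown flags: $names")
        }
      case _ => Left("Usage: <inputTable> <outputTable> [--dry-run]")
    }
  }

  def main(args: Array[String]): Unit =
    parseArgs(args) match {
      case Left(error) =>
        // Throwing makes a bad invocation fail visibly instead of passing silently.
        throw new IllegalArgumentException(error)
      case Right(config) =>
        // The Spark work would go here, typically starting from
        // SparkSession.builder().getOrCreate() to reuse the cluster's session.
        println(s"Running with $config")
    }
}
```

Keeping argument parsing in a pure function like parseArgs makes it possible to unit test the job's configuration logic without a cluster.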

IDEs, developer tools, and SDKs

In addition to developing Scala code within Databricks notebooks, you can develop externally using integrated development environments (IDEs) such as IntelliJ IDEA. To synchronize work between external development environments and Databricks, there are several options:

  • Code: You can synchronize code using Git. See Git integration for Databricks Git folders.

  • Libraries and jobs: You can create libraries externally and upload them to Databricks. Those libraries may be imported within Databricks notebooks, or they can be used to create jobs. See Libraries and Schedule and orchestrate workflows.

  • Remote machine execution: You can run code from your local IDE for interactive development and testing. The IDE can communicate with Databricks to execute large computations on Databricks clusters. For example, you can use IntelliJ IDEA with Databricks Connect.
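As an illustration of the remote execution option, the following is a minimal sketch of a local Scala program that uses Databricks Connect. It assumes the com.databricks databricks-connect dependency is on the classpath and that workspace authentication and the target cluster are already configured in your environment. The table name is an assumption; substitute any table you can read.

```scala
import com.databricks.connect.DatabricksSession

object Main {
  def main(args: Array[String]): Unit = {
    // Connects to the remote cluster using the connection details
    // configured in the local environment (assumed to exist).
    val spark = DatabricksSession.builder().remote().getOrCreate()

    // The query executes on the Databricks cluster; only the
    // first few rows are returned to the local process for display.
    val trips = spark.read.table("samples.nyctaxi.trips") // assumed sample table
    trips.limit(5).show()
  }
}
```

Running this from IntelliJ IDEA lets you set breakpoints and step through the driver-side code locally while the computation itself happens on the cluster.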

Databricks provides a set of SDKs that support automation and integration with external tooling. You can use the Databricks SDKs to manage resources such as clusters and libraries, code and other workspace objects, workloads and jobs, and more. See the Databricks SDKs.

For more information on IDEs, developer tools, and SDKs, see Developer tools.

Additional resources