How to Correctly Update a Maven Library in Databricks

Problem

Suppose you make a minor update to a library in a Maven repository and, because the change is small and only for testing, you do not change the version number. When you attach the library to your cluster again, your code changes are not included in the library.

Cause

One strength of Databricks is the ability to install third-party or custom libraries, such as from a Maven repository. However, when a library is updated in the repository, Databricks does not automatically update the corresponding library on the cluster.

When you request Databricks to download a library in order to attach it to a cluster, the following process occurs:

  1. In Databricks, you request a library from a Maven repository.
  2. Databricks checks the local cache for the library, and if it is not present, downloads the library from the Maven repository to a local cache.
  3. Databricks then copies the library to DBFS (/FileStore/jars/maven/).
  4. Upon subsequent requests for the library, Databricks uses the file that has already been copied to DBFS, and does not download a new copy.
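
The steps above can be sketched as a small, self-contained Python model. This is only an illustration of the lookup order described in this article, not Databricks code: the function name resolve_library, the dictionaries standing in for the local cache and for DBFS, and the download callback are all hypothetical.

```python
from typing import Callable, Dict


def resolve_library(
    coordinate: str,
    dbfs_store: Dict[str, bytes],
    local_cache: Dict[str, bytes],
    download: Callable[[str], bytes],
) -> bytes:
    """Toy model of the lookup order described above (all names are hypothetical)."""
    # Step 4: a copy that already exists in the DBFS stand-in is reused as-is.
    if coordinate in dbfs_store:
        return dbfs_store[coordinate]

    # Step 2: download from the repository only if the local cache has no copy.
    if coordinate not in local_cache:
        local_cache[coordinate] = download(coordinate)

    # Step 3: copy the cached file into the DBFS stand-in.
    dbfs_store[coordinate] = local_cache[coordinate]
    return dbfs_store[coordinate]
```

In this model, the lookup key is the library coordinate, including its version. Changing the artifact in the repository without changing the coordinate never reaches the download step, which is the behavior described in the Problem section.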

Solution

To ensure that an updated version of a library (or a library that you have customized) is downloaded to a cluster, increment the build number or version number of the artifact in some way. For example, you can change libA_v1.0.0-SNAPSHOT to libA_v1.0.1-SNAPSHOT. Because the new version has not yet been copied to DBFS, Databricks downloads the new library, and you can then attach it to your cluster.
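
If you bump the version as part of a script, a small helper like the following may be useful. It is a minimal sketch that assumes the version string has the form MAJOR.MINOR.PATCH with an optional qualifier such as -SNAPSHOT; the function name bump_patch is hypothetical and not part of Maven or Databricks.

```python
import re


def bump_patch(version: str) -> str:
    """Increment the patch number of a MAJOR.MINOR.PATCH[-QUALIFIER] version string."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(-.+)?", version)
    if match is None:
        raise ValueError(f"Unsupported version format: {version!r}")
    major, minor, patch, qualifier = match.groups()
    # Keep the qualifier (for example "-SNAPSHOT") unchanged, if there is one.
    return f"{major}.{minor}.{int(patch) + 1}{qualifier or ''}"
```

For example, bump_patch("1.0.0-SNAPSHOT") returns "1.0.1-SNAPSHOT", matching the change from libA_v1.0.0-SNAPSHOT to libA_v1.0.1-SNAPSHOT described above. You still need to publish the artifact under the new version in your repository before requesting it from Databricks.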