Databricks Utilities
Databricks Utilities (DBUtils) make it easy to perform powerful combinations of tasks. You can use the utilities to work with object storage efficiently, to chain and parameterize notebooks, and to work with secrets. DBUtils are not supported outside of notebooks.
All dbutils utilities are available in Python, R, and Scala notebooks. File system utilities are not available in R notebooks; however, you can use a language magic command to invoke those dbutils methods in R and SQL notebooks. For example, to list the Databricks datasets DBFS folder in an R or SQL notebook, run the command:
dbutils.fs.ls("/databricks-datasets")
Alternatively, you can use %fs:
%fs ls /databricks-datasets
File system utilities
The file system utilities access Databricks File System (DBFS), making it easier to use Databricks as a file system. Learn more by running:
dbutils.fs.help()
cp(from: String, to: String, recurse: boolean = false): boolean -> Copies a file or directory, possibly across FileSystems
head(file: String, maxBytes: int = 65536): String -> Returns up to the first 'maxBytes' bytes of the given file as a String encoded in UTF-8
ls(dir: String): Seq -> Lists the contents of a directory
mkdirs(dir: String): boolean -> Creates the given directory if it does not exist, also creating any necessary parent directories
mv(from: String, to: String, recurse: boolean = false): boolean -> Moves a file or directory, possibly across FileSystems
put(file: String, contents: String, overwrite: boolean = false): boolean -> Writes the given String out to a file, encoded in UTF-8
rm(dir: String, recurse: boolean = false): boolean -> Removes a file or directory
mount(source: String, mountPoint: String, encryptionType: String = "", owner: String = null, extraConfigs: Map = Map.empty[String, String]): boolean -> Mounts the given source directory into DBFS at the given mount point
mounts: Seq -> Displays information about what is mounted within DBFS
refreshMounts: boolean -> Forces all machines in this cluster to refresh their mount cache, ensuring they receive the most recent information
unmount(mountPoint: String): boolean -> Deletes a DBFS mount point
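For example, here is a minimal sketch of mounting an S3 bucket and listing its contents; the bucket name and mount point below are hypothetical, and your cluster must have access to the bucket:
dbutils.fs.mount("s3a://my-example-bucket", "/mnt/my-example-mount") # hypothetical bucket and mount point
display(dbutils.fs.ls("/mnt/my-example-mount")) # list the mounted contents
dbutils.fs.unmount("/mnt/my-example-mount") # remove the mount point when it is no longer needed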
dbutils.fs.ls Command
The sequence returned by the ls command contains the following attributes:
| Attribute | Type | Description |
|---|---|---|
| path | string | The path of the file or directory. |
| name | string | The name of the file or directory. |
| isDir() | boolean | True if the path is a directory. |
| size | long/int64 | The length of the file in bytes, or zero if the path is a directory. |
Note
You can get detailed information about each command by using help, for example: dbutils.fs.help("ls")
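Here is a short sketch of a few basic file operations in a Python notebook; the paths below are hypothetical:
dbutils.fs.mkdirs("/tmp/my-example-dir") # create a directory (and any missing parents)
dbutils.fs.put("/tmp/my-example-dir/hello.txt", "Hello, DBFS!", True) # True overwrites an existing file
print(dbutils.fs.head("/tmp/my-example-dir/hello.txt")) # print up to the first 65536 bytes of the file
dbutils.fs.rm("/tmp/my-example-dir", True) # remove the directory recursively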
Notebook workflow utilities
Notebook workflows allow you to chain together notebooks and act on their results. See Notebook workflows. Learn more by running:
dbutils.notebook.help()
exit(value: String): void -> This method lets you exit a notebook with a value
run(path: String, timeoutSeconds: int, arguments: Map): String -> This method runs a notebook and returns its exit value.
Note
The maximum length of the string value returned from run is 5 MB. See Runs get output.
Note
You can get detailed information about each command by using help, for example: dbutils.notebook.help("exit")
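Here is a minimal sketch of chaining two notebooks; the child notebook path and parameter name below are hypothetical:
# In the child notebook (for example, /path/to/ChildNotebook), return a value to the caller:
dbutils.notebook.exit("some result")
# In the calling notebook, run the child with a 60-second timeout and an arguments map,
# then read the value passed to dbutils.notebook.exit:
result = dbutils.notebook.run("/path/to/ChildNotebook", 60, {"input_param": "some value"})
print(result)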
Widget utilities
Widgets allow you to parameterize notebooks. See Widgets. Learn more by running:
dbutils.widgets.help()
combobox(name: String, defaultValue: String, choices: Seq, label: String): void -> Creates a combobox input widget with a given name, default value and choices
dropdown(name: String, defaultValue: String, choices: Seq, label: String): void -> Creates a dropdown input widget with a given name, default value and choices
get(name: String): String -> Retrieves current value of an input widget
getArgument(name: String, optional: String): String -> (DEPRECATED) Equivalent to get
multiselect(name: String, defaultValue: String, choices: Seq, label: String): void -> Creates a multiselect input widget with a given name, default value and choices
remove(name: String): void -> Removes an input widget from the notebook
removeAll: void -> Removes all widgets in the notebook
text(name: String, defaultValue: String, label: String): void -> Creates a text input widget with a given name and default value
Note
You can get detailed information about each command by using help, for example: dbutils.widgets.help("combobox")
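Here is a short sketch that creates a text widget, reads its current value, and removes it; the widget name, default value, and label below are only illustrative:
dbutils.widgets.text("database", "default", "Database") # name, default value, label
current_db = dbutils.widgets.get("database") # returns the widget's current value as a string
print(current_db)
dbutils.widgets.remove("database") # remove the widget when it is no longer needed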
Secrets utilities
Secrets allow you to store and access sensitive credential information without making them visible in notebooks. See Secret management and Use the secrets in a notebook. Learn more by running:
dbutils.secrets.help()
get(scope: String, key: String): String -> Gets the string representation of a secret value with scope and key
getBytes(scope: String, key: String): byte[] -> Gets the bytes representation of a secret value with scope and key
list(scope: String): Seq -> Lists secret metadata for secrets within a scope
listScopes: Seq -> Lists secret scopes
Note
You can get detailed information about each command by using help, for example: dbutils.secrets.help("get")
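Here is a minimal sketch that lists the available scopes and reads one secret; the scope and key names below are hypothetical and must already exist in your workspace:
dbutils.secrets.listScopes() # list the secret scopes you can access
dbutils.secrets.list("my-example-scope") # list secret metadata within a scope
password = dbutils.secrets.get(scope="my-example-scope", key="my-example-key")
# The value is redacted if you print it, but you can pass it to, for example, a JDBC connection.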
Library utilities
Library utilities allow you to install Python libraries and create an environment scoped to a notebook session. The libraries are available both on the driver and on the executors, so you can reference them in UDFs. This enables:
- Library dependencies of a notebook to be organized within the notebook itself.
- Notebook users with different library dependencies to share a cluster without interference.
Detaching a notebook destroys this environment. However, you can recreate it by re-running the library install API commands in the notebook. See the restartPython API for how you can reset your notebook state without losing your environment.
Important
Library utilities are not available on Databricks Runtime ML or Databricks Runtime for Genomics. Instead, refer to Notebook-scoped Python libraries.
For Databricks Runtime 7.2 and above, Databricks recommends using %pip magic commands to install notebook-scoped libraries. See Notebook-scoped Python libraries.
Library utilities are enabled by default. Therefore, by default the Python environment for each notebook is isolated by using a separate Python executable that is created when the notebook is attached to the cluster and that inherits the default Python environment on the cluster. Libraries installed through an init script into the Databricks Python environment are still available. You can disable this feature by setting spark.databricks.libraryIsolation.enabled to false.
This API is compatible with the existing cluster-wide library installation through the UI and REST API. Libraries installed through this API have higher priority than cluster-wide libraries.
dbutils.library.help()
install(path: String): boolean -> Install the library within the current notebook session
installPyPI(pypiPackage: String, version: String = "", repo: String = "", extras: String = ""): boolean -> Install the PyPI library within the current notebook session
list: List -> List the isolated libraries added for the current notebook session via dbutils
restartPython: void -> Restart python process for the current notebook session
Note
You can get detailed information about each command by using help, for example: dbutils.library.help("install")
Examples
Install a PyPI library in a notebook. version, repo, and extras are optional. Use the extras argument to specify the Extras feature (extra requirements).
dbutils.library.installPyPI("pypipackage", version="version", repo="repo", extras="extras")
dbutils.library.restartPython() # Removes Python state, but some libraries might not work without calling this function
Important
The version and extras keys cannot be part of the PyPI package string. For example, dbutils.library.installPyPI("azureml-sdk[databricks]==1.19.0") is not valid. Use the version and extras arguments to specify the version and extras information as follows:
dbutils.library.installPyPI("azureml-sdk", version="1.19.0", extras="databricks")
dbutils.library.restartPython() # Removes Python state, but some libraries might not work without calling this function
Note
When replacing dbutils.library.installPyPI commands with %pip commands, the Python interpreter is automatically restarted. You can run the install command as follows:
%pip install azureml-sdk[databricks]==1.19.0
Specify your library requirements in one notebook and install them through %run in the other.
Define the libraries to install in a notebook called InstallDependencies.
dbutils.library.installPyPI("torch")
dbutils.library.installPyPI("scikit-learn", version="1.19.1")
dbutils.library.installPyPI("azureml-sdk", extras="databricks")
dbutils.library.restartPython() # Removes Python state, but some libraries might not work without calling this function
Install them in the notebook that needs those dependencies.
%run /path/to/InstallDependencies # Install the dependencies in the first cell
import torch
from sklearn.linear_model import LinearRegression
import azureml
# do the actual work
List the libraries installed in a notebook.
dbutils.library.list()
Note
The equivalent of this command using %pip is:
%pip freeze
Reset the Python notebook state while maintaining the environment. This API is available only in Python notebooks. This can be used to:
Reload libraries that Databricks preinstalled with a different version. For example:
dbutils.library.installPyPI("numpy", version="1.15.4")
dbutils.library.restartPython()
# Make sure you start using the library in another cell.
import numpy
Install libraries like tensorflow that need to be loaded on process start up. For example:
dbutils.library.installPyPI("tensorflow")
dbutils.library.restartPython()
# Use the library in another cell.
import tensorflow
Install a .egg or .whl library in a notebook.
Important
We recommend that you put all your library install commands in the first cell of your notebook and call restartPython at the end of that cell. The Python notebook state is reset after running restartPython; the notebook loses all state, including but not limited to local variables, imported libraries, and other ephemeral states. Therefore, we recommend that you install libraries and reset the notebook state in the first notebook cell.
The accepted library sources are dbfs and s3.
dbutils.library.install("dbfs:/path/to/your/library.egg")
dbutils.library.restartPython() # Removes Python state, but some libraries might not work without calling this function
dbutils.library.install("dbfs:/path/to/your/library.whl")
dbutils.library.restartPython() # Removes Python state, but some libraries might not work without calling this function
Note
You can directly install custom wheel files using %pip. In the following example we assume you have uploaded your library wheel file to DBFS:
%pip install /dbfs/path/to/your/library.whl
Egg files are not supported by pip, and wheel is considered the standard for build and binary packaging for Python. See Wheel vs Egg for more details. However, if you want to use an egg file in a way that's compatible with %pip, you can use the following workaround:
# This step is only needed if no %pip commands have been executed yet.
# It will trigger setting up the isolated notebook environment.
%pip install <any-lib> # This doesn't need to be a real library; for example, "%pip install foo" would work
import sys
# Assuming the preceding step was completed, the following command
# adds the egg file to the current notebook environment.
sys.path.append("/local/path/to/library.egg")
Credentials utilities
Credentials utilities allow you to interact with credentials within notebooks. Only usable on clusters with credential passthrough enabled.
dbutils.credentials.help()
assumeRole(role: String): boolean -> Sets the role ARN to assume when looking for credentials to authenticate with S3
showCurrentRole: List -> Shows the currently set role
showRoles: List -> Shows the set of possible assumed roles
List your roles in a notebook:
dbutils.credentials.showRoles()
To select a specific role, run:
dbutils.credentials.assumeRole("arn:aws:iam::xxxxxxxx:role/<my-role>")
To show the currently selected role, run:
dbutils.credentials.showCurrentRole()
To read data with a role, run:
dbutils.credentials.assumeRole("arn:aws:iam::xxxxxxxx:role/bucketARole")
sc.textFile("s3a://bucketA/Filename")
Databricks Utilities API library
Note
You cannot use this library to run pre-deployment tests if you are deploying to a cluster running Databricks Runtime 7.0 or above, because there is no version of the library that supports Scala 2.12.
To accelerate application development, it can be helpful to compile, build, and test applications before you deploy them as production jobs. To enable you to compile against Databricks Utilities, Databricks provides the dbutils-api library. You can download the dbutils-api library or include it by adding a dependency to your build file:
SBT
libraryDependencies += "com.databricks" % "dbutils-api_2.11" % "0.0.4"
Maven
<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>dbutils-api_2.11</artifactId>
  <version>0.0.4</version>
</dependency>
Gradle
compile 'com.databricks:dbutils-api_2.11:0.0.4'
Once you build your application against this library, you can deploy the application.
Important
The dbutils-api library allows you to locally compile an application that uses dbutils, but not to run it. To run the application, you must deploy it in Databricks.
Express the artifact’s Scala version with %%
If you express the artifact version as groupID %% artifactID % revision instead of groupID % artifactID % revision (the difference is the double %% after the groupID), SBT adds your project's Scala version to the artifact name.
Example
Suppose the scalaVersion for your build is 2.9.1. You could write the artifact version with % as follows:
val appDependencies = Seq(
"org.scala-tools" % "scala-stm_2.9.1" % "0.3"
)
The following, using %%, is identical:
val appDependencies = Seq(
"org.scala-tools" %% "scala-stm" % "0.3"
)
Example projects
Here is an example archive containing minimal example projects that show you how to compile using the dbutils-api library for 3 common build tools:
- sbt: sbt package
- Maven: mvn install
- Gradle: gradle build
These commands create output JARs in the locations:
- sbt: target/scala-2.11/dbutils-api-example_2.11-0.0.1-SNAPSHOT.jar
- Maven: target/dbutils-api-example-0.0.1-SNAPSHOT.jar
- Gradle: build/examples/dbutils-api-example-0.0.1-SNAPSHOT.jar
You can attach this JAR to your cluster as a library, restart the cluster, and then run:
example.Test()
This statement creates a text input widget with the label Hello: and the initial value World.
You can use all the other dbutils APIs the same way.
To test an application that uses the dbutils object outside of Databricks, you can mock up the dbutils object by calling:
com.databricks.dbutils_v1.DBUtilsHolder.dbutils0.set(
new com.databricks.dbutils_v1.DBUtilsV1{
...
}
)
Substitute your own DBUtilsV1 instance in which you implement the interface methods however you like, for example, providing a local filesystem mockup for dbutils.fs.