Init Scripts

An init script is a script that runs on Spark cluster nodes during startup. This allows for significant customization of clusters.

About Init Scripts

At a high level, init scripts are shell scripts that run during setup on each cluster node, before the Spark driver or worker JVM starts. For example, an init script can:

  • Modify the JVM system classpath in special cases such as using JDBC Drivers
  • Set system properties and environment variables used by the JVM
  • Modify the default Spark conf parameters explicitly

Init scripts apply both to manually created clusters and to clusters created by jobs. Create the script once and it runs every time the cluster launches.

Types of Init Scripts

There are two kinds of init scripts in Databricks: global and cluster-specific. Both are created and managed through the Databricks File System (DBFS).

Global

Global init scripts run on every cluster in your account at launch time. They reside in dbfs:/databricks/init/.

Cluster Specific

Cluster-specific scripts apply to a single cluster, identified by its name. They reside in a subdirectory of the init scripts directory whose name matches the cluster name exactly.

For example, to specify init scripts only for the cluster PostgreSQL, create the folder dbfs:/databricks/init/PostgreSQL and copy into it all shell scripts that should run on that cluster.

Init Script Output

Databricks saves all init script output to a file in DBFS that looks like dbfs:/databricks/init/output/<CLUSTER_NAME>/<DATE_TIMESTAMP>/<SCRIPT_NAME>_<NODE_IP>.log. For example, if a cluster PostgreSQL has two Spark nodes with IPs 10.0.0.1 and 10.0.0.2, and the init script directory has a script called installpostgres.sh, then there will be two output files at the following paths:

  • dbfs:/databricks/init/output/PostgreSQL/2016-01-01_12-00-00/installpostgres.sh_10.0.0.1.log
  • dbfs:/databricks/init/output/PostgreSQL/2016-01-01_12-00-00/installpostgres.sh_10.0.0.2.log
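
The path pattern above can be sketched as a small Python helper. The function name output_log_paths and its parameters are hypothetical, introduced only for illustration; the layout itself is taken from the pattern described above.

```python
def output_log_paths(cluster_name, timestamp, script_name, node_ips):
    """Return the expected init script log paths, one per node.

    Hypothetical helper: it only formats strings following the
    documented pattern and does not touch DBFS.
    """
    base = "dbfs:/databricks/init/output"
    return [
        f"{base}/{cluster_name}/{timestamp}/{script_name}_{ip}.log"
        for ip in node_ips
    ]


# Reproduce the two-node example from the text.
paths = output_log_paths(
    "PostgreSQL",
    "2016-01-01_12-00-00",
    "installpostgres.sh",
    ["10.0.0.1", "10.0.0.2"],
)
for path in paths:
    print(path)
```

Each node writes its own log, so a cluster with N nodes produces N files per script per launch.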

Important Notes

Keep the following in mind:

  • Any change to a script takes effect only after a cluster restart
  • To deactivate a script, remove it explicitly; it stops running at the next restart
  • Avoid spaces in cluster names, since the names are used in the script and output paths above
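
Because the cluster name becomes part of the script directory path, a name can be checked before any directory is created. The following is a minimal sketch; the names INIT_ROOT, has_whitespace, and cluster_init_dir are hypothetical, and rejecting whitespace outright is a stricter choice made for this sketch rather than a documented Databricks rule.

```python
INIT_ROOT = "dbfs:/databricks/init"  # init script root described above


def has_whitespace(cluster_name):
    """Return True if the name contains any whitespace character."""
    return any(ch.isspace() for ch in cluster_name)


def cluster_init_dir(cluster_name):
    """Return the cluster-specific init script directory for a name.

    Raises ValueError for empty names or names containing whitespace
    (an assumption of this sketch, see the lead-in).
    """
    if not cluster_name:
        raise ValueError("cluster name must not be empty")
    if has_whitespace(cluster_name):
        raise ValueError(f"cluster name contains whitespace: {cluster_name!r}")
    return f"{INIT_ROOT}/{cluster_name}/"


print(cluster_init_dir("PostgreSQL"))
```

In a notebook, the returned string could be passed to dbutils.fs.mkdirs.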

Setting Up Your First Init Script

Create dbfs:/databricks/init/ if it doesn’t exist.

dbutils.fs.mkdirs("dbfs:/databricks/init/")

List the existing global init scripts, if any.

display(dbutils.fs.ls("dbfs:/databricks/init/"))

Creating a Cluster Specific Init Script

Next, create an init script that downloads the PostgreSQL JDBC driver onto the nodes of a cluster named PostgreSQL. The script is written through the Databricks File System (DBFS), so first create the directory for that cluster.

dbutils.fs.mkdirs("dbfs:/databricks/init/PostgreSQL/")

Now we create the script.

# Write the script to DBFS; the final argument (True) overwrites an existing file.
# The shebang sits directly after the opening quotes so it is the first line of the script.
dbutils.fs.put("dbfs:/databricks/init/PostgreSQL/postgresql-install.sh", """#!/bin/bash
wget --quiet -O /mnt/driver-daemon/jars/postgresql-9.3-1101-jdbc4.jar http://central.maven.org/maven2/org/postgresql/postgresql/9.3-1101-jdbc4/postgresql-9.3-1101-jdbc4.jar
wget --quiet -O /mnt/jars/driver-daemon/postgresql-9.3-1101-jdbc4.jar http://central.maven.org/maven2/org/postgresql/postgresql/9.3-1101-jdbc4/postgresql-9.3-1101-jdbc4.jar
""", True)

Now let's confirm that the cluster-specific init script was created.

display(dbutils.fs.ls("dbfs:/databricks/init/PostgreSQL/postgresql-install.sh"))
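
The install script can also be assembled in plain Python before it is written to DBFS, which keeps the shebang on the first line and avoids repeating the download URL. This is only a sketch: the function build_install_script and the constant names are hypothetical, while the jar name, URL, and target directories are copied from the script in this section.

```python
JAR_NAME = "postgresql-9.3-1101-jdbc4.jar"
JAR_URL = (
    "http://central.maven.org/maven2/org/postgresql/postgresql/"
    "9.3-1101-jdbc4/" + JAR_NAME
)
# Target directories as used by the script in this section.
TARGET_DIRS = ["/mnt/driver-daemon/jars", "/mnt/jars/driver-daemon"]


def build_install_script(jar_url, jar_name, target_dirs):
    """Return shell script text that downloads one jar into each directory."""
    lines = ["#!/bin/bash"]
    for directory in target_dirs:
        lines.append(f"wget --quiet -O {directory}/{jar_name} {jar_url}")
    return "\n".join(lines) + "\n"


script = build_install_script(JAR_URL, JAR_NAME, TARGET_DIRS)
print(script)

# In a Databricks notebook, where dbutils is available, the text could be written with:
# dbutils.fs.put("dbfs:/databricks/init/PostgreSQL/postgresql-install.sh", script, True)
```

Printing the generated text first makes it easy to review before it runs on every node of the cluster.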

Creating a Global Init Script

Warning

Global init scripts run on every cluster at cluster launch time. Be careful about what you place in these init scripts.

Now let's create a script that simply appends a line to a file on each node's local disk.

# The shebang sits directly after the opening quotes so it is the first line of the script.
dbutils.fs.put("dbfs:/databricks/init/my-echo.sh", """#!/bin/bash

echo "hello" >> /hello.txt
""", True)

Now let’s ensure that it exists.

display(dbutils.fs.ls("dbfs:/databricks/init/"))

From now on, this script runs every time a cluster launches.

Deleting Init Scripts

To delete an init script, delete the file that you created in the process above. You can do this either from a notebook or through the DBFS API. If a global init script is preventing new clusters from starting, use the DBFS API to move or delete the init script, since a notebook needs a running cluster.

dbutils.fs.rm("dbfs:/databricks/init/my-echo.sh")
dbutils.fs.rm("dbfs:/databricks/init/PostgreSQL/postgresql-install.sh")
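
When no cluster can start, the deletion has to go through the DBFS API instead of a notebook. The sketch below builds such a request with the Python standard library only. It assumes the DBFS REST API 2.0 delete endpoint and bearer-token authentication; the host name, token, and the function name build_dbfs_delete_request are placeholders, so check the details against the API documentation for your workspace.

```python
import json
import urllib.request


def build_dbfs_delete_request(host, token, path):
    """Build (but do not send) a DBFS delete request for one file.

    Assumes the endpoint POST /api/2.0/dbfs/delete with a JSON body
    holding "path" and "recursive".
    """
    body = json.dumps({"path": path, "recursive": False}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/dbfs/delete",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder host and token, for illustration only.
req = build_dbfs_delete_request(
    "https://example.cloud.databricks.com",
    "YOUR_TOKEN",
    "/databricks/init/my-echo.sh",
)
print(req.get_method(), req.full_url)

# To actually send the request:
# urllib.request.urlopen(req)
```

Building the request separately from sending it makes it possible to inspect the URL and body before deleting anything.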