An init script is a script that runs on Spark cluster nodes during startup. This allows for significant customization of clusters.
About Init Scripts¶
At a high level, init scripts are shell scripts that run during setup for each cluster node before the Spark Driver/Worker JVM starts. Some examples include:
- Modify the JVM system classpath in special cases, such as when using JDBC drivers
- Set system properties and environment variables used by the JVM
- Modify the default Spark conf parameters explicitly
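As a concrete sketch of the last two items, here is a minimal bash init script that exports an environment variable and appends a Spark conf setting. The conf directory location is an assumption (it varies by runtime), so the script falls back to a temp directory and can be run outside a cluster:

```shell
#!/bin/bash
# Minimal init-script sketch: export an env var for the JVM and append a
# Spark conf setting. SPARK_CONF_DIR is an assumed location; on a real
# node it would point at the cluster's Spark conf directory.
SPARK_CONF_DIR="${SPARK_CONF_DIR:-$(mktemp -d)}"

echo "export MY_APP_ENV=production" >> "$SPARK_CONF_DIR/spark-env.sh"
echo "spark.sql.shuffle.partitions 64" >> "$SPARK_CONF_DIR/spark-defaults.conf"
```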
Init scripts apply to both manually created clusters as well as clusters created by jobs. Create the script once and it will run at cluster launch time.
Types of Init Scripts¶
There are two kinds of init scripts in Databricks. Both are created and managed from the Databricks File System (DBFS).
Global init scripts run on every cluster in your account at launch time. You can find these scripts in dbfs:/databricks/init/.
Cluster-specific init scripts are scoped to a single cluster, identified by the cluster’s name. They reside in a sub-directory of the init scripts directory whose name matches the cluster name.
For example, to specify init scripts only for the cluster PostgreSQL, create a folder at dbfs:/databricks/init/PostgreSQL and copy all shell scripts that should run on that cluster into that folder.
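The naming convention can be sketched as a small helper. `init_script_dir` is a hypothetical function (not part of any Databricks API) that builds the directory path from a cluster name:

```python
def init_script_dir(cluster_name: str) -> str:
    """Build the DBFS directory holding a cluster's init scripts.

    Hypothetical helper illustrating the convention above: scripts for
    cluster <name> live under dbfs:/databricks/init/<name>.
    """
    if " " in cluster_name:
        # The cluster name becomes a path segment, so spaces are best avoided.
        raise ValueError("avoid spaces in cluster names: %r" % cluster_name)
    return "dbfs:/databricks/init/" + cluster_name

print(init_script_dir("PostgreSQL"))  # dbfs:/databricks/init/PostgreSQL
```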
Init Script Output¶
Databricks saves all init script output to files in DBFS, one file per script per node.
For example, if a cluster PostgreSQL has two Spark nodes (one with IP 10.0.0.2) and the init script directory contains a script called installpostgres.sh, there will be two output files, one for each node.
It is important to note that:
- Any changes to a script require a cluster restart to take effect
- To deactivate a script, explicitly remove it; it will no longer run after the next restart
- Avoid spaces in cluster names, since they are used in the script and output paths above
Setting Up Your First Init Script¶
Create the base directory dbfs:/databricks/init/ if it doesn’t exist.
Display the list of existing global init scripts, if any.
Creating a Cluster Specific Init Script¶
Now we’re going to create a new init script for a cluster named PostgreSQL; the script will install the PostgreSQL JDBC driver on that cluster. To write this file we’re going to use the Databricks File System (DBFS).
Now we create the script.
dbutils.fs.put("/databricks/init/PostgreSQL/postgresql-install.sh", """
#!/bin/bash
wget --quiet -O /mnt/driver-daemon/jars/postgresql-9.3-1101-jdbc4.jar http://central.maven.org/maven2/org/postgresql/postgresql/9.3-1101-jdbc4/postgresql-9.3-1101-jdbc4.jar
wget --quiet -O /mnt/jars/driver-daemon/postgresql-9.3-1101-jdbc4.jar http://central.maven.org/maven2/org/postgresql/postgresql/9.3-1101-jdbc4/postgresql-9.3-1101-jdbc4.jar
""", True)
Now let’s confirm that we created the cluster-specific init script.
clusterName = "YOUR_CLUSTER_NAME"
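Building on the clusterName variable above, the directory to inspect can be constructed as below. The dbutils.fs.ls call is shown only as a comment because dbutils exists solely inside a Databricks notebook; the path construction is a sketch:

```python
clusterName = "PostgreSQL"  # substitute your cluster's name
script_dir = "dbfs:/databricks/init/%s/" % clusterName

# In a Databricks notebook you could then verify the script exists with:
#   display(dbutils.fs.ls(script_dir))
print(script_dir)  # dbfs:/databricks/init/PostgreSQL/
```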
Creating a Global Init Script¶
Global init scripts run on every cluster at cluster launch time. Be careful about what you place in these init scripts.
Now let’s create a script that simply appends to a file on our hard drive.
dbutils.fs.put("dbfs:/databricks/init/my-echo.sh", """
#!/bin/bash
echo "hello" >> /hello.txt
""", True)
Now let’s ensure that it exists.
Now, every time a cluster launches, this script will run.
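The script body itself is plain bash. Run locally, with a temp file standing in for /hello.txt (which requires root to write), its append-on-each-launch behavior looks like this; the double invocation simulates two cluster launches:

```shell
#!/bin/bash
# Stand-in for /hello.txt so this sketch runs without root.
OUT="${OUT:-$(mktemp)}"

echo "hello" >> "$OUT"   # first cluster launch
echo "hello" >> "$OUT"   # a second launch appends another line
```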
Deleting Init Scripts¶
Deleting an init script is straightforward: delete the file you created in the process above, either from a notebook or via the DBFS API. If a global init script is preventing new clusters from starting up, use the DBFS API to move or delete that script.
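When going through the DBFS API, deletion uses the POST /api/2.0/dbfs/delete endpoint. The sketch below only builds the JSON payload for that request; the workspace host and auth token you would send it to are placeholders you must supply yourself, and no request is made here:

```python
import json

# Payload for the DBFS delete endpoint: POST /api/2.0/dbfs/delete
# The path refers to the global init script created earlier.
def dbfs_delete_payload(path: str, recursive: bool = False) -> str:
    return json.dumps({"path": path, "recursive": recursive})

payload = dbfs_delete_payload("/databricks/init/my-echo.sh")
print(payload)
```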