Cluster Node Initialization Scripts

An init script is a shell script that runs during startup for each cluster node before the Spark driver or worker JVM starts. Examples of tasks performed by init scripts include:

  • Modify the JVM system classpath in special cases.
  • Set system properties and environment variables used by the JVM.
  • Modify Spark configuration parameters.

Important

To install Python packages, use the Databricks pip binary located at /databricks/python/bin/pip to ensure that Python packages install into the Databricks Python virtual environment rather than the system Python environment. For example, /databricks/python/bin/pip install <packagename>.

Init script types

Databricks supports three kinds of init scripts: global, cluster-named, and cluster-scoped. The order of execution of init scripts is:

  1. Global - runs on every cluster
  2. Cluster-named - runs on a cluster with the same name as the script
  3. Cluster-scoped - runs on every cluster that references the script
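The execution order above can be modeled as a simple sort. The following sketch is illustrative only; the names and data structure are assumptions for demonstration, not part of any Databricks API:

```python
# Illustrative: model the documented execution order of init script kinds.
ORDER = {"global": 0, "cluster-named": 1, "cluster-scoped": 2}

def execution_order(scripts):
    """Sort (kind, path) pairs into the order they run on a node."""
    return sorted(scripts, key=lambda s: ORDER[s[0]])

scripts = [
    ("cluster-scoped", "dbfs:/databricks/scripts/jdbc.sh"),
    ("global", "dbfs:/databricks/init/my-echo.sh"),
    ("cluster-named", "dbfs:/databricks/init/PostgreSQL/postgresql-install.sh"),
]
print([kind for kind, _ in execution_order(scripts)])
# → ['global', 'cluster-named', 'cluster-scoped']
```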

Global init scripts

Global init scripts are stored in dbfs:/databricks/init/ and run on every cluster.

Important

Any change to a global init script requires a cluster restart.

To delete a global init script, delete the init script file. You can do this in a notebook, with the DBFS API, or with the DBFS CLI. For example:

dbutils.fs.rm("/databricks/init/my-echo.sh")

If you have created a global init script that is preventing new clusters from starting up, use the API or CLI to move or delete the script.

Cluster-named init scripts

Important

  • Cluster-named init scripts are deprecated. Databricks recommends that you use cluster-scoped init scripts.
  • You cannot use cluster-named init scripts for clusters that run jobs because job cluster names are generated on the fly. However, you can use cluster-scoped init scripts for job clusters.
  • Avoid spaces in cluster names since they’re used in the script and output paths.

Cluster-named scripts scope to a single cluster, specified by the cluster’s name. Cluster-named init scripts are stored in the directory dbfs:/databricks/init/<cluster-name>. For example, to specify init scripts for the cluster named PostgreSQL, create the directory dbfs:/databricks/init/PostgreSQL, and put all scripts that should run on cluster PostgreSQL in that directory.
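Because the directory path is derived directly from the cluster name, it can be built with a small helper that also enforces the no-spaces rule from the note above. This helper is hypothetical, for illustration only:

```python
def cluster_named_script_dir(cluster_name):
    """Return the DBFS directory holding init scripts for the named cluster.
    Spaces are rejected because the name is embedded in script and output paths."""
    if " " in cluster_name:
        raise ValueError("avoid spaces in cluster names: %r" % cluster_name)
    return "dbfs:/databricks/init/%s" % cluster_name

print(cluster_named_script_dir("PostgreSQL"))
# → dbfs:/databricks/init/PostgreSQL
```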

Important

Any change to a cluster-named init script requires a cluster restart.

To delete a cluster-named init script, delete the init script file. You can do this in a notebook, with the DBFS API, or with the DBFS CLI. For example:

dbutils.fs.rm("dbfs:/databricks/init/PostgreSQL/postgresql-install.sh")

Cluster-scoped init scripts

Cluster-scoped init scripts are init scripts defined in a cluster configuration. Cluster-scoped init scripts apply to both clusters you create and those created to run jobs. Since the scripts are part of the cluster configuration, cluster access control lets you control who can change the scripts.

You can configure cluster-scoped init scripts using the UI, the CLI, or the Clusters API. This section focuses on the UI. For the other methods, see Databricks CLI and Clusters API.

On the cluster configuration page, configure init scripts on the Init Scripts tab at the bottom of the page. You can add any number of scripts; they execute sequentially in the order provided.


You can put init scripts in any DBFS or S3 path accessible by the cluster. To configure an init script:

  1. In the Destination drop-down, select a destination type.
  2. Specify a path to the init script.
  3. If the destination type is S3, select a region.
  4. Click Add.
  5. Upload your script to the specified location.
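If you configure the same scripts through the Clusters API instead of the UI, each script becomes one entry in the init_scripts array of the cluster spec. A sketch of building such entries (the default region value here is a placeholder assumption):

```python
import json

def init_script_entry(destination, region="us-west-2"):
    """Build one init_scripts entry for the Clusters API.
    DBFS and S3 destinations use different keys; S3 entries also carry a region."""
    if destination.startswith("dbfs:/"):
        return {"dbfs": {"destination": destination}}
    if destination.startswith("s3://"):
        return {"s3": {"destination": destination, "region": region}}
    raise ValueError("unsupported destination: " + destination)

conf = {"init_scripts": [init_script_entry("dbfs:/databricks/scripts/setup.sh"),
                         init_script_entry("s3://my-bucket/init/setup.sh")]}
print(json.dumps(conf, indent=2))
```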

S3 bucket destinations

If you choose an S3 destination, you must configure the cluster with an IAM role that can access the bucket. The IAM role must have the s3:GetObjectAcl permission; an example policy is shown below. See Secure Access to S3 Buckets Using IAM Roles for instructions on setting up an IAM role.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObjectAcl"
            ],
            "Resource": [
                "arn:aws:s3:::<my-s3-bucket>/*"
            ]
        }
    ]
}

Important

If the script specified in the configuration doesn’t exist, cluster creation and scale-up attempts fail.

To remove a script from the cluster configuration, click the delete icon at the right of the script. When you confirm the deletion, you are prompted to restart the cluster. Optionally, delete the script file from the location where you uploaded it.

Environment variables

Init scripts support the following environment variables:

  • DB_CLUSTER_ID - the ID of the cluster on which the script is running
  • DB_CONTAINER_IP - the IP address of the LXC container in which Spark runs
  • DB_IS_DRIVER - indicates whether the script is running on a driver node
  • DB_DRIVER_IP - the IP address of the driver node
  • DB_INSTANCE_TYPE - the type of Spark node
  • DB_PYTHON_VERSION - the version of Python used on the cluster
  • DB_IS_JOB_CLUSTER - indicates whether the cluster was created to run a job

For example, if you want to run a script only on a driver node, you could write a script like:

echo $DB_IS_DRIVER
if [[ $DB_IS_DRIVER = "TRUE" ]]; then
  <run this part only on driver>
else
  <run this part only on workers>
fi
<run this part on both driver and workers>
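A concrete, runnable variant of the sketch above; the log path and messages are arbitrary examples, not anything Databricks requires:

```shell
#!/bin/bash
# DB_IS_DRIVER is set by Databricks on cluster nodes; on any other machine it
# is unset, so this example falls through to the worker branch.
LOG=/tmp/init-example.log
if [[ $DB_IS_DRIVER = "TRUE" ]]; then
  echo "driver node: $DB_DRIVER_IP" >> "$LOG"
else
  echo "worker node in cluster $DB_CLUSTER_ID" >> "$LOG"
fi
echo "runs on driver and workers" >> "$LOG"
```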

Cluster-scoped init script events

Init scripts report start and finish events in the cluster event log. Therefore, you can compute the time it takes for an init script to run by subtracting the start from the finish event timestamps.
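For example, given a pair of hypothetical event timestamps in epoch milliseconds (real values come from the cluster event log), the duration works out as:

```python
# Illustrative timestamps in epoch milliseconds.
started_ms = 1546300800000   # init script start event
finished_ms = 1546300812500  # init script finish event

duration_s = (finished_ms - started_ms) / 1000.0
print(duration_s)  # → 12.5
```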

Init script logs

Init scripts write logs that can be used to debug issues with the scripts.

Global and cluster-named init scripts

Databricks saves all init script output for global and cluster-named init scripts to a file in DBFS named as follows: dbfs:/databricks/init/output/<cluster-name>/<date-timestamp>/<script-name>_<node-ip>.log. For example, if a cluster PostgreSQL has two Spark nodes with IPs 10.0.0.1 and 10.0.0.2, and the init script directory has a script called installpostgres.sh, there will be two output files at the following paths:

  • dbfs:/databricks/init/output/PostgreSQL/2016-01-01_12-00-00/installpostgres.sh_10.0.0.1.log
  • dbfs:/databricks/init/output/PostgreSQL/2016-01-01_12-00-00/installpostgres.sh_10.0.0.2.log
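Since the output path follows a fixed pattern, it can be reconstructed from its parts. This helper is illustrative only:

```python
def init_output_log_path(cluster_name, timestamp, script_name, node_ip):
    """DBFS path for the output of a global or cluster-named init script."""
    return ("dbfs:/databricks/init/output/%s/%s/%s_%s.log"
            % (cluster_name, timestamp, script_name, node_ip))

print(init_output_log_path("PostgreSQL", "2016-01-01_12-00-00",
                           "installpostgres.sh", "10.0.0.1"))
# → dbfs:/databricks/init/output/PostgreSQL/2016-01-01_12-00-00/installpostgres.sh_10.0.0.1.log
```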

Cluster-scoped init scripts

By default, local init script logs inside the LXC container are stored in /databricks/init_scripts.

If cluster log delivery is configured, logs are delivered to that location. For each container, they will appear in a subdirectory called init_scripts/<container_name>. For example, if cluster logs are delivered to dbfs:/cluster-logs, the directory would be: dbfs:/cluster-logs/init_scripts/<container_name>.

Examples

Create a global init script

Warning

Global init scripts run on every cluster at cluster startup. Be careful about what you place in these init scripts.

  1. Create dbfs:/databricks/init/ if it doesn’t exist.

    dbutils.fs.mkdirs("dbfs:/databricks/init/")
    
  2. Display the list of existing global init scripts.

    display(dbutils.fs.ls("dbfs:/databricks/init/"))
    
  3. Create a script that simply appends to a file.

    dbutils.fs.put("dbfs:/databricks/init/my-echo.sh", """
    #!/bin/bash
    
    echo "hello" >> /hello.txt
    """, True)
    
  4. Check that the script exists.

    display(dbutils.fs.ls("dbfs:/databricks/init/"))
    

Every time a cluster launches, it runs this script.

Create a cluster-scoped init script

This example creates an init script that installs a PostgreSQL JDBC driver on a cluster with ID 1202-211320-brick1.

  1. Create the base directory you want to store the init script in if it does not exist. Here we use dbfs:/databricks/<directory> as an example.

    dbutils.fs.mkdirs("dbfs:/databricks/<directory>/")
    
  2. Create the script.

    dbutils.fs.put("/databricks/<directory>/postgresql-install.sh","""
    #!/bin/bash
    wget --quiet -O /mnt/driver-daemon/jars/postgresql-42.2.2.jar https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.2/postgresql-42.2.2.jar
    wget --quiet -O /mnt/jars/driver-daemon/postgresql-42.2.2.jar https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.2/postgresql-42.2.2.jar""", True)
    
  3. Check that the script exists.

    display(dbutils.fs.ls("dbfs:/databricks/<directory>/postgresql-install.sh"))
    
  4. Configure the cluster to run the script.

    curl -n -X POST -H 'Content-Type: application/json' -d '{
      "cluster_id": "1202-211320-brick1",
      "cluster_log_conf": {
        "dbfs" : {
          "destination": "dbfs:/cluster-logs"
        }
      },
      "init_scripts": [ {
        "dbfs": {
          "destination": "dbfs:/databricks/<directory>/postgresql-install.sh"
        }
      } ]
    }' https://<databricks-instance>/api/2.0/clusters/edit
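
The same request body can be assembled in Python before posting it to the Clusters API; this sketch only builds the JSON, it does not call the API, and <directory> is the placeholder used throughout this example:

```python
import json

# Build the clusters/edit payload from the example above.
payload = {
    "cluster_id": "1202-211320-brick1",
    "cluster_log_conf": {"dbfs": {"destination": "dbfs:/cluster-logs"}},
    "init_scripts": [
        {"dbfs": {"destination": "dbfs:/databricks/<directory>/postgresql-install.sh"}}
    ],
}
print(json.dumps(payload, indent=2))
```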
    

Create a cluster-named init script

This example creates an init script for a cluster named PostgreSQL that installs the PostgreSQL JDBC driver on that cluster. Defining a clusterName variable that holds the cluster name makes these commands reusable for other clusters.

  1. Create dbfs:/databricks/init/ if it doesn’t exist.

    dbutils.fs.mkdirs("dbfs:/databricks/init/")
    
  2. Display the list of existing global init scripts.

    display(dbutils.fs.ls("dbfs:/databricks/init/"))
    
  3. Configure a cluster name variable.

    clusterName = "PostgreSQL"
    
  4. Create a directory named PostgreSQL using Databricks File System - DBFS.

    dbutils.fs.mkdirs("dbfs:/databricks/init/%s/"%clusterName)
    
  5. Create the script.

    dbutils.fs.put("/databricks/init/%s/postgresql-install.sh" % clusterName, """
    #!/bin/bash
    wget --quiet -O /mnt/driver-daemon/jars/postgresql-42.2.2.jar https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.2/postgresql-42.2.2.jar
    wget --quiet -O /mnt/jars/driver-daemon/postgresql-42.2.2.jar https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.2/postgresql-42.2.2.jar""", True)
    
  6. Check that the cluster-specific init script exists.

    display(dbutils.fs.ls("dbfs:/databricks/init/%s/postgresql-install.sh"%clusterName))