Cluster node initialization scripts

An init script is a shell script that runs during startup of each cluster node before the Apache Spark driver or worker JVM starts.

Some examples of tasks performed by init scripts include:

  • Install packages and libraries not included in Databricks Runtime. To install Python packages, use the Databricks pip binary located at /databricks/python/bin/pip to ensure that Python packages install into the Databricks Python virtual environment rather than the system Python environment. For example, /databricks/python/bin/pip install <package-name>. A minimal sketch follows this list.
  • Modify the JVM system classpath in special cases.
  • Set system properties and environment variables used by the JVM.
  • Modify Spark configuration parameters.
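
For example, a minimal sketch of an init script that installs a single Python package into the Databricks Python virtual environment (the package name, requests, is illustrative):

#!/bin/bash
# Use the Databricks pip binary so the package installs into the Databricks
# Python virtual environment rather than the system Python environment.
/databricks/python/bin/pip install requests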

Init script types

Databricks supports two kinds of init scripts: cluster-scoped and global.

  • Cluster-scoped: run on every cluster configured with the script. This is the recommended way to run an init script.

  • Global (Public Preview): run on every cluster in the workspace. They can help you to enforce consistent cluster configurations across your workspace. Use them carefully because they can cause unanticipated impacts, like library conflicts. Only admin users can create global init scripts.

    Note

    Databricks recently improved the behavior of global init scripts to work in a safer, more visible, and more secure manner. You should migrate existing “legacy” global init scripts to the new global init script framework. See Migrate from legacy to new global init scripts.

In addition, there are two kinds of init scripts that are deprecated. Databricks recommends that you migrate init scripts of these types to those listed above:

  • Cluster-named: run on a cluster with the same name as the script. Cluster-named init scripts are best-effort: they silently ignore failures and attempt to continue the cluster launch process. Cluster-scoped init scripts are a complete replacement and should be used instead.
  • Legacy global: run on every cluster. They are less secure than the new global init script framework, silently ignore failures, and cannot reference environment variables. The new global init script framework should be used instead.

Whenever you change any type of init script, you must restart all clusters affected by the script.

Init script execution order

The order of execution of init scripts is:

  1. Legacy global
  2. Cluster-named
  3. Global (new)
  4. Cluster-scoped

Environment variables

Cluster-scoped and global init scripts (new generation) support the following environment variables:

  • DB_CLUSTER_ID: the ID of the cluster on which the script is running. See Clusters API.
  • DB_CONTAINER_IP: the private IP address of the container in which Spark runs. The init script is run inside this container. See SparkNode.
  • DB_IS_DRIVER: whether the script is running on a driver node.
  • DB_DRIVER_IP: the IP address of the driver node.
  • DB_INSTANCE_TYPE: the instance type of the host VM.
  • DB_CLUSTER_NAME: the name of the cluster the script is executing on.
  • DB_PYTHON_VERSION: the version of Python used on the cluster. See Python version.
  • DB_IS_JOB_CLUSTER: whether the cluster was created to run a job. See Create a job.
  • SPARKPASSWORD: a path to a secret.

For example, if you want to run part of a script only on a driver node, you could write a script like:

#!/bin/bash
echo $DB_IS_DRIVER
if [[ $DB_IS_DRIVER = "TRUE" ]]; then
  <run this part only on driver>
else
  <run this part only on workers>
fi
<run this part on both driver and workers>

Logging

Init script start and finish events are captured in cluster event logs. Details are captured in cluster logs. Global init script create, edit, and delete events are also captured in account-level audit logs.

Init script events

Cluster event logs capture two init script events: INIT_SCRIPTS_STARTED and INIT_SCRIPTS_FINISHED, indicating which scripts are scheduled for execution and which have completed successfully. INIT_SCRIPTS_FINISHED also captures execution duration.

Global init scripts are indicated in the log event details by the key "global" and cluster-scoped init scripts are indicated by the key "cluster".
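
One way to retrieve these events is the cluster events endpoint of the Clusters API. A sketch, assuming a .netrc entry supplies credentials (as in the curl example later in this article):

# List recent init script events for a cluster (the cluster ID is illustrative).
curl -n -X POST https://<databricks-instance>/api/2.0/clusters/events -d '{
  "cluster_id": "1001-234039-abcde739",
  "event_types": ["INIT_SCRIPTS_STARTED", "INIT_SCRIPTS_FINISHED"],
  "limit": 10
}'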

Note

Cluster event logs do not log init script events for each cluster node; only one node is selected to represent them all.

Init script logs

If cluster log delivery is configured for a cluster, init script logs are written to /databricks/init_scripts. For each container, logs appear in a subdirectory named <cluster_id>_<container_ip>. If cluster logs are delivered to dbfs:/cluster-logs, these subdirectories appear under dbfs:/cluster-logs/<cluster_id>/init_scripts. For example:

dbfs ls dbfs:/cluster-logs/1001-234039-abcde739/init_scripts
1001-234039-abcde739_10_97_225_166
1001-234039-abcde739_10_97_231_88
1001-234039-abcde739_10_97_244_199

If the logs are delivered to DBFS, you can view the logs using File system utilities. Otherwise, you can use the following code in a notebook to view the logs:

%sh
ls /databricks/init_scripts/

Every time a cluster launches, it writes a log to the init script log folder.

Important

Any user who creates a cluster and enables cluster log delivery can view the stderr and stdout output from global init scripts. You should ensure that your global init scripts do not output any sensitive information.

Audit logs

Databricks audit logs capture global init script create, edit, and delete events under the event type globalInitScripts. See Configure audit logging.

Cluster-scoped init scripts

Cluster-scoped init scripts are init scripts defined in a cluster configuration. Cluster-scoped init scripts apply to both clusters you create and those created to run jobs. Since the scripts are part of the cluster configuration, cluster access control lets you control who can change the scripts.

You can configure cluster-scoped init scripts using the UI, the CLI, or the Clusters API. This section focuses on performing these tasks using the UI. For the other methods, see Databricks CLI and Clusters API.

You can add any number of scripts, and the scripts are executed sequentially in the order provided.

If a cluster-scoped init script returns a non-zero exit code, the cluster launch fails. You can troubleshoot cluster-scoped init scripts by configuring cluster log delivery and examining the init script log.
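
Because any non-zero exit code fails the launch, it helps to make scripts fail fast and echo each command, so the failing step is easy to spot in the init script log. A minimal sketch (the package name is illustrative):

#!/bin/bash
# -e aborts on the first failing command, -u treats unset variables as
# errors, -x echoes each command to the log before running it.
set -eux
/databricks/python/bin/pip install simplejson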

Cluster-scoped init script locations

You can put init scripts in a DBFS or S3 directory accessible by a cluster. Cluster-node init scripts in DBFS must be stored in the DBFS root. Databricks does not support storing init scripts in a DBFS directory created by mounting object storage.

Example cluster-scoped init scripts

This section shows two examples of init scripts.

Example: Install PostgreSQL JDBC driver

The following snippets, run in a Python notebook, create an init script that installs a PostgreSQL JDBC driver.

  1. Create a DBFS directory you want to store the init script in. This example uses dbfs:/databricks/scripts.

    dbutils.fs.mkdirs("dbfs:/databricks/scripts/")
    
  2. Create a script named postgresql-install.sh in that directory:

    dbutils.fs.put("/databricks/scripts/postgresql-install.sh","""
    #!/bin/bash
    wget --quiet -O /mnt/driver-daemon/jars/postgresql-42.2.2.jar https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.2/postgresql-42.2.2.jar""", True)
    
  3. Check that the script exists.

    display(dbutils.fs.ls("dbfs:/databricks/scripts/postgresql-install.sh"))
    

Alternatively, you can create the init script postgresql-install.sh locally:

#!/bin/bash
wget --quiet -O /mnt/driver-daemon/jars/postgresql-42.2.2.jar https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.2/postgresql-42.2.2.jar

and copy it to dbfs:/databricks/scripts using the DBFS CLI:

dbfs cp postgresql-install.sh dbfs:/databricks/scripts/postgresql-install.sh

Example: Use conda to install Python libraries

In Databricks Runtime ML, you use the Conda package manager to install Python packages. To install a Python library at cluster initialization, you can use a script like the following:

#!/bin/bash
set -ex
/databricks/python/bin/python -V
. /databricks/conda/etc/profile.d/conda.sh
conda activate /databricks/python
conda install -y astropy

Note

For other ways to install Python packages on a cluster, see Libraries.

Configure a cluster-scoped init script

You can configure a cluster to run an init script using the UI or API.

Important

  • The script must exist at the configured location. If the script doesn’t exist, the cluster will fail to start or to scale up.
  • The init script cannot be larger than 64KB. If a script exceeds that size, the cluster will fail to launch and a failure message will appear in the cluster log.
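
For example, you can check a script’s size locally before uploading it:

wc -c < postgresql-install.sh   # prints the size in bytes; must be below 65536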

Configure a cluster-scoped init script using the UI

To use the cluster configuration page to configure a cluster to run an init script:

  1. On the cluster configuration page, click the Advanced Options toggle.

  2. At the bottom of the page, click the Init Scripts tab.

    Init Scripts tab
  3. In the Destination drop-down, select a destination type. In the example in the preceding section, the destination is DBFS.

  4. Specify a path to the init script. In the example in the preceding section, the path is dbfs:/databricks/scripts/postgresql-install.sh.

  5. If the destination type is S3:

    1. Select a region.

    2. Ensure that the cluster is configured with an instance profile that has the GetObjectAcl permission for access to the bucket. For example:

      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Effect": "Allow",
            "Action": [
              "s3:getObjectAcl"
            ],
            "Resource": [
              "arn:aws:s3:::<my-s3-bucket>/*"
            ]
          }
        ]
      }
      
  6. Click Add.

To remove a script from the cluster configuration, click the Delete Icon at the right of the script. When you confirm the delete, you are prompted to restart the cluster. Optionally, you can delete the script file from the location you uploaded it to.

Configure a cluster-scoped init script using the Clusters API

To use the Clusters API to configure the cluster with ID 1202-211320-brick1 to run the init script in the preceding section, run the following command:

curl -n -X POST -H 'Content-Type: application/json' -d '{
  "cluster_id": "1202-211320-brick1",
  "num_workers": 1,
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "i3.2xlarge",
  "cluster_log_conf": {
    "dbfs" : {
      "destination": "dbfs:/cluster-logs"
    }
  },
  "init_scripts": [ {
    "dbfs": {
      "destination": "dbfs:/databricks/scripts/postgresql-install.sh"
    }
  } ]
}' https://<databricks-instance>/api/2.0/clusters/edit

Global init scripts (new)

Preview

The new global init script framework is in Public Preview.

A global init script runs on every cluster created in your workspace. Global init scripts are useful when you want to enforce organization-wide library configurations or security screens. Only admins can create global init scripts. You can create them using either the UI or REST API.

Important

Use global init scripts carefully. It is easy to add libraries or make other modifications that cause unanticipated impacts. Whenever possible, use cluster-scoped init scripts instead.

You can troubleshoot global init scripts by configuring cluster log delivery and examining the init script log.

Important

Any user who creates a cluster and enables cluster log delivery can view the stderr and stdout output from global init scripts. You should ensure that your global init scripts do not output any sensitive information.

Add a global init script using the UI

To configure global init scripts using the Admin Console:

  1. Go to the Admin Console and click the Global Init Scripts tab.

    Global Init Scripts tab
  2. Click the + Add button.

  3. Name the script and enter it by typing, pasting, or dragging a text file into the Script field.

    Add global init script

    Note

    The init script cannot be larger than 64KB. If a script exceeds that size, an error message will appear when you try to save.

  4. If you have more than one global init script configured for your workspace, set the order in which the new script will run.

  5. If you want the script to be enabled for all new and restarted clusters after you save, toggle on the Enabled switch.

    Important

    You must restart running clusters for changes to global init scripts to take effect, including changes to run order, name, and enablement state.

  6. Click Add.

Edit a global init script using the UI

  1. Go to the Admin Console and click the Global Init Scripts tab.
  2. Click a script.
  3. Edit the script.
  4. Click Confirm.

Configure a global init script using the API

Admins can add, delete, re-order, and get information about the global init scripts in your workspace using the Global Init Scripts API.
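
For example, a sketch of creating a disabled global init script with the Global Init Scripts API; the name and script body are illustrative, and the script contents must be base64-encoded:

# Encode the script body (for longer scripts, disable base64 line wrapping,
# for example with base64 -w0 on Linux).
SCRIPT_B64=$(printf '#!/bin/bash\necho "hello from a global init script"\n' | base64)
curl -n -X POST https://<databricks-instance>/api/2.0/global-init-scripts -d "{
  \"name\": \"example-script\",
  \"script\": \"${SCRIPT_B64}\",
  \"enabled\": false,
  \"position\": 0
}"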

Migrate from legacy to new global init scripts

If your Databricks workspace was launched before August 2020, you might still have legacy global init scripts. You should migrate these to the new global init script framework to take advantage of the security, consistency, and visibility features included in the new script framework.

  1. Copy your existing legacy global init scripts and add them to the new global init script framework using either the UI or the REST API.

    Keep them disabled until you have completed the next step.

  2. Disable all legacy global init scripts.

    In the Admin Console, go to the Global Init Scripts tab and toggle off the Legacy Global Init Scripts switch.

    Disable legacy global init scripts
  3. Enable your new global init scripts.

    On the Global Init Scripts tab, toggle on the Enabled switch for each init script you want to enable.

    Enable global scripts
  4. Restart all clusters.

    • Legacy scripts do not run on new nodes added during automated scale-up of running clusters, and neither do the new global init scripts until the cluster is restarted. You must restart all clusters to ensure that the new scripts run on them and that no running cluster adds new nodes on which no global scripts run at all.
    • Non-idempotent scripts may need to be modified when you migrate to the new global init script framework and disable legacy scripts.
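
For example, a common way to make a script idempotent is to guard its one-time work with a marker file; the marker path and setup step here are illustrative:

#!/bin/bash
# Skip setup that an earlier run of this script already completed on this node.
MARKER=/tmp/.example-init-done
if [[ -e "$MARKER" ]]; then
  exit 0
fi
echo "hello" >> /hello.txt   # one-time setup work
touch "$MARKER"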

Legacy global init scripts (deprecated)

A legacy global init script runs on every cluster created in your workspace.

Important

Legacy global init scripts are deprecated in favor of the new global init script framework, which is more secure, provides visibility into failures, and can reference cluster-related environment variables. You should migrate existing legacy global init scripts to the new framework to take advantage of these improvements.

A legacy global init script must be stored in dbfs:/databricks/init/.

Important

  • Use global init scripts carefully. It is easy to add libraries or make other modifications that cause unanticipated impacts. Whenever possible, use cluster-scoped init scripts instead.
  • If there is more than one legacy global init script, the order of execution is undetermined and depends on the order that the DBFS or S3 client returns the scripts.

To delete a legacy global init script, delete the init script file. You can perform this in a notebook, using the DBFS API, or using the DBFS CLI. For example:

dbutils.fs.rm("/databricks/init/my-echo.sh")

If you have created a legacy global init script that is preventing new clusters from starting up, use the API or CLI to move or delete the script.

Example legacy global init script

The following snippets, run in a Python notebook, create a legacy global init script named my-echo.sh in the DBFS location /databricks/init/:

  1. Create dbfs:/databricks/init/ if it doesn’t exist.

    dbutils.fs.mkdirs("dbfs:/databricks/init/")
    
  2. Display the list of existing global init scripts.

    display(dbutils.fs.ls("dbfs:/databricks/init/"))
    
  3. Create a script that simply appends to a file.

    dbutils.fs.put("dbfs:/databricks/init/my-echo.sh" ,"""
    #!/bin/bash
    
    echo "hello" >> /hello.txt
    """, True)
    
  4. Check that the script exists.

    display(dbutils.fs.ls("dbfs:/databricks/init/"))
    

Cluster-named init scripts (deprecated)

Cluster-named scripts scope to a single cluster, specified by the cluster’s name. Cluster-named init scripts must be stored in the directory dbfs:/databricks/init/<cluster-name>. For example, to specify init scripts for the cluster named PostgreSQL, create the directory dbfs:/databricks/init/PostgreSQL, and put all scripts that should run on cluster PostgreSQL in that directory.

Important

  • Cluster-named init scripts are deprecated. You should use cluster-scoped init scripts instead.
  • You cannot use cluster-named init scripts for clusters that run jobs because job cluster names are generated on the fly. However, you can use cluster-scoped init scripts for job clusters.
  • Avoid spaces in cluster names since they’re used in the script and output paths.
  • If there is more than one cluster-named init script, the order of execution is undetermined and depends on the order that the DBFS or S3 client returns the scripts.

To delete a cluster-named init script, delete the init script file. You can perform this in a notebook, using the DBFS API, or using the DBFS CLI. For example:

dbutils.fs.rm("dbfs:/databricks/init/PostgreSQL/postgresql-install.sh")

Example cluster-named init script

The following snippets, run in a Python notebook, create an init script, postgresql-install.sh, in the DBFS directory /databricks/init/PostgreSQL; it installs the PostgreSQL JDBC driver on the cluster named PostgreSQL. Defining a clusterName variable that holds the cluster name makes the commands reusable for other clusters.

  1. Create dbfs:/databricks/init/ if it doesn’t exist.

    dbutils.fs.mkdirs("dbfs:/databricks/init/")
    
  2. Configure a cluster name variable.

    clusterName = "PostgreSQL"
    
  3. Create a subdirectory named PostgreSQL.

    dbutils.fs.mkdirs("dbfs:/databricks/init/%s/"%clusterName)
    
  4. Create the script in the PostgreSQL directory.

    dbutils.fs.put("/databricks/init/PostgreSQL/postgresql-install.sh","""
    #!/bin/bash
    wget --quiet -O /mnt/driver-daemon/jars/postgresql-42.2.2.jar https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.2/postgresql-42.2.2.jar
    wget --quiet -O /mnt/jars/driver-daemon/postgresql-42.2.2.jar https://repo1.maven.org/maven2/org/postgresql/postgresql/42.2.2/postgresql-42.2.2.jar""", True)
    
  5. Check that the cluster-specific init script exists.

    display(dbutils.fs.ls("dbfs:/databricks/init/%s/postgresql-install.sh"%clusterName))
    

Legacy global and cluster-named init script logs (deprecated)

Databricks saves all init script output for legacy global and cluster-named init scripts to a file in DBFS named as follows: dbfs:/databricks/init/output/<cluster-name>/<date-timestamp>/<script-name>_<node-ip>.log. For example, if a cluster PostgreSQL has two Spark nodes with IPs 10.0.0.1 and 10.0.0.2, and the init script directory has a script called installpostgres.sh, there will be two output files at the following paths:

  • dbfs:/databricks/init/output/PostgreSQL/2016-01-01_12-00-00/installpostgres.sh_10.0.0.1.log
  • dbfs:/databricks/init/output/PostgreSQL/2016-01-01_12-00-00/installpostgres.sh_10.0.0.2.log
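
For example, to list these output logs with the DBFS CLI (the date-timestamp directory is illustrative):

dbfs ls dbfs:/databricks/init/output/PostgreSQL/2016-01-01_12-00-00
installpostgres.sh_10.0.0.1.log
installpostgres.sh_10.0.0.2.log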