Hail 0.2

Hail is a library built on Apache Spark for analyzing large genomic datasets.

Create a Hail cluster

You can install Hail with an init script.

  1. Create the base directory where you want to store the init script. The following example uses dbfs:/databricks/scripts.

  2. Save the init script to that directory by running this snippet in a notebook:

    dbutils.fs.put(
    '/databricks/scripts/install-hail.sh',
    '''#!/bin/bash
    set -ex
    
    # Pick up user-provided environment variables, specifically HAIL_VERSION
    source /databricks/spark/conf/spark-env.sh
    
    /databricks/python/bin/pip install -U hail==$HAIL_VERSION
    hail_jar_path=$(find /databricks/python3 -name 'hail-all-spark.jar')
    cp $hail_jar_path /databricks/jars
    
    # Note: This configuration takes precedence since configurations are
    # applied in reverse-lexicographic order.
    cat <<HERE >/databricks/driver/conf/00-hail.conf
    [driver] {
      "spark.kryo.registrator" = "is.hail.kryo.HailKryoRegistrator"
      "spark.hadoop.fs.s3a.connection.maximum" = 5000
      "spark.serializer" = "org.apache.spark.serializer.KryoSerializer"
    }
    HERE
    
    echo $?
    ''',
    overwrite=True
    )
    

  3. Create a cluster that uses Databricks Runtime 6.4, runs the init script, and sets an environment variable specifying the Hail version to install:

    HAIL_VERSION=0.2.61
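
If you create the cluster programmatically rather than through the UI, the three settings from step 3 could be expressed as a cluster specification along these lines. This is a minimal sketch, not a definitive configuration: it assumes the Clusters API field names `spark_version`, `init_scripts`, and `spark_env_vars`, assumes that `6.4.x-scala2.11` is the version key for Databricks Runtime 6.4, and uses a hypothetical cluster name, node type, and worker count that you would replace with your own.

```python
import json


def hail_cluster_spec(hail_version,
                      script_path="dbfs:/databricks/scripts/install-hail.sh",
                      node_type_id="i3.xlarge",  # hypothetical node type
                      num_workers=2):            # hypothetical cluster size
    """Build a cluster specification that installs Hail through the init script."""
    return {
        "cluster_name": "hail-cluster",      # hypothetical name
        "spark_version": "6.4.x-scala2.11",  # assumed key for Databricks Runtime 6.4
        "node_type_id": node_type_id,
        "num_workers": num_workers,
        # Run the init script saved in step 2 on every node.
        "init_scripts": [{"dbfs": {"destination": script_path}}],
        # The init script reads HAIL_VERSION to decide which release to install.
        "spark_env_vars": {"HAIL_VERSION": hail_version},
    }


spec = hail_cluster_spec("0.2.61")
print(json.dumps(spec, indent=2))
```

The printed JSON is only the request body; how you submit it (REST call, CLI, or another tool) depends on your workspace setup.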
    

Use Hail in a notebook

For the most part, Hail 0.2 code runs in Databricks exactly as described in the Hail documentation. However, the Databricks environment requires a few modifications.

Initialize Hail

When initializing Hail, pass in the pre-created SparkContext (sc) and mark the initialization as idempotent. Idempotent initialization lets multiple Databricks notebooks share the same Hail context.

Note

Set skip_logging_configuration=True to save Hail logs to the rolling driver log4j output. This parameter is supported only in Hail 0.2.39 and above.

import hail as hl
hl.init(sc, idempotent=True, quiet=True, skip_logging_configuration=True)
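
Because skip_logging_configuration is supported only in Hail 0.2.39 and above, a notebook that may run against different Hail versions could build the hl.init arguments conditionally. The helper below is a hypothetical sketch (hail_init_kwargs is not part of Hail). It assumes the version string looks like "0.2.61", optionally followed by a "-" and a build suffix.

```python
def hail_init_kwargs(hail_version):
    """Return keyword arguments for hl.init that suit the given Hail version."""
    # Drop any build suffix, then compare numeric components:
    # "0.2.61-abc123" -> (0, 2, 61)
    release = hail_version.split("-")[0]
    parts = tuple(int(p) for p in release.split(".")[:3])

    kwargs = {"idempotent": True, "quiet": True}
    # skip_logging_configuration is only accepted by Hail 0.2.39 and above.
    if parts >= (0, 2, 39):
        kwargs["skip_logging_configuration"] = True
    return kwargs


print(hail_init_kwargs("0.2.61"))
print(hail_init_kwargs("0.2.38"))
```

In a notebook you would then call something like hl.init(sc, **hail_init_kwargs(hl.version())), assuming hl.version() returns the installed version string.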

Display Bokeh plots

Hail uses the Bokeh library to create plots. The show function built into Bokeh does not work in Databricks. To display a Bokeh plot generated by Hail, render it to HTML and pass the result to displayHTML, for example:

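from bokeh.embed import file_html
from bokeh.resources import CDN

# `mt` is assumed to be an existing MatrixTable with a DP entry field.
plot = hl.plot.histogram(mt.DP, range=(0, 30), bins=30, title='DP Histogram', legend='DP')
# Render the plot as a standalone HTML page and display it in the notebook.
html = file_html(plot, CDN, "Chart")
displayHTML(html)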

See the Bokeh documentation for more information.