Hail

Hail is a library built on Apache Spark for analyzing large genomic datasets.

Important

  • If you use Hail 0.2.65 or above, use Apache Spark 3.1 (Databricks Runtime 8.x or 9.x)

  • Install Hail on Databricks Runtime, not Databricks Runtime for Genomics (deprecated)

  • Hail is not supported with Credential passthrough (legacy)

  • Hail is not supported with Glow, except when exporting from Hail to Glow
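The version rule in the first bullet can be checked before running a long job. The following is a minimal sketch, not part of Hail or Databricks: the helper names parse_version and spark_ok_for_hail are hypothetical, and the check encodes only the one rule stated above (Hail 0.2.65 and above expects Spark 3.1). This page states no requirement for older Hail versions, so the sketch does not guess one.

```python
import re


def parse_version(text):
    """Extract the leading numeric parts of a version string as a tuple of ints.

    Trailing suffixes (for example a build hash after a dash) are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(group) for group in match.groups() if group is not None)


def spark_ok_for_hail(hail_version, spark_version):
    """Return False only when the stated rule is violated.

    Assumption: Hail versions below 0.2.65 are not constrained here,
    because this page gives no Spark requirement for them.
    """
    if parse_version(hail_version) >= (0, 2, 65):
        return parse_version(spark_version)[:2] == (3, 1)
    return True
```

In a notebook, the two arguments would typically come from the running environment, for example spark_ok_for_hail(hl.version(), spark.version), assuming both return dotted version strings.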

Create a cluster

Install Hail via Docker with Databricks Container Services.

Containers that set up a Hail environment are listed on the ProjectGlow Dockerhub page. Use projectglow/databricks-hail:<hail-version>, replacing <hail-version> with an available Hail version.

  1. Create a jobs cluster with Hail

    1. Set up the Databricks CLI.

    2. Create a cluster using the Hail Docker container, setting the tag to the desired <hail-version>.

    3. Create the job from a job definition such as the example below. Before you run the command, edit notebook_path, <databricks-runtime-version>, and <hail-version>.

    databricks jobs create --json-file hail-create-job.json
    

    hail-create-job.json:

{
  "name": "hail-job",
  "notebook_task": {
    "notebook_path" : "/Users/<user@organization.com>/hail/docs/hail-tutorial"
  },
  "new_cluster": {
    "spark_version": "<databricks-runtime-version>.x-scala2.12",
    "aws_attributes": {
      "availability": "SPOT",
      "first_on_demand": 1
    },
    "node_type_id": "r5d.4xlarge",
    "num_workers": 32,
    "docker_image": {
      "url": "projectglow/databricks-hail:<hail-version>"
    }
  }
}
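If you create jobs for several Hail or runtime versions, the definition above can be generated rather than edited by hand. This is a minimal sketch under stated assumptions: build_hail_job and render_job_json are hypothetical helper names, and the field values mirror the example definition above rather than any recommended configuration.

```python
import json


def build_hail_job(notebook_path, runtime_version, hail_version, num_workers=32):
    """Build a job definition with the same fields as hail-create-job.json.

    runtime_version is the value substituted for <databricks-runtime-version>,
    and hail_version is the Docker image tag substituted for <hail-version>.
    """
    return {
        "name": "hail-job",
        "notebook_task": {"notebook_path": notebook_path},
        "new_cluster": {
            "spark_version": f"{runtime_version}.x-scala2.12",
            "aws_attributes": {"availability": "SPOT", "first_on_demand": 1},
            "node_type_id": "r5d.4xlarge",
            "num_workers": num_workers,
            "docker_image": {
                "url": f"projectglow/databricks-hail:{hail_version}"
            },
        },
    }


def render_job_json(job):
    """Serialize the job definition as indented JSON text."""
    return json.dumps(job, indent=2)
```

Writing the output of render_job_json to hail-create-job.json produces a file that can be passed to the databricks jobs create command shown above.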

Use Hail in a notebook

For the most part, Hail on Databricks works as described in the Hail documentation. However, the Databricks environment requires a few modifications.

Initialize Hail

When initializing Hail, pass in the pre-created SparkContext and mark the initialization as idempotent. This setting enables multiple Databricks notebooks to use the same Hail context.

Note

Set skip_logging_configuration=True to save logs to the rolling driver log4j output. This setting is supported only in Hail 0.2.39 and above.

import hail as hl
hl.init(sc, idempotent=True, quiet=True, skip_logging_configuration=True)
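Because skip_logging_configuration is supported only in Hail 0.2.39 and above, a notebook that must run against several Hail versions can build the keyword arguments conditionally. The sketch below is illustrative only: hail_init_kwargs is a hypothetical helper, not part of Hail, and it assumes the version string begins with three dot-separated numbers.

```python
import re


def hail_init_kwargs(hail_version):
    """Return keyword arguments for hl.init that suit the given Hail version.

    skip_logging_configuration is included only for Hail 0.2.39 and above,
    which is the support threshold stated in the note above.
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", hail_version)
    if match is None:
        raise ValueError(f"unrecognized Hail version: {hail_version!r}")
    version = tuple(int(group) for group in match.groups())

    kwargs = {"idempotent": True, "quiet": True}
    if version >= (0, 2, 39):
        kwargs["skip_logging_configuration"] = True
    return kwargs
```

A call such as hl.init(sc, **hail_init_kwargs(hl.version())) would then pass the same arguments as the example above on recent Hail versions, assuming hl.version() returns a string in that form.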

Display Bokeh plots

Hail uses the Bokeh library to create plots. The show function built into Bokeh does not work in Databricks. To display a Bokeh plot generated by Hail, render the plot to HTML and pass the result to displayHTML, for example:

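from bokeh.embed import file_html
from bokeh.resources import CDN

# mt is assumed to be an existing Hail MatrixTable with a DP entry field.
plot = hl.plot.histogram(mt.DP, range=(0, 30), bins=30, title='DP Histogram', legend='DP')
# Render the plot as a standalone HTML page, then display it in the notebook.
html = file_html(plot, CDN, "Chart")
displayHTML(html)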

See the Bokeh documentation for more information.