Hail 0.2

Note

Hail is supported in all releases of Databricks Runtime 6.x for Genomics and in Databricks Runtime 7.4 for Genomics and above.

Hail is a library built on Apache Spark for analyzing large genomic datasets. Hail 0.2 is integrated into Databricks Runtime for Genomics.

Create a Hail cluster

To create a cluster with Hail installed:

  1. Set the following environment variable:

    ENABLE_HAIL=true
    

    This environment variable causes the cluster to launch with Hail 0.2, its dependencies, and Python 3.6 installed.
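
To confirm from a notebook that the variable took effect, a minimal sketch like the following may help. It assumes the cluster environment variable is visible to the notebook's Python process; the helper name `hail_enabled` is hypothetical, and the case-insensitive comparison is a convenience, not documented behavior.

```python
import os


def hail_enabled(env=None):
    """Return True if ENABLE_HAIL is set to "true" in the given environment.

    env defaults to os.environ; passing a dict makes the check easy to test.
    The comparison ignores case and surrounding whitespace (an assumption).
    """
    env = os.environ if env is None else env
    return env.get("ENABLE_HAIL", "").strip().lower() == "true"


print("Hail enabled:", hail_enabled())
```

If this prints `False` on a cluster you expected to have Hail, check the cluster's environment variable settings and restart the cluster.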

Use Hail in a notebook

For the most part, Hail 0.2 code in Databricks works as described in the Hail documentation. However, a few modifications are necessary for the Databricks environment.

Initialization

When initializing Hail, pass in the pre-created SparkContext and mark the initialization as idempotent. This setting enables multiple Databricks notebooks to use the same Hail context.

Note

Enable skip_logging_configuration to save logs to the rolling driver log4j output. This setting is only supported in Databricks Runtime 6.6 for Genomics and above.

import hail as hl
hl.init(sc, idempotent=True, quiet=True, skip_logging_configuration=True)
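
Because skip_logging_configuration is supported only in Databricks Runtime 6.6 for Genomics and above, a notebook shared across runtime versions could build the initialization arguments conditionally. The following is a sketch under stated assumptions: the helper name `hail_init_kwargs` is hypothetical, and it expects you to supply the runtime version yourself as a "major.minor" string.

```python
def hail_init_kwargs(runtime_version):
    """Build keyword arguments for hl.init from a 'major.minor' version string.

    runtime_version is supplied by the caller (for example "6.6" or "7.4");
    how you obtain it is outside the scope of this sketch.
    """
    major, minor = (int(part) for part in runtime_version.split(".")[:2])
    kwargs = {"idempotent": True, "quiet": True}
    # skip_logging_configuration is only supported in runtime 6.6 and above.
    if (major, minor) >= (6, 6):
        kwargs["skip_logging_configuration"] = True
    return kwargs


# Usage in a notebook, where sc is the pre-created SparkContext:
# import hail as hl
# hl.init(sc, **hail_init_kwargs("6.6"))
```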

Plotting

Hail uses the Bokeh library to create plots. The show function built into Bokeh does not work in Databricks. To display a Bokeh plot generated by Hail, you can run a command like:

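from bokeh.embed import file_html
from bokeh.resources import CDN

# mt is an existing Hail MatrixTable with a DP entry field.
plot = hl.plot.histogram(mt.DP, range=(0, 30), bins=30, title='DP Histogram', legend='DP')
# Render the plot as a standalone HTML page that loads BokehJS from the CDN.
html = file_html(plot, CDN, "Chart")
# displayHTML is the Databricks notebook function for rendering HTML.
displayHTML(html)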

See the Bokeh documentation for more information.

Limitations

  • When Hail support is enabled, your cluster uses Python 3.6, so notebooks written against different versions of Python may not work.
  • When Hail support is enabled, fewer Python libraries are installed by default. You can use the Libraries feature to install new libraries.
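
A notebook that may run on clusters with and without Hail can guard against the Python version difference at the top. The sketch below uses only the standard library; the helper name `python_matches` is hypothetical.

```python
import sys


def python_matches(expected=(3, 6), actual=None):
    """Return True if the major.minor Python version equals expected.

    actual defaults to sys.version_info; pass a tuple to test other versions.
    """
    actual = sys.version_info if actual is None else actual
    return tuple(actual[:2]) == tuple(expected)


if not python_matches():
    print(
        "Warning: this interpreter is Python %d.%d, not the 3.6 used on "
        "Hail-enabled clusters." % tuple(sys.version_info[:2])
    )
```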

Convert to Glow

Use Glow’s Hail interoperation functions to convert variant data from a Hail MatrixTable to a Glow DataFrame.

Note

The conversion function from_matrix_table is available only in Databricks Runtime 7.5 for Genomics and above.

from glow.hail import functions
df = functions.from_matrix_table(mt, include_sample_ids=True)

After you’ve set up a Hail cluster, try out the Hail overview notebook.

Hail overview notebook
