Hail is not yet supported on Apache Spark 3.0, and is therefore not available in Databricks Runtime 7.x for Genomics. Hail is supported in all releases of Databricks Runtime 6.x for Genomics.
To create a cluster with Hail installed:
Set the following environment variable:
This environment variable causes the cluster to launch with Hail 0.2, its dependencies, and Python 3.6 installed.
For the most part, Hail 0.2 code runs in Databricks as described in the Hail documentation. However, a few modifications are necessary for the Databricks environment.
When initializing Hail, pass in the pre-created SparkContext and mark the initialization as idempotent. This setting enables multiple Databricks notebooks to use the same Hail context.
Enable skip_logging_configuration to save logs to the rolling driver log4j output. This setting is supported only in Databricks Runtime 6.6 for Genomics.
    import hail as hl
    hl.init(sc, idempotent=True, quiet=True, skip_logging_configuration=True)
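Because skip_logging_configuration is supported only in Databricks Runtime 6.6 for Genomics, a notebook shared across 6.x clusters may want to pass that flag conditionally. The following is a minimal sketch, not part of Hail or Databricks: `hail_init_kwargs` is a hypothetical helper, and it assumes the caller supplies the runtime version as a "major.minor" string.

```python
def hail_init_kwargs(runtime_version):
    """Build keyword arguments for hl.init from a version string such as "6.6".

    Hypothetical helper for illustration; the caller is assumed to know the
    Databricks Runtime for Genomics version of the cluster.
    """
    # Use only the major and minor components; later components are ignored.
    major, minor = (int(part) for part in runtime_version.split(".")[:2])

    # Arguments that the initialization described above always passes.
    kwargs = {"idempotent": True, "quiet": True}

    # skip_logging_configuration is supported only in runtime 6.6.
    if (major, minor) == (6, 6):
        kwargs["skip_logging_configuration"] = True
    return kwargs


# In a notebook, with the pre-created SparkContext `sc`:
#   import hail as hl
#   hl.init(sc, **hail_init_kwargs("6.6"))
```

Keeping the version check in one place means the same notebook can initialize Hail on clusters that do not accept the logging flag.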
Hail uses the Bokeh library to create plots. The
show function built into Bokeh does not work
in Databricks. To display a Bokeh plot generated by Hail, you can run a command like:
    from bokeh.embed import components, file_html
    from bokeh.resources import CDN
    plot = hl.plot.histogram(mt.DP, range=(0,30), bins=30, title='DP Histogram', legend='DP')
    html = file_html(plot, CDN, "Chart")
    displayHTML(html)
See the Bokeh documentation for more information.
- When Hail support is enabled, your cluster uses Python 3.6, so notebooks written against different versions of Python may not work.
- When Hail support is enabled, fewer Python libraries are installed by default. You can still use the Libraries feature to install new libraries.
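Both limitations can be checked from a notebook before running existing code. The sketch below uses only the Python standard library; `python_matches` and `missing_modules` are hypothetical helper names, and the module list in the usage comment is purely illustrative.

```python
import importlib.util
import sys


def python_matches(expected=(3, 6), version_info=None):
    """Return True when the interpreter's major.minor version equals `expected`."""
    info = sys.version_info if version_info is None else version_info
    return tuple(info[:2]) == tuple(expected)


def missing_modules(names):
    """Return the top-level module names in `names` that cannot be found here."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


# Example usage in a notebook cell (module names are illustrative):
#   if not python_matches((3, 6)):
#       print("This cluster is not running Python 3.6")
#   print("Not installed:", missing_modules(["numpy", "pandas"]))
```

Any module reported as missing can then be installed through the Libraries feature.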
After you’ve set up a Hail cluster, try out the Hail overview notebook.