Hail
Hail is a library built on Apache Spark for analyzing large genomic datasets.
Important
When you use Hail 0.2.65 and above, use Apache Spark version 3.1 (Databricks Runtime 8.x or 9.x).
Install Hail on Databricks Runtime, not Databricks Runtime for Genomics (deprecated).
Hail is not supported with Credential passthrough (legacy).
Hail is not supported with Glow, except when exporting from Hail to Glow.
Create a cluster
Install Hail via Docker with Databricks Container Services.
For containers to set up a Hail environment, see the ProjectGlow Dockerhub page.
Use projectglow/databricks-hail:<hail-version>, replacing the tag with an available Hail version.
Create a jobs cluster with Hail
Set up the Databricks CLI.
Create a cluster using the Hail Docker container, setting the tag to the desired <hail-version>. An example job definition is given below. Edit notebook_path, the Databricks Runtime version <databricks-runtime-version>, and <hail-version> before creating the job.
databricks jobs create --json-file hail-create-job.json
hail-create-job.json:
{
  "name": "hail-job",
  "notebook_task": {
    "notebook_path": "/Users/<user@organization.com>/hail/docs/hail-tutorial"
  },
  "new_cluster": {
    "spark_version": "<databricks-runtime-version>.x-scala2.12",
    "aws_attributes": {
      "availability": "SPOT",
      "first_on_demand": 1
    },
    "node_type_id": "r5d.4xlarge",
    "num_workers": 32,
    "docker_image": {
      "url": "projectglow/databricks-hail:<hail-version>"
    }
  }
}
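If you prefer to generate the job definition rather than edit it by hand, the following is a minimal sketch that builds the same structure in Python. The helper names build_hail_job and write_hail_job are hypothetical and not part of the Databricks CLI or Hail; the field values mirror the example above, and the placeholders are left unfilled on purpose.

```python
import json


def build_hail_job(notebook_path, runtime_version, hail_version, num_workers=32):
    """Return a job definition with the same shape as hail-create-job.json.

    Hypothetical helper for illustration; it only assembles a dictionary.
    """
    return {
        "name": "hail-job",
        "notebook_task": {"notebook_path": notebook_path},
        "new_cluster": {
            "spark_version": f"{runtime_version}.x-scala2.12",
            "aws_attributes": {"availability": "SPOT", "first_on_demand": 1},
            "node_type_id": "r5d.4xlarge",
            "num_workers": num_workers,
            "docker_image": {
                "url": f"projectglow/databricks-hail:{hail_version}"
            },
        },
    }


def write_hail_job(path, job):
    """Write the job definition to `path` as indented JSON."""
    with open(path, "w") as f:
        json.dump(job, f, indent=2)


# Placeholders are kept as-is; substitute real values before creating the job.
job = build_hail_job(
    notebook_path="/Users/<user@organization.com>/hail/docs/hail-tutorial",
    runtime_version="<databricks-runtime-version>",
    hail_version="<hail-version>",
)
print(json.dumps(job, indent=2))
```

Calling write_hail_job("hail-create-job.json", job) would produce a file that can be passed to the databricks jobs create command shown above.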
Use Hail in a notebook
For the most part, Hail in Databricks works as described in the Hail documentation. However, a few modifications are necessary for the Databricks environment.
Initialize Hail
When initializing Hail, pass in the pre-created SparkContext and mark the initialization as idempotent. This setting enables multiple Databricks notebooks to use the same Hail context.
Note
Enable skip_logging_configuration to save logs to the rolling driver log4j output. This setting is supported only in Hail 0.2.39 and above.
import hail as hl
hl.init(sc, idempotent=True, quiet=True, skip_logging_configuration=True)
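Because skip_logging_configuration is supported only in Hail 0.2.39 and above, a notebook shared across clusters with different Hail versions may want to add that argument conditionally. The following is a rough sketch of one way to do that. The helper names are hypothetical, and it assumes the version string looks like "0.2.65" with an optional "-<build hash>" suffix, which is the assumed format of the value returned by hl.version().

```python
def parse_hail_version(version):
    """Turn a string such as '0.2.65' or '0.2.65-abc123' into (0, 2, 65).

    Hypothetical helper; the optional '-<hash>' suffix is an assumption.
    """
    numeric = version.split("-")[0]
    return tuple(int(part) for part in numeric.split(".")[:3])


def hail_init_kwargs(hail_version):
    """Build keyword arguments for hl.init based on the Hail version."""
    kwargs = {"idempotent": True, "quiet": True}
    # skip_logging_configuration is supported only in Hail 0.2.39 and above
    if parse_hail_version(hail_version) >= (0, 2, 39):
        kwargs["skip_logging_configuration"] = True
    return kwargs


# In a Databricks notebook, where `sc` is the pre-created SparkContext:
#   import hail as hl
#   hl.init(sc, **hail_init_kwargs(hl.version()))
```

On Hail 0.2.39 and above this produces the same arguments as the hl.init call shown above.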
Display Bokeh plots
Hail uses the Bokeh library to create plots. The show function built into Bokeh does not work in Databricks. To display a Bokeh plot generated by Hail, you can run a command like:
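import hail as hl
from bokeh.embed import file_html
from bokeh.resources import CDN

# mt is assumed to be an existing Hail MatrixTable with a DP entry field
plot = hl.plot.histogram(mt.DP, range=(0, 30), bins=30, title='DP Histogram', legend='DP')
# Render the plot as standalone HTML and display it in the notebook
html = file_html(plot, CDN, "Chart")
displayHTML(html)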
See Bokeh for more information.