Analyzing 1000 Genomes with Spark and Hail

Hail is an open-source platform built on Spark for genomic data analysis.

This tutorial will give you a sense of Hail’s basic features. It consists of a set of four notebooks:

  • Deployment: how to deploy the Hail framework on Databricks.
  • Overview: a broad overview of Hail’s functionality, with emphasis on manipulating and querying a genetic dataset.
  • Introduction to the Expression Language: the basics of the Hail expression language, with hands-on practice in its type system, syntax, and functionality.
  • Expression Language Part 2: how to use the Hail expression language to query, filter, and annotate the 1000 Genomes dataset from the overview.

To run the notebooks:

  1. Download the notebook archive and import it into Databricks.

  2. Download libraries for Spark 2.1.1 and Scala 2.11:

    You can build Hail for other versions of Spark by following this tutorial.

  3. Follow the steps in the Deployment notebook to upload the downloaded files and create Databricks libraries from them.