Databricks Runtime 7.5 for Genomics (Unsupported)

Databricks released this image in December 2020.

Databricks Runtime 7.5 for Genomics is a version of Databricks Runtime 7.5 (Unsupported) optimized for working with genomic and biomedical data. It is a component of the Databricks Unified Analytics Platform for Genomics.

Note

Databricks Runtime for Genomics is deprecated. Databricks is no longer building new Databricks Runtime for Genomics releases and will remove support for Databricks Runtime for Genomics on September 24, 2022, when Databricks Runtime for Genomics 7.3 LTS support ends. At that point Databricks Runtime for Genomics will no longer be available for selection when you create a cluster. For more information about the Databricks Runtime deprecation policy and schedule, see Supported Databricks runtime releases and support schedule.

For more information, including instructions for creating a Databricks Runtime for Genomics cluster, see Databricks Runtime for Genomics (Deprecated). For more information on developing genomics applications, see Genomics guide.

New features

Databricks Runtime 7.5 for Genomics is built on top of Databricks Runtime 7.5. For information on what’s new in Databricks Runtime 7.5, see the Databricks Runtime 7.5 (Unsupported) release notes.

Conversion from Hail MatrixTable to Spark DataFrame

Glow now provides the from_matrix_table function, which imports Hail MatrixTables as Spark DataFrames.

pandas-based linear regression with offset

Glow now offers the linear_regression function in Python to test association between genotypes and one or more phenotypes (Step 2 of GloWGR). This function is significantly faster than the Glow linear_regression_gwas function (up to 8x on 25 phenotypes) and is designed to work seamlessly with the output of Step 1 of GloWGR by accepting an offset argument. You can also specify covariates and control whether an intercept is included in the fit.
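
To illustrate what an offset does in a regression of this kind, here is a minimal pure-Python sketch. It is not Glow's implementation and does not call the Glow API. The function name regress_with_offset is hypothetical, the sketch handles a single variant and a single phenotype without covariates, and it assumes the offset is subtracted from the phenotype before fitting.

```python
from typing import Sequence, Tuple


def regress_with_offset(
    genotypes: Sequence[float],
    phenotypes: Sequence[float],
    offsets: Sequence[float],
) -> Tuple[float, float]:
    """Fit (phenotype - offset) = intercept + beta * genotype by least squares.

    Hypothetical helper for illustration only; returns (intercept, beta).
    """
    n = len(genotypes)
    if not (n == len(phenotypes) == len(offsets)) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")

    # Assumption: the offset is removed from the phenotype first, so the fit
    # only has to explain what the earlier step left unexplained.
    adjusted = [y - o for y, o in zip(phenotypes, offsets)]

    mean_g = sum(genotypes) / n
    mean_a = sum(adjusted) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    if sxx == 0:
        raise ValueError("genotypes have no variance")
    sxy = sum((g - mean_g) * (a - mean_a) for g, a in zip(genotypes, adjusted))

    beta = sxy / sxx
    intercept = mean_a - beta * mean_g
    return intercept, beta
```

In this sketch, the offset plays the role of the per-sample prediction from Step 1 of GloWGR, so the estimated effect reflects only the signal that remains after that prediction is removed.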

Improvements

Fast VCF reader by default

As of this release, the fast reader is the default VCF reader. To use the htsjdk-based reader instead, set the Spark config io.projectglow.vcf.fastReaderEnabled to false.
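
As a minimal sketch, the setting could be applied from Python as follows. The helper name use_htsjdk_vcf_reader is hypothetical, and the sketch assumes that setting the value on the session's runtime configuration is sufficient. If it is not, set the same key in the cluster's Spark config instead.

```python
def use_htsjdk_vcf_reader(spark) -> None:
    """Disable the fast VCF reader on the given SparkSession.

    Hypothetical helper; `spark` is any object exposing `conf.set(key, value)`,
    such as the SparkSession predefined in a notebook.
    """
    # Config key taken from the release note above.
    spark.conf.set("io.projectglow.vcf.fastReaderEnabled", "false")
```

In a notebook, where a session named spark is already defined, the call would be use_htsjdk_vcf_reader(spark), made before reading any VCF files.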

Hard calls option for BGEN reader

The BGEN reader in Glow now accepts the new boolean emitHardCalls option, which generates hard calls for samples when reading a BGEN file. This option is set to true by default. The probability threshold for hard calls is set by the new hardCallThreshold option (default = 0.9).
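
The idea of a thresholded hard call can be sketched in plain Python. This is an illustration only, not the reader's actual code. The function name hard_call is hypothetical, and the sketch assumes an unphased, diploid, biallelic site, a missing call encoded as -1, and a call being emitted when the highest probability is greater than or equal to the threshold.

```python
from typing import List, Sequence

# Assumed probability order: P(ref/ref), P(ref/alt), P(alt/alt).
_CALLS = ([0, 0], [0, 1], [1, 1])
_MISSING = [-1, -1]


def hard_call(probabilities: Sequence[float], threshold: float = 0.9) -> List[int]:
    """Return the most likely genotype, or a missing call if confidence is too low."""
    if len(probabilities) != len(_CALLS):
        raise ValueError("expected three genotype probabilities")

    # Index of the most probable genotype.
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])

    # Assumption: the comparison is inclusive of the threshold.
    if probabilities[best] >= threshold:
        return list(_CALLS[best])
    return list(_MISSING)
```

With the default threshold of 0.9, a sample with probabilities (0.95, 0.04, 0.01) would be called homozygous reference in this sketch, while (0.5, 0.45, 0.05) would be left as a missing call.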

Improvements to joint genotyping pipeline

The joint genotyping pipeline now translates the targeted regions file into a filter that is pushed down to the VCF data source, where the tabix index can be used for filtering. Previously, a range join was used for this purpose. This change reduces ingest time when a targeted regions file with fewer than 25 regions is provided and the input consists of tabix-indexed, bgzipped VCFs. In addition, the default bin size used in the pipeline was reduced to 5000. The smaller bin size reduces skew, which speeds up shuffling and results in a faster pipeline.
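
The translation from regions to a filter can be pictured with a short Python sketch. It is a hypothetical illustration, not the pipeline's code. The function name regions_to_filter is invented here, regions are assumed to be 0-based, half-open (contig, start, end) tuples, and the column names contigName, start, and end are assumed to match the VCF DataFrame schema.

```python
from typing import Iterable, Tuple

Region = Tuple[str, int, int]  # (contig, start, end), 0-based and half-open


def regions_to_filter(regions: Iterable[Region]) -> str:
    """Build a SQL predicate that matches rows overlapping any of the regions."""
    clauses = []
    for contig, start, end in regions:
        if start >= end:
            raise ValueError(f"empty region: {contig}:{start}-{end}")
        # Two half-open intervals overlap when each one starts before the
        # other ends. `end` is backtick-quoted because it is also a SQL keyword.
        clauses.append(
            f"(contigName = '{contig}' AND start < {end} AND `end` > {start})"
        )
    # With no regions, keep every row.
    return " OR ".join(clauses) if clauses else "true"
```

A predicate like this could be passed to a DataFrame filter call. Expressing the regions as simple comparisons on contig and position, rather than as a join, is what gives the data source the chance to use the tabix index.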

Libraries

The following sections list the libraries included in Databricks Runtime 7.5 for Genomics that differ from those included in Databricks Runtime 7.5.

Packaged libraries

Library Version
ADAM 0.32.0
GATK 4.1.4.1
Hail 0.2.58
Hadoop-bam 7.9.2
samtools 1.9
VEP 96