Databricks Runtime 6.4 for Genomics (unsupported)

Databricks released this image in February 2020.

Databricks Runtime for Genomics (Databricks Runtime Genomics) is a variant of Databricks Runtime 6.4 (unsupported) optimized for working with genomic and biomedical data. It is a component of the Databricks Unified Analytics Platform for Genomics.

Important

This documentation has been retired and might not be updated. The products, services, or technologies mentioned in this content are no longer supported.

The Databricks Genomics runtime has been deprecated. For open source equivalents, see the genomics-pipelines and Glow repositories. Bioinformatics libraries that were part of the runtime have been released as a Docker container, which can be pulled from the ProjectGlow Docker Hub page.

For more information about the Databricks Runtime deprecation policy and schedule, see All supported Databricks Runtime releases.

For more information, including instructions for creating a Databricks Runtime for Genomics cluster and guidance on developing genomics applications, see Genomics guide.

New features

Databricks Runtime 6.4 for Genomics is built on top of Databricks Runtime 6.4. For information on what’s new in Databricks Runtime 6.4, see the Databricks Runtime 6.4 (unsupported) release notes.

DNASeq pipeline customizations

The DNASeq pipeline in Databricks Runtime 6.4 for Genomics can now be customized. Pipeline users can selectively disable any valid combination of the read alignment, variant calling, and variant annotation stages. Users can also perform single-end read alignment.

Python and Scala APIs

The version of Glow bundled with Databricks Runtime 6.4 for Genomics provides Python and Scala APIs for functions previously exposed only via SQL expressions. These functions are available for DataFrame operations, providing improved compile-time safety.

Improvements

Flattened variant schema

The DNASeq and joint genotyping pipelines output variant data in a flattened schema to Delta Lake.

Improved variant normalizer

The variant normalizer in Glow and Databricks Runtime 6.4 for Genomics is about 2.5x faster than the version in Databricks Runtime 6.3 for Genomics. The new normalizer can be invoked as a transformer as well as a SQL function, preserves the original schema, and provides improved fault tolerance.
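
For readers unfamiliar with what a variant normalizer does, the following is a minimal pure-Python sketch of the widely used left-align-and-trim procedure for a single biallelic variant. It is not Glow's implementation and does not use the Glow API. The function name normalize_variant, its parameters, and the toy reference sequence are all assumptions made for illustration.

```python
def normalize_variant(pos, ref, alt, reference):
    """Left-align and trim one biallelic variant (illustrative sketch only).

    pos       -- 1-based position of the first base of `ref` (hypothetical parameter)
    ref, alt  -- reference and alternate allele strings, assumed to differ
    reference -- the contig sequence as a plain string (toy input, not a FASTA reader)
    """
    if ref == alt:
        raise ValueError("ref and alt must differ")

    changed = True
    while changed:
        changed = False
        # Drop the last base while both alleles end with the same nucleotide.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both alleles one base to the left.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot left-extend past the start of the contig")
            pos -= 1
            base = reference[pos - 1]  # convert 1-based position to 0-based index
            ref, alt = base + ref, base + alt
            changed = True

    # Trim shared leading bases, keeping at least one base in each allele.
    while len(ref) >= 2 and len(alt) >= 2 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1

    return pos, ref, alt


# Toy contig with a CA repeat; a CA deletion can be written at several positions.
contig = "GGGCACACACAGGG"
deletion = normalize_variant(8, "CAC", "C", contig)
snp = normalize_variant(5, "ACA", "ATA", contig)
```

In this sketch, a CA deletion reported inside the repeat is shifted to its leftmost equivalent position, and a padded SNP is trimmed to a single base, so equivalent representations of the same variant collapse to one form.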

Libraries

The following libraries included in Databricks Runtime 6.4 for Genomics differ from those included in Databricks Runtime 6.4.

Library       Version
ADAM          0.28.0
Hadoop-bam    7.9.2
Hail          0.2.26
GATK          4.0.11.0
samtools      1.9
VEP           96