Databricks Runtime 7.3 LTS for Genomics (Unsupported)

Databricks released this image in September 2020. It was declared Long Term Support (LTS) in October 2020.

Databricks Runtime 7.3 LTS for Genomics is a version of Databricks Runtime 7.3 LTS optimized for working with genomic and biomedical data. It is a component of the Databricks Unified Analytics Platform for Genomics.

Important

This documentation has been retired and might not be updated. The products, services, or technologies mentioned in this content are no longer supported.

The Databricks Genomics runtime has been deprecated. For open source equivalents, see the genomics-pipelines and Glow repositories. Bioinformatics libraries that were part of the runtime have been released as a Docker container, which can be pulled from the ProjectGlow Docker Hub page.

For more information about the Databricks Runtime deprecation policy and schedule, see Supported Databricks runtime releases and support schedule.

For more information, including instructions for creating a Databricks Runtime for Genomics cluster and for developing genomics applications, see the Genomics guide.

For help with migration from Databricks Runtime 6.x to Databricks Runtime 7.3 LTS, see Databricks Runtime 7.x migration guide.

New features

Databricks Runtime 7.3 LTS for Genomics is built on top of Databricks Runtime 7.3 LTS. For information on what’s new in Databricks Runtime 7.3 LTS, see the Databricks Runtime 7.3 LTS release notes.

Support for reading BGEN files with uncompressed or zstd-compressed genotypes

Glow can now read BGEN files whose SNP block probability data is uncompressed or compressed with Zstandard's ZSTD_compress() function. This is in addition to the existing support for data compressed with zlib's compress() function.
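One way to tell which of these compression schemes a given BGEN file uses is to inspect the flags field in its header. The sketch below follows the published BGEN file format specification, not Glow's reader. The names bgen_compression and make_header are hypothetical, and make_header only builds a synthetic header for demonstration.

```python
import struct

# Values of the CompressedSNPBlocks field (bits 0-1 of the header flags),
# as defined in the BGEN file format specification.
COMPRESSION_NAMES = {0: "uncompressed", 1: "zlib", 2: "zstd"}


def bgen_compression(header_bytes: bytes) -> str:
    """Return the genotype block compression named in a BGEN header.

    header_bytes must hold the start of the file: the 4-byte offset
    followed by the complete header block.
    """
    if len(header_bytes) < 8:
        raise ValueError("truncated BGEN header")
    # Bytes 0-3: offset of the first variant block; bytes 4-7: header length.
    _offset, header_len = struct.unpack_from("<II", header_bytes, 0)
    if header_len < 20 or len(header_bytes) < 4 + header_len:
        raise ValueError("truncated or malformed BGEN header")
    # The flags occupy the last 4 bytes of the header block.
    (flags,) = struct.unpack_from("<I", header_bytes, header_len)
    code = flags & 0b11
    if code not in COMPRESSION_NAMES:
        raise ValueError(f"unknown compression code {code}")
    return COMPRESSION_NAMES[code]


def make_header(compression_code: int, layout: int = 2) -> bytes:
    """Build a minimal synthetic header (hypothetical helper, demo only)."""
    header_len = 20  # no free data area
    flags = compression_code | (layout << 2)
    return (
        struct.pack("<II", header_len, header_len)  # offset, header length
        + struct.pack("<II", 0, 0)                  # variant and sample counts
        + b"bgen"                                   # magic number
        + struct.pack("<I", flags)
    )


if __name__ == "__main__":
    for code in (0, 1, 2):
        print(code, bgen_compression(make_header(code)))
```

To check a real file, pass the first bytes of the file, for example open(path, "rb").read(4096), assuming the header's free data area fits in that window.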

Improvements

Variant liftOver performance

Performing variant liftOver with Glow is now up to 12x faster.

Faster big file upload to ABFS

Writing large files (such as VCF, BGEN, and BAM) to the Azure Blob File System (ABFS) is now up to 2x faster.

Performance of DNASeq pipeline on autoscaling clusters

The DNASeq pipeline is now better tuned for autoscaling clusters.

Pipelines output bgzipped VCFs by default

All genomics pipelines now compress output VCFs with bgzip by default. Previously, output VCFs were uncompressed by default. To change this behavior, set the vcfCompressionCodec pipeline option to a value other than its default, bgzf.
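Downstream tools that expect bgzipped input can verify an output file before using it. The following is a minimal sketch, assuming the BGZF block layout described in the SAM/BAM specification: a gzip member whose extra field contains a "BC" subfield. The function name is_bgzf is hypothetical and is not part of the pipelines.

```python
import gzip
import struct

# The standard 28-byte empty BGZF block that marks end of file.
BGZF_EOF = (
    b"\x1f\x8b\x08\x04"          # gzip magic, deflate, FEXTRA flag
    b"\x00\x00\x00\x00\x00\xff"  # mtime, extra flags, OS
    b"\x06\x00"                  # XLEN = 6
    b"\x42\x43\x02\x00\x1b\x00"  # 'B', 'C', SLEN = 2, BSIZE = 27
    b"\x03\x00"                  # empty deflate stream
    b"\x00\x00\x00\x00"          # CRC32
    b"\x00\x00\x00\x00"          # uncompressed size
)


def is_bgzf(data: bytes) -> bool:
    """Return True if data starts with a BGZF block (hypothetical helper)."""
    # A BGZF block is a gzip member using deflate with the FEXTRA flag set.
    if len(data) < 12 or data[0:2] != b"\x1f\x8b" or data[2] != 8:
        return False
    if not data[3] & 0x04:
        return False
    (xlen,) = struct.unpack_from("<H", data, 10)
    extra = data[12:12 + xlen]
    # Walk the extra subfields looking for the 'BC' subfield of length 2.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if si1 == 66 and si2 == 67 and slen == 2:
            return True
        pos += 4 + slen
    return False


if __name__ == "__main__":
    print(is_bgzf(BGZF_EOF))                  # BGZF end-of-file block
    print(is_bgzf(gzip.compress(b"##vcf")))   # plain gzip, not BGZF
```

Only the first block is inspected, so this is a quick sanity check rather than a full validation of the file.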

Refactors

TNSeq pipeline renamed to MutSeq

The Tumor/Normal pipeline has been renamed from TNSeq to MutSeq.

Libraries

The following section lists the libraries included in Databricks Runtime 7.3 LTS for Genomics that differ from those included in Databricks Runtime 7.3 LTS.

Packaged libraries

Library       Version
ADAM          0.32.0
GATK          4.1.4.1
Hadoop-bam    7.9.2
samtools      1.9
VEP           96