Databricks Runtime 7.3 LTS for Genomics

Databricks released this image in September 2020. It was declared Long Term Support (LTS) in October 2020.

Databricks Runtime 7.3 LTS for Genomics is a version of Databricks Runtime 7.3 LTS optimized for working with genomic and biomedical data. It is a component of the Databricks Unified Analytics Platform for Genomics.

Note

Databricks Runtime for Genomics is deprecated. Databricks is no longer building new Databricks Runtime for Genomics releases and will remove support for Databricks Runtime for Genomics on September 24, 2022, when Databricks Runtime for Genomics 7.3 LTS support ends. At that point Databricks Runtime for Genomics will no longer be available for selection when you create a cluster. For more information about the Databricks Runtime deprecation policy and schedule, see Supported Databricks runtime releases and support schedule.

For more information, including instructions for creating a Databricks Runtime for Genomics cluster, see Databricks Runtime for Genomics (Deprecated). For more information on developing genomics applications, see Genomics guide.

For help with migration from Databricks Runtime 6.x to Databricks Runtime 7.3 LTS, see Databricks Runtime 7.x migration guide.

New features

Databricks Runtime 7.3 LTS for Genomics is built on top of Databricks Runtime 7.3 LTS. For information on what’s new in Databricks Runtime 7.3 LTS, see the Databricks Runtime 7.3 LTS release notes.

Support for reading BGEN files with uncompressed or zstd-compressed genotypes

Glow now supports reading BGEN files containing SNP block probability data that is uncompressed or compressed using zstandard’s ZSTD_compress() function, in addition to the existing support for reading data compressed using zlib’s compress() function.
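As an illustration of how the three encodings are distinguished, the following minimal sketch decodes the compression mode from a BGEN header. It assumes the published BGEN layout, in which the flags field is the last 4 bytes of the header block and bits 0-1 select the SNP block compression. The function and constant names are hypothetical and are not part of Glow.

```python
import struct

# Compression codes stored in bits 0-1 of the BGEN header flags field
# (per the BGEN format description; names here are illustrative only).
COMPRESSION_BY_CODE = {0: "uncompressed", 1: "zlib", 2: "zstd"}


def flags_from_header_block(header_block: bytes) -> int:
    """Return the flags value, read as a little-endian uint32 from the
    last 4 bytes of the header block."""
    if len(header_block) < 20:
        raise ValueError("BGEN header block must be at least 20 bytes")
    (flags,) = struct.unpack("<I", header_block[-4:])
    return flags


def snp_block_compression(flags: int) -> str:
    """Map bits 0-1 of the flags value to a compression name."""
    code = flags & 0b11
    if code not in COMPRESSION_BY_CODE:
        raise ValueError(f"unknown BGEN compression code: {code}")
    return COMPRESSION_BY_CODE[code]


# Build a minimal 20-byte header block for demonstration: header length,
# variant count, sample count, magic bytes, then flags (layout 2, zstd).
demo_flags = (2 << 2) | 2
demo_header = struct.pack("<III4sI", 20, 0, 0, b"bgen", demo_flags)
print(snp_block_compression(flags_from_header_block(demo_header)))
```

A reader can branch on the returned name to choose a decompressor. The sketch only inspects the header and does not decode any genotype data.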

Improvements

Variant liftOver performance

Performing variant liftOver with Glow is now up to 12x faster.

Faster big file upload to ABFS

Writing big files (such as VCF, BGEN and BAM) to the Azure Blob File System is now up to 2x faster.

Performance of DNASeq pipeline on autoscaling clusters

The DNASeq pipeline is now better tuned for autoscaling clusters.

Pipelines output bgzipped VCFs by default

All genomics pipelines now compress output VCFs with bgzip by default. Previously, output VCFs were uncompressed by default. To change this behavior, set the vcfCompressionCodec pipeline option to a value other than the default, bgzf.
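To confirm whether a given output file is bgzipped rather than plain gzip or uncompressed, you can inspect its leading bytes. The following is a minimal sketch, assuming the BGZF layout from the SAM/BAM specification: a gzip member with the FEXTRA flag set and a "BC" extra subfield of length 2. The helper name is hypothetical and the check is not part of the genomics pipelines.

```python
import struct

# The standard 28-byte BGZF end-of-file marker block, used here as sample input.
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00 "
    "1b 00 03 00 00 00 00 00 00 00 00 00"
)


def looks_like_bgzf(data: bytes) -> bool:
    """Return True if data starts with a gzip header carrying the BGZF
    'BC' extra subfield."""
    # The fixed gzip header is 10 bytes; XLEN follows when FEXTRA is set.
    if len(data) < 12:
        return False
    if data[0] != 0x1F or data[1] != 0x8B or data[2] != 8:
        return False  # not a gzip/deflate member
    if not data[3] & 0x04:
        return False  # FEXTRA flag not set, so no extra field
    (xlen,) = struct.unpack_from("<H", data, 10)
    extra = data[12:12 + xlen]
    pos = 0
    # Walk the extra subfields: SI1, SI2, 2-byte SLEN, then SLEN payload bytes.
    while pos + 4 <= len(extra):
        si1, si2 = extra[pos], extra[pos + 1]
        (slen,) = struct.unpack_from("<H", extra, pos + 2)
        if (si1, si2, slen) == (66, 67, 2):  # 'B', 'C', 2-byte block size
            return True
        pos += 4 + slen
    return False


print(looks_like_bgzf(BGZF_EOF))
```

In practice you would pass the first few dozen bytes of the output VCF, for example `open(path, "rb").read(64)`, where `path` is a placeholder for your own file location.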

Refactors

TNSeq pipeline renamed to MutSeq

The Tumor/Normal pipeline has been renamed from TNSeq to MutSeq.

Libraries

The following sections list the libraries included in Databricks Runtime 7.3 LTS for Genomics that differ from those included in Databricks Runtime 7.3 LTS.

Packaged libraries

Library       Version
ADAM          0.32.0
GATK          4.1.4.1
Hadoop-bam    7.9.2
samtools      1.9
VEP           96