Databricks Runtime 7.0 for Genomics (Unsupported)

Databricks released this image in June 2020.

Databricks Runtime 7.0 for Genomics is a version of Databricks Runtime 7.0 (Unsupported) optimized for working with genomic and biomedical data. It is a component of the Databricks Unified Analytics Platform for Genomics.

For more information, including instructions for creating a Databricks Runtime for Genomics cluster and for developing genomics applications, see the Genomics guide.

New features

Databricks Runtime 7.0 for Genomics is built on top of Databricks Runtime 7.0. For information on what’s new in Databricks Runtime 7.0, see the Databricks Runtime 7.0 (Unsupported) release notes.

GloWGR: Whole genome regression

Glow now includes GloWGR, a scalable whole genome regression method. GloWGR is a distributed version of the single-node tool regenie. It is an enterprise-ready tool that provides accuracy equivalent to other whole genome regression methods, with an order-of-magnitude improvement in speed. For details, see the whole genome regression documentation in open source Glow.

Transformers accept non-string typed arguments

All Glow transformers, including the pipe transformer and variant normalizer, now accept arguments whose values are not strings. The Glow documentation for the pipe transformer reflects the new usage. For backwards compatibility, string values are still accepted for all arguments.
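As an illustration of the new usage, the sketch below passes the pipe transformer's command as a Python list rather than a JSON-encoded string. The wrapper name pipe_through_command is hypothetical, and the option names (cmd, input_formatter, in_vcf_header, output_formatter) are assumed from the Glow pipe transformer documentation; check them against the Glow version on your cluster.

```python
def pipe_through_command(df, command):
    """Pipe a DataFrame of variants through a command-line tool.

    `command` is a plain Python list such as ["grep", "-v", "#INFO"].
    In earlier releases the same value had to be passed as a string,
    for example '["grep", "-v", "#INFO"]', which is still accepted.
    """
    # Imported lazily: requires a cluster with Glow installed.
    import glow

    return glow.transform(
        "pipe",
        df,
        cmd=command,               # list value, no string encoding needed
        input_formatter="vcf",     # assumed option names, see lead-in
        in_vcf_header="infer",
        output_formatter="text",
    )
```

On a Databricks Runtime for Genomics cluster, a call might look like pipe_through_command(variants_df, ["grep", "-v", "#INFO"]), where variants_df is a DataFrame read from VCF.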

NumPy ndarray literals

You can now pass literal NumPy float-typed ndarrays to functions that expect DataFrame columns: a 1D ndarray where a column of type array<double> is expected, and a 2D ndarray where a column of type DenseMatrix is expected. The Glow genome-wide association study documentation demonstrates the new usage.

Mean substitution function

Glow now provides a mean_substitute function to substitute missing values in an array with the mean of the non-missing values.
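The behavior described above can be sketched in plain Python. This is only an illustration of the semantics, not Glow's implementation: the function name mean_substitute_sketch is hypothetical, and treating -1, None, and NaN as missing is an assumption made for this example.

```python
import math


def mean_substitute_sketch(values, missing_value=-1):
    """Replace missing entries with the mean of the non-missing entries.

    Assumption for this sketch: an entry is missing if it is None, NaN,
    or equal to `missing_value`.
    """

    def is_missing(v):
        if v is None:
            return True
        if isinstance(v, float) and math.isnan(v):
            return True
        return v == missing_value

    present = [v for v in values if not is_missing(v)]
    if not present:
        # Nothing to average; this sketch returns the input unchanged.
        return list(values)

    mean = sum(present) / len(present)
    return [mean if is_missing(v) else v for v in values]


# Genotype-like states where -1 marks a missing call.
print(mean_substitute_sketch([0, 1, -1, 2]))  # [0, 1, 1.0, 2]
```

In Glow itself, mean_substitute is applied to a DataFrame array column rather than to a Python list.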

Improvements

Joint genotyping performance

The performance of the joint genotyping pipeline has improved by 5% to 20%. The improvement is most pronounced on cluster node types with many cores per node.

VCF reader ignores tabix index files

In previous releases, the VCF reader could fail when reading a directory of VCF files if the directory contained tabix index files. The reader would attempt to interpret the tabix files as VCF files and report an error. Now, the reader only uses index files to determine which data files to read.

Removed splitToBiallelic option from VCF reader

The splitToBiallelic option has been removed from the VCF reader in favor of the split_multiallelics transformer. The transformer is faster and more accurate than the VCF reader option.
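For code that previously set splitToBiallelic on the VCF reader, a migration might look like the following minimal sketch. The wrapper name split_multiallelic_variants is hypothetical; the sketch assumes the transformer is invoked through glow.transform by the name given above and that the input DataFrame was read with the VCF reader without the removed option.

```python
def split_multiallelic_variants(df):
    """Split multiallelic variants in `df` into biallelic rows.

    Intended as a replacement for reading VCF files with the removed
    splitToBiallelic reader option.
    """
    # Imported lazily: requires a cluster with Glow installed.
    import glow

    return glow.transform("split_multiallelics", df)
```

A typical call would first load the data, for example with spark.read.format("vcf").load(path), and then pass the resulting DataFrame to split_multiallelic_variants.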

Libraries

The following sections list the libraries included in Databricks Runtime 7.0 for Genomics that differ from those included in Databricks Runtime 7.0.

Upgraded libraries

  • ADAM: 0.30.0 to 0.32.0

Removed libraries

Hail is not included in Databricks Runtime 7.0 for Genomics because no Hail release is based on Apache Spark 3.0.

Packaged libraries

Library       Version
ADAM          0.32.0
GATK          4.1.4.1
Hadoop-bam    7.9.2
samtools      1.9
VEP           96