Databricks Runtime 7.0 for Genomics (Unsupported)
Databricks released this image in June 2020.
Databricks Runtime 7.0 for Genomics is a version of Databricks Runtime 7.0 (Unsupported) optimized for working with genomic and biomedical data. It is a component of the Databricks Unified Analytics Platform for Genomics.
For more information, including instructions for creating a Databricks Runtime for Genomics cluster and for developing genomics applications, see the Genomics guide.
New features
Databricks Runtime 7.0 for Genomics is built on top of Databricks Runtime 7.0. For information on what’s new in Databricks Runtime 7.0, see the Databricks Runtime 7.0 (Unsupported) release notes.
GloWGR: Whole genome regression
Glow now includes a scalable whole genome regression method, GloWGR. GloWGR is a distributed version of the single-node tool regenie. GloWGR is an enterprise-ready tool that provides equivalent accuracy to other methods for whole-genome regression, but with an order-of-magnitude improvement in speed. For details, see the whole genome regression documentation in open source Glow.
Transformers accept non-string typed arguments
All Glow transformers, including the pipe transformer and variant normalizer, now accept arguments whose values are not strings. The Glow documentation for the pipe transformer reflects the new usage. For backwards compatibility, string values are still accepted for all arguments.
Numpy ndarray literals
You can now pass literal numpy 1D and 2D float-typed ndarrays to functions that expect DataFrame columns with types array<double> and DenseMatrix respectively. The Glow genome-wide association study documentation demonstrates the new usage.
Mean substitution function
Glow now provides a mean_substitute function to substitute missing values in an array with the mean of the non-missing values.
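To make the substitution concrete, the following is a minimal pure-Python sketch of the arithmetic described above. It is not Glow's implementation. The function name mean_substitute_sketch, the -1 sentinel default, and the treatment of None and NaN as missing are assumptions made for illustration, as is the choice to return the input unchanged when no value is present.

```python
import math


def mean_substitute_sketch(values, missing_value=-1):
    """Replace missing entries with the mean of the non-missing entries.

    Hypothetical helper for illustration only; the sentinel default and the
    None/NaN handling are assumptions, not documented Glow behavior.
    """
    def is_missing(v):
        # Treat None, NaN, and the sentinel value as missing (assumed semantics).
        if v is None:
            return True
        if isinstance(v, float) and math.isnan(v):
            return True
        return v == missing_value

    present = [float(v) for v in values if not is_missing(v)]
    if not present:
        # Nothing to average; this sketch simply returns the input unchanged.
        return list(values)

    mean = sum(present) / len(present)
    return [mean if is_missing(v) else float(v) for v in values]


# Three observed values (0, 1, 2) have mean 1.0, which fills the missing slot.
print(mean_substitute_sketch([0, 1, -1, 2]))  # prints [0.0, 1.0, 1.0, 2.0]
```

In Glow itself the function operates on a DataFrame array column rather than a Python list; see the Glow documentation for the actual signature and for how missing values are defined.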
Improvements
Joint genotyping performance
The performance of the joint genotyping pipeline has improved by 5-20%. The improvement is most pronounced on cluster node types with many cores per node.
VCF reader ignores tabix index files
In previous releases, the VCF reader could fail when reading a directory of VCF files if the directory contained tabix index files. The reader would attempt to interpret the tabix files as VCF files and report an error. Now, the reader only uses index files to determine which data files to read.
Removed splitToBiallelic option from VCF reader
This option has been removed in favor of the split_multiallelics transformer. The transformer is faster and more accurate than the VCF reader option.