Databricks Runtime 6.3 for Genomics

Databricks released this image in January 2020.

Databricks Runtime for Genomics (Databricks Runtime Genomics) is a variant of Databricks Runtime 6.3 optimized for working with genomic and biomedical data. It is a component of the Databricks Unified Analytics Platform for Genomics.

For more information, including instructions for creating a Databricks Runtime for Genomics cluster, see Databricks Runtime for Genomics. For more information on developing genomics applications, see Genomics.

New features

Databricks Runtime 6.3 for Genomics is built on top of Databricks Runtime 6.3. For information on what’s new in Databricks Runtime 6.3, see the Databricks Runtime 6.3 release notes.

Joint genotyping pipeline from Delta

The joint genotyping pipeline in Databricks Runtime 6.3 for Genomics can now take Delta tables written by the DNASeq pipeline as input. This lets you run the two pipelines back to back without exporting intermediate results to gVCF files.

Automatic annotation parsing when reading VCFs

The version of Glow included in Databricks Runtime 6.3 for Genomics automatically parses CSQ and ANN INFO fields when reading VCFs. INFO_CSQ and INFO_ANN fields in the resulting DataFrames now have structured schemas for simplified querying.
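To make "structured schema" concrete, here is a minimal pure-Python sketch of the kind of parsing involved. It is not Glow's implementation, and Glow does this for you inside the DataFrame reader. The function name parse_csq, the field list, and the sample annotation string are all hypothetical. The sketch only assumes the common convention that annotations are comma-separated and their subfields pipe-separated.

```python
from typing import Dict, List


def parse_csq(info_value: str, field_names: List[str]) -> List[Dict[str, str]]:
    """Split a raw CSQ/ANN-style INFO value into one dict per annotation.

    Annotations are comma-separated; subfields inside one annotation are
    pipe-separated and matched positionally to field_names.
    """
    records = []
    for annotation in info_value.split(","):
        values = annotation.split("|")
        # Pad short entries so every record exposes the same keys.
        values += [""] * (len(field_names) - len(values))
        records.append(dict(zip(field_names, values)))
    return records


# Hypothetical subfield order; in a real VCF it is declared in the header.
CSQ_FIELDS = ["Allele", "Consequence", "IMPACT", "SYMBOL"]
raw = "T|missense_variant|MODERATE|BRCA1,T|intron_variant|MODIFIER|NBR2"

parsed = parse_csq(raw, CSQ_FIELDS)
# With named subfields, filtering becomes a lookup instead of string matching.
moderate_symbols = [r["SYMBOL"] for r in parsed if r["IMPACT"] == "MODERATE"]
```

Once the subfields are named, a query such as "symbols of all moderate-impact annotations" no longer needs ad hoc string splitting, which is the convenience the structured INFO_CSQ and INFO_ANN columns provide.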

Improvements

Improved multiallelic variant splitter

The multiallelic variant splitter in Glow and Databricks Runtime for Genomics now handles more complex types of multiallelic sites. The new behavior mirrors the vt decompose command-line tool. In addition, you can now use the splitter as a standalone transformer by calling glow.transform('split_multiallelics'....
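As a rough illustration of what splitting a multiallelic site means, the following pure-Python sketch turns one record with several alternate alleles into one biallelic record per allele. It is not Glow's or vt's implementation. The Variant class and split_multiallelic function are hypothetical, and the genotype convention used here (calls that point at a different alternate allele become missing, encoded as -1) is an assumption. Real tools also handle INFO and FORMAT fields and allele trimming, which this sketch omits.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Variant:
    contig: str
    start: int
    ref: str
    alts: List[str]
    calls: List[int]  # allele indices for one sample: 0 = ref, -1 = missing


def split_multiallelic(v: Variant) -> List[Variant]:
    """Return one biallelic Variant per alternate allele of v."""
    if len(v.alts) <= 1:
        return [v]  # already biallelic, nothing to split
    rows = []
    for alt_index, alt in enumerate(v.alts, start=1):
        new_calls = []
        for call in v.calls:
            if call in (0, -1):
                new_calls.append(call)  # ref and missing calls are unchanged
            elif call == alt_index:
                new_calls.append(1)     # this row's alt is now allele 1
            else:
                new_calls.append(-1)    # assumed: other alts become missing
        rows.append(Variant(v.contig, v.start, v.ref, [alt], new_calls))
    return rows


# A site with two alternate alleles and a heterozygous 1/2 genotype.
site = Variant("chr1", 100, "A", ["C", "G"], [1, 2])
rows = split_multiallelic(site)
for row in rows:
    print(row)
```

Each output row keeps the original position and reference allele, so downstream steps that expect biallelic input can process the rows independently.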

Faster linear and logistic regression functions

The logistic_regression_gwas function in Databricks Runtime 6.3 for Genomics is about 60% faster than the version in Databricks Runtime 6.2 for Genomics. The linear_regression_gwas function is about 50% faster.