Databricks Runtime 6.3 for Genomics (Unsupported)
Databricks released this image in January 2020.
Databricks Runtime for Genomics (Databricks Runtime Genomics) is a variant of Databricks Runtime 6.3 (Unsupported) optimized for working with genomic and biomedical data. It is a component of the Databricks Unified Analytics Platform for Genomics.
For more information, including instructions for creating a Databricks Runtime for Genomics cluster and guidance on developing genomics applications, see Genomics guide.
New features
Databricks Runtime 6.3 for Genomics is built on top of Databricks Runtime 6.3. For information on what’s new in Databricks Runtime 6.3, see the Databricks Runtime 6.3 (Unsupported) release notes.
Joint genotyping pipeline from Delta
The joint genotyping pipeline in Databricks Runtime 6.3 for Genomics can now take Delta tables written by the DNASeq pipeline as input. You can therefore run the two pipelines back to back without exporting intermediate results to gVCFs.
Automatic annotation parsing when reading VCFs
The version of Glow included in Databricks Runtime 6.3 for Genomics automatically parses CSQ and ANN INFO fields when reading VCFs. INFO_CSQ and INFO_ANN fields in the resulting DataFrames now have structured schemas for simplified querying.
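To make concrete what a structured annotation field looks like, the following is a minimal plain-Python sketch of splitting a raw CSQ string into named subfields. It is not Glow's implementation: the function name parse_csq, the three-subfield format string, and the sample value are all hypothetical. It assumes only the usual VEP convention that annotations are comma-separated and subfields are pipe-separated, in the order given by the VCF header.

```python
from typing import Dict, List


def parse_csq(header_format: str, csq_value: str) -> List[Dict[str, str]]:
    """Turn a raw CSQ string into one dict per annotation.

    header_format: pipe-separated subfield names (hypothetical here;
                   in a real VCF this comes from the CSQ header line).
    csq_value:     comma-separated annotations, each pipe-separated.
    """
    keys = header_format.split("|")
    records = []
    for entry in csq_value.split(","):
        values = entry.split("|")
        # Pair each subfield name with its value for this annotation.
        records.append(dict(zip(keys, values)))
    return records


# Hypothetical format and value, for illustration only.
csq_format = "Allele|Consequence|SYMBOL"
raw_csq = "T|missense_variant|BRCA1,T|intron_variant|NBR2"

annotations = parse_csq(csq_format, raw_csq)

# With named subfields, a query is a field lookup rather than string slicing.
symbols = [a["SYMBOL"] for a in annotations]
```

Once the subfields are named, selecting or filtering on a single subfield no longer needs ad hoc string handling in each query, which is the simplification the structured schema provides.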
Improvements
Improved multiallelic variant splitter
The multiallelic variant splitter in Glow and Databricks Runtime for Genomics now handles more complex types of multiallelic sites. The new behavior mirrors the vt decompose command line tool. In addition, you can now call glow.transform('split_multiallelics'... to use the splitter as a standalone transformer.
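As a rough illustration of what splitting a multiallelic site means, here is a deliberately simplified plain-Python sketch. It is not the Glow splitter: the function name split_site and the tuple layout are hypothetical, alleles are not trimmed or normalized, and INFO fields are not rewritten. Recoding calls that point at a different alternate allele as missing (None) is an assumption of this sketch.

```python
from typing import List, Optional, Tuple

Call = Optional[int]  # None stands for a missing allele call


def split_site(
    ref: str, alts: List[str], calls: List[Call]
) -> List[Tuple[str, str, List[Call]]]:
    """Split one multiallelic site into one biallelic row per alternate allele.

    calls holds allele indices for a single sample (0 = ref, 1..n = alts).
    """
    rows = []
    for alt_index, alt in enumerate(alts, start=1):
        recoded: List[Call] = []
        for call in calls:
            if call is None:
                recoded.append(None)   # already missing
            elif call == 0:
                recoded.append(0)      # reference stays reference
            elif call == alt_index:
                recoded.append(1)      # this row's alternate allele
            else:
                recoded.append(None)   # a different alternate: missing (assumption)
        rows.append((ref, alt, recoded))
    return rows


# Hypothetical site: REF=A, ALT=C,G, one sample with genotype 1/2.
rows = split_site("A", ["C", "G"], [1, 2])
```

Each output row has exactly one alternate allele, so downstream logic that assumes biallelic sites can process the rows without special cases.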