Databricks Runtime 6.2 for Genomics (Unsupported)
Databricks released this image in December 2019.
Databricks Runtime for Genomics (Databricks Runtime Genomics) is a variant of Databricks Runtime 6.2 (Unsupported) optimized for working with genomic and biomedical data. It is a component of the Databricks Unified Analytics Platform for Genomics.
For more information, including instructions for creating a Databricks Runtime for Genomics cluster and guidance on developing genomics applications, see Genomics guide.
New features
Databricks Runtime 6.2 for Genomics is built on top of Databricks Runtime 6.2. For information on what’s new in Databricks Runtime 6.2, see the Databricks Runtime 6.2 (Unsupported) release notes.
Firth logistic regression
The version of Glow included in Databricks Runtime 6.2 for Genomics provides a Firth logistic regression test.
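For readers unfamiliar with the method, the following is a minimal textbook sketch of Firth's bias-reduced logistic regression for a single predictor plus an intercept. It is not Glow's implementation or API: the function name `firth_logistic` and its parameters are hypothetical, and it only estimates coefficients (no test statistic or p-value). It applies the standard Firth-modified score, in which each residual `y - p` is adjusted by `h * (0.5 - p)`, where `h` is the observation's leverage.

```python
import math


def firth_logistic(x, y, max_iter=200, tol=1e-10):
    """Fit logit(p) = b0 + b1 * x using Firth's penalized likelihood.

    Hypothetical helper for illustration only. Assumes x is not constant,
    so that the 2x2 Fisher information matrix is invertible.
    Returns the tuple (b0, b1).
    """
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        # Fitted probabilities and logistic weights p * (1 - p)
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]

        # Fisher information I = X^T W X for the design matrix [1, x]
        i00 = sum(w)
        i01 = sum(wi * xi for wi, xi in zip(w, x))
        i11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = i00 * i11 - i01 * i01

        # Inverse of the symmetric 2x2 information matrix
        v00, v01, v11 = i11 / det, -i01 / det, i00 / det

        # Leverages h_i = w_i * [1, x_i] I^{-1} [1, x_i]^T
        h = [wi * (v00 + 2.0 * v01 * xi + v11 * xi * xi)
             for wi, xi in zip(w, x)]

        # Firth-modified score: residuals shifted by h_i * (0.5 - p_i)
        r = [yi - pi + hi * (0.5 - pi) for yi, pi, hi in zip(y, p, h)]
        u0 = sum(r)
        u1 = sum(ri * xi for ri, xi in zip(r, x))

        # Scoring step: beta <- beta + I^{-1} U*
        d0 = v00 * u0 + v01 * u1
        d1 = v01 * u0 + v11 * u1
        b0 += d0
        b1 += d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1


# Perfectly separated data: ordinary maximum likelihood diverges here,
# while the Firth-penalized estimate stays finite.
b0, b1 = firth_logistic([0, 0, 1, 1], [0, 0, 1, 1])
print(b0, b1)
```

The penalty matters most for rare variants or small case counts, where separation makes the unpenalized estimate unstable.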
User-defined sample quality control metrics
You can aggregate over genotypes for each sample in a DataFrame using aggregate_by_index. This function allows you to compute per-sample quality control (QC) metrics that are not included in built-in QC functions.
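As a rough mental model of index-wise aggregation, the plain-Python sketch below folds over rows (one per variant) whose genotype arrays are aligned by sample, keeping one running state per array index. It does not call Glow or Spark: `aggregate_by_index_model`, its `zero` and `update` parameters, and the tuple encoding of genotype calls (with -1 marking a missing call) are all assumptions made for illustration. The example computes a user-defined metric, the number of heterozygous calls and the number of called genotypes per sample.

```python
def aggregate_by_index_model(rows, zero, update):
    """Fold `update` over rows, keeping one state per array position.

    Hypothetical model for illustration: `rows` is an iterable of
    equal-length lists, and position i of every list belongs to sample i.
    `zero` must be immutable, because it seeds every sample's state.
    """
    states = None
    for genotypes in rows:
        if states is None:
            states = [zero] * len(genotypes)
        if len(genotypes) != len(states):
            raise ValueError("every row must hold one genotype per sample")
        states = [update(s, g) for s, g in zip(states, genotypes)]
    return [] if states is None else states


def count_het_and_called(state, calls):
    """Update (n_het, n_called) with one genotype, skipping no-calls."""
    n_het, n_called = state
    if any(c < 0 for c in calls):  # -1 marks a missing call (assumed encoding)
        return state
    is_het = len(set(calls)) > 1
    return (n_het + int(is_het), n_called + 1)


# Three variants (rows) by three samples (columns)
rows = [
    [(0, 0), (0, 1), (-1, -1)],
    [(0, 1), (1, 1), (0, 0)],
    [(0, 1), (0, 0), (-1, -1)],
]
per_sample = aggregate_by_index_model(rows, (0, 0), count_het_and_called)
print(per_sample)  # one (n_het, n_called) tuple per sample
```

A distributed version would also need a way to combine partial states computed on different partitions, which this single-process sketch omits.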
Improvements
Pipe transformer performance
The overhead of the pipe transformer has been reduced by roughly half. With less overhead per invocation, you can use Databricks Runtime for Genomics to parallelize command-line tools more efficiently.
Joint genotyping robustness
The joint genotyping pipeline provided in Databricks Runtime 6.2 for Genomics more efficiently handles sample manifests with thousands of entries. In addition, the pipeline now handles missing gVCF blocks gracefully by inserting explicit no-calls.
Simplified integration with LOFTEE
The VEP annotation pipeline included in Databricks Runtime for Genomics provides streamlined integration with LOFTEE.
Samtools 1.9
Samtools 1.9 is now installed.