Databricks Runtime 6.2 for Genomics

Databricks released this image in December 2019.

Databricks Runtime for Genomics (Databricks Runtime Genomics) is a variant of Databricks Runtime 6.2 optimized for working with genomic and biomedical data. It is a component of the Databricks Unified Analytics Platform for Genomics.

For more information, including instructions for creating a Databricks Runtime for Genomics cluster, see Databricks Runtime for Genomics. For more information on developing genomics applications, see Genomics.

New features

Databricks Runtime 6.2 for Genomics is built on top of Databricks Runtime 6.2. For information on what’s new in Databricks Runtime 6.2, see the Databricks Runtime 6.2 release notes.

Firth logistic regression

The version of Glow included in Databricks Runtime 6.2 for Genomics provides a Firth logistic regression test.
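
For readers unfamiliar with the method, the following is a minimal, library-free sketch of Firth's bias-reduced logistic regression for one covariate plus an intercept. It is not Glow's implementation or API. The function name `firth_logistic` and its parameters are hypothetical, chosen only for illustration. The sketch uses Fisher scoring on the Firth-modified score, which adds the term h * (0.5 - p) to each residual, where h is the observation's leverage.

```python
import math


def firth_logistic(xs, ys, max_iter=200, tol=1e-10):
    """Fit logit P(y=1) = b0 + b1 * x using Firth's modified score.

    Illustrative only: assumes x is not constant, so the 2x2
    information matrix is invertible.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        ps = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in xs]
        ws = [p * (1.0 - p) for p in ps]

        # Fisher information X^T W X for the design columns (1, x).
        i00 = sum(ws)
        i01 = sum(w * x for w, x in zip(ws, xs))
        i11 = sum(w * x * x for w, x in zip(ws, xs))
        det = i00 * i11 - i01 * i01

        # Inverse of the symmetric 2x2 information matrix.
        v00, v01, v11 = i11 / det, -i01 / det, i00 / det

        # Leverages: diagonal of the weighted hat matrix.
        hs = [w * (v00 + 2.0 * v01 * x + v11 * x * x) for w, x in zip(ws, xs)]

        # Firth-modified residuals and score vector.
        rs = [y - p + h * (0.5 - p) for y, p, h in zip(ys, ps, hs)]
        u0 = sum(rs)
        u1 = sum(r * x for r, x in zip(rs, xs))

        # Scoring step: beta += (X^T W X)^-1 * score.
        d0 = v00 * u0 + v01 * u1
        d1 = v01 * u0 + v11 * u1
        b0 += d0
        b1 += d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1


# Perfectly separated toy data: every carrier (x = 1) is a case.
intercept, slope = firth_logistic([0, 0, 1, 1], [0, 0, 1, 1])
```

On perfectly separated data such as this toy example, ordinary maximum likelihood has no finite solution, while the Firth estimate stays finite. This is the usual motivation for preferring the test when a variant is rare or the phenotype is unbalanced.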

User-defined sample quality control metrics

You can aggregate over genotypes for each sample in a DataFrame using aggregate_by_index. This function allows you to compute per-sample quality control (QC) metrics that are not included in the built-in QC functions.
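
The idea of aggregating by array index can be sketched in plain Python. This is an illustration of the concept only, not Glow's aggregate_by_index signature. The helper `per_sample_aggregate`, the genotype encoding (tuples of allele indices, with -1 for a missing allele), and the heterozygosity metric are all assumptions made for the example.

```python
def per_sample_aggregate(variant_rows, zero, update, evaluate):
    """Fold the genotypes at each array index (one sample) across all rows."""
    n_samples = len(variant_rows[0])
    states = [zero] * n_samples
    for genotypes in variant_rows:
        if len(genotypes) != n_samples:
            raise ValueError("every row must list the samples in the same order")
        states = [update(state, gt) for state, gt in zip(states, genotypes)]
    return [evaluate(state) for state in states]


def count_hets(state, genotype):
    """State is (called, heterozygous); skip genotypes with missing alleles."""
    called, hets = state
    if any(allele < 0 for allele in genotype):
        return state
    return (called + 1, hets + int(len(set(genotype)) > 1))


def het_fraction(state):
    called, hets = state
    return hets / called if called else None


# One inner list per variant; position i always refers to sample i.
rows = [
    [(0, 1), (0, 0), (-1, -1)],
    [(1, 1), (-1, -1), (-1, -1)],
    [(0, 0), (0, 1), (1, 1)],
    [(0, 1), (0, 0), (0, 0)],
]

het_fractions = per_sample_aggregate(rows, (0, 0), count_hets, het_fraction)
```

Because the state and the update function are supplied by the caller, the same loop can produce other per-sample metrics by swapping in different functions.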

Improvements

Pipe transformer performance

The overhead of the pipe transformer has been reduced by roughly half. This speedup means that you can use Databricks Runtime for Genomics to parallelize command-line tools without sacrificing per-core efficiency.
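
As a rough picture of what piping data through a command-line tool involves, the sketch below feeds each partition of text lines to a subprocess on stdin and collects its stdout. It uses only the Python standard library and is not the pipe transformer's API or implementation. The helpers `pipe_partition` and `pipe_all`, and the stand-in command (a Python one-liner that upper-cases its input), are hypothetical.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def pipe_partition(lines, cmd):
    """Send one partition to the tool on stdin and return its stdout lines."""
    result = subprocess.run(
        cmd,
        input="".join(line + "\n" for line in lines),
        capture_output=True,
        text=True,
        check=True,  # raise if the tool exits with a non-zero status
    )
    return result.stdout.splitlines()


def pipe_all(partitions, cmd, max_workers=4):
    """Run the tool once per partition, several partitions at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda part: pipe_partition(part, cmd), partitions))


# Stand-in for a real command-line tool: upper-case every input line.
upper_cmd = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin:\n    sys.stdout.write(line.upper())",
]

piped = pipe_all([["chr1\t100", "chr1\t200"], ["chr2\t50"]], upper_cmd)
```

The per-partition cost of starting the process and serializing records in and out is the kind of overhead that matters for per-core efficiency when the tool itself is fast.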

Joint genotyping robustness

The joint genotyping pipeline provided in Databricks Runtime 6.2 for Genomics handles sample manifests with thousands of entries more efficiently. In addition, the pipeline now handles missing gVCF blocks gracefully by inserting explicit no-calls.
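
The no-call behavior can be pictured with a small sketch. It assumes a simplified model in which each sample's gVCF is a list of (start, end, genotype) blocks and a missing genotype is written as "./.". The function names are hypothetical, and this is not the pipeline's actual code.

```python
NO_CALL = "./."


def genotype_at(blocks, position):
    """Return the genotype of the block covering position, or a no-call."""
    for start, end, genotype in blocks:
        if start <= position <= end:
            return genotype
    return NO_CALL


def joint_row(position, sample_order, blocks_by_sample):
    """Build one joint-genotyped row, one genotype per sample, in manifest order."""
    return [
        genotype_at(blocks_by_sample.get(sample, []), position)
        for sample in sample_order
    ]


# Toy input: s2 has no block covering position 101, and s3 has no data at all.
blocks_by_sample = {
    "s1": [(1, 100, "0/0"), (101, 101, "0/1")],
    "s2": [(1, 50, "0/0")],
}

row_30 = joint_row(30, ["s1", "s2", "s3"], blocks_by_sample)
row_101 = joint_row(101, ["s1", "s2", "s3"], blocks_by_sample)
```

Emitting an explicit no-call keeps every output row the same length as the sample manifest, so downstream per-sample aggregations stay aligned by index.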

Simplified integration with LOFTEE

The VEP annotation pipeline included in Databricks Runtime for Genomics provides streamlined integration with LOFTEE.

Hail 0.26.0

Databricks Runtime 6.2 for Genomics includes Hail 0.26.0.

Samtools 1.9

Samtools 1.9 is now installed.