Databricks Runtime 6.6 for Genomics (Unsupported)
Databricks released this image in May 2020.
Databricks Runtime 6.6 for Genomics is a version of Databricks Runtime 6.6 (Unsupported) optimized for working with genomic and biomedical data. It is a component of the Databricks Unified Analytics Platform for Genomics.
For more information, including instructions for creating a Databricks Runtime for Genomics cluster and for developing genomics applications, see the Genomics guide.
New features
Databricks Runtime 6.6 for Genomics is built on top of Databricks Runtime 6.6. For information on what’s new in Databricks Runtime 6.6, see the Databricks Runtime 6.6 (Unsupported) release notes.
GFF3 reader
The version of Glow included in Databricks Runtime 6.6 for Genomics can read GFF3 files. The DataFrame schema is inferred from the attributes present in the file. We added this feature in open source.
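To illustrate what inferring a schema from attributes means, the following is a minimal standalone sketch in plain Python. It is not Glow's implementation. It assumes only the GFF3 convention that the ninth column holds `key=value` pairs separated by semicolons, and the helper names `parse_attributes` and `infer_columns` are hypothetical.

```python
from urllib.parse import unquote

# The eight fixed GFF3 columns that precede the attributes column.
GFF3_BASE_COLUMNS = ["seqid", "source", "type", "start", "end", "score", "strand", "phase"]


def parse_attributes(field):
    """Parse GFF3 column 9 ('key=value;key=value') into a dict (hypothetical helper)."""
    attrs = {}
    if field in ("", "."):
        return attrs
    for pair in field.strip().split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        # GFF3 uses URL-style percent escapes for reserved characters.
        attrs[unquote(key.strip())] = unquote(value.strip())
    return attrs


def infer_columns(lines):
    """Return the base columns plus every attribute key seen, in first-seen order."""
    seen = []
    for line in lines:
        # Skip blank lines, directives, and comments.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue
        for key in parse_attributes(fields[8]):
            if key not in seen:
                seen.append(key)
    return GFF3_BASE_COLUMNS + seen


# Made-up sample records for illustration only.
sample = [
    "##gff-version 3",
    "chr1\texample\tgene\t1000\t9000\t.\t+\t.\tID=gene0001;Name=EDEN",
    "chr1\texample\tmRNA\t1050\t9000\t.\t+\t.\tID=mRNA0001;Parent=gene0001",
]
columns = infer_columns(sample)
```

In this sketch, a file whose records carry only `ID`, `Name`, and `Parent` attributes yields exactly those three extra columns. A file with different attributes would yield a different column set.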
Custom reference genome support
We now support custom reference genomes for the DNASeq, tumor/normal, and joint genotyping pipelines.
Per-sample pipeline timeouts
The DNASeq, RNASeq, and tumor/normal pipelines now have an option to set a per-sample timeout.
BAM export option
The DNASeq, RNASeq, and tumor/normal pipelines now have an option to export to BAM. Aligned reads can be exported as a single BAM or as sharded BAMs.
Manifest blobs
Manifests for the DNASeq, RNASeq, tumor/normal, and joint genotyping pipelines can now be provided via a blob as well as a path. If the manifest is provided via a blob, all paths must be absolute.
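Because a blob manifest has no file location to resolve relative paths against, it can help to check paths before launching a pipeline. The following is a rough sketch under stated assumptions: the manifest is CSV text with a header row, the path column is named `file_path`, and a path counts as absolute if it starts with `/` or with a URI scheme such as `dbfs:/`. The helper names and the sample paths are hypothetical and not part of the pipelines.

```python
import csv
import io
import re

# Assumption: a leading URI scheme followed by '/' (for example 'dbfs:/') marks an absolute path.
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:/")


def is_absolute(path):
    """Return True for paths starting with '/' or with a URI scheme (hypothetical rule)."""
    return path.startswith("/") or bool(_SCHEME.match(path))


def relative_paths(manifest_blob, path_column="file_path"):
    """List the manifest paths that are not absolute (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(manifest_blob))
    return [row[path_column] for row in reader if not is_absolute(row[path_column])]


# Made-up manifest blob for illustration only.
blob = """file_path,sample_id
dbfs:/genomics/HG001_R1.fastq.gz,HG001
reads/HG002_R1.fastq.gz,HG002
"""
bad = relative_paths(blob)
```

An empty result means every path in the blob passed the check. Any entries returned would need to be rewritten as absolute paths before the blob is supplied.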
Improvements
Variant normalizer flexibility
The Glow variant normalizer now accepts compressed reference sequences, such as block-gzipped FASTA files. We added this improvement in open source.
Pipe transformer tolerates empty partitions
The Glow pipe transformer now ignores empty partitions, so that users no longer have to coalesce the input DataFrame. We added this improvement in open source.
Packaged library versions documentation
BAMs and VCFs output from the DNASeq, RNASeq, tumor/normal, and joint genotyping pipelines now document the relevant library versions in their headers.
Duplicate marking performance
Duplicate marking during the read alignment stage of the DNASeq pipeline is now faster and requires less memory.
Other changes
The genotypeGivenAlleles and emitAllAlleles options have been removed from the joint genotyping pipeline.