Databricks Runtime 6.6 for Genomics (Unsupported)

Databricks released this image in May 2020.

Databricks Runtime 6.6 for Genomics is a version of Databricks Runtime 6.6 (Unsupported) optimized for working with genomic and biomedical data. It is a component of the Databricks Unified Analytics Platform for Genomics.

For more information, including instructions for creating a Databricks Runtime for Genomics cluster, see Databricks Runtime for Genomics (Deprecated). For more information on developing genomics applications, see the Genomics guide.

New features

Databricks Runtime 6.6 for Genomics is built on top of Databricks Runtime 6.6. For information on what’s new in Databricks Runtime 6.6, see the Databricks Runtime 6.6 (Unsupported) release notes.

GFF3 reader

The version of Glow included in Databricks Runtime 6.6 for Genomics can read GFF3 files. The DataFrame schema is inferred from the attributes present in the file. We added this feature in open source.
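
As a rough illustration of what inferring a schema from the attributes in a file can look like, the following pure-Python sketch collects the attribute tags found in the ninth column of each GFF3 record and appends them to the eight fixed columns. It does not use Glow or Spark and is not Glow's implementation; the function names, column names, and sample records are hypothetical.

```python
from urllib.parse import unquote

# The first eight fixed GFF3 columns; the ninth column holds the attributes.
FIXED_COLUMNS = ["seqid", "source", "type", "start", "end",
                 "score", "strand", "phase"]


def parse_attributes(field):
    """Parse a GFF3 attributes column of ';'-separated tag=value pairs."""
    attrs = {}
    if field in ("", "."):
        return attrs
    for pair in field.split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        # GFF3 percent-encodes reserved characters in tags and values.
        attrs[unquote(key)] = unquote(value)
    return attrs


def infer_columns(lines):
    """Return the fixed columns followed by attribute tags in first-seen order."""
    seen = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # not a well-formed feature line
        for key in parse_attributes(fields[8]):
            if key not in seen:
                seen.append(key)
    return FIXED_COLUMNS + seen


# Hypothetical records, for illustration only.
sample = [
    "##gff-version 3",
    "chr1\tdemo\tgene\t100\t900\t.\t+\t.\tID=gene1;Name=DEMO1",
    "chr1\tdemo\tmRNA\t100\t900\t.\t+\t.\tID=tx1;Parent=gene1",
]
columns = infer_columns(sample)
print(columns)
```

Because the columns depend on which tags appear, two GFF3 files with different attributes would yield different schemas under this approach.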

Custom reference genome support

We now support custom reference genomes for the DNASeq, tumor/normal, and joint genotyping pipelines.

Per-sample pipeline timeouts

The DNASeq, RNASeq, and tumor/normal pipelines now have an option to set a per-sample timeout.

BAM export option

The DNASeq, RNASeq, and tumor/normal pipelines now have an option to export to BAM. Aligned reads can be exported as a single BAM or as sharded BAMs.

Manifest blobs

Manifests for the DNASeq, RNASeq, tumor/normal, and joint genotyping pipelines can now be provided via a blob as well as a path. If the manifest is provided via a blob, all paths must be absolute.
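
Because relative paths cannot be resolved when the manifest is a blob, it can help to check the blob before launching a pipeline. The sketch below is a minimal pre-flight check, assuming a CSV manifest whose paths live in a column named `file_path`; that column name, the sample rows, and the rule used to decide whether a path is absolute are all assumptions for illustration, not the pipelines' own validation.

```python
import csv
import io
import re

# Assumption: a path is absolute if it starts with "/" or with a URI scheme
# followed by ":/" (for example "dbfs:/" or "s3://").
_ABSOLUTE = re.compile(r"^(/|[A-Za-z][A-Za-z0-9+.-]*:/)")


def find_relative_paths(manifest_blob, path_column="file_path"):
    """Return the values in path_column that do not look absolute."""
    reader = csv.DictReader(io.StringIO(manifest_blob.strip()))
    relative = []
    for row in reader:
        path = row[path_column].strip()
        if not _ABSOLUTE.match(path):
            relative.append(path)
    return relative


# Hypothetical manifest blob with one absolute and one relative path.
manifest_blob = """\
file_path,sample_id
dbfs:/genomics/reads/sample1_R1.fastq.bgz,sample1
reads/sample2_R1.fastq.bgz,sample2
"""

bad_paths = find_relative_paths(manifest_blob)
print(bad_paths)
```

An empty result means every path in the blob passed the check.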

Improvements

Variant normalizer flexibility

The Glow variant normalizer now accepts compressed reference sequences, such as block-gzipped FASTA files. We added this improvement in open source.
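
If it is unclear whether a compressed reference is block-gzipped (BGZF) or plain gzip, the first bytes of the file can tell the two apart. The following sketch checks for the `BC` extra subfield that the BGZF format places in the gzip header. It is an independent helper written for illustration, not part of Glow, and the function name is hypothetical.

```python
import gzip
import struct


def looks_block_gzipped(data):
    """Return True if data starts with a gzip header carrying the BGZF 'BC' subfield."""
    # BGZF blocks are gzip members with the FEXTRA flag set (FLG == 4).
    if len(data) < 12 or data[:4] != b"\x1f\x8b\x08\x04":
        return False
    # XLEN, the length of the extra field, is a little-endian uint16 at offset 10.
    (xlen,) = struct.unpack_from("<H", data, 10)
    extra = data[12:12 + xlen]
    pos = 0
    # Walk the extra subfields: SI1, SI2, SLEN, then SLEN bytes of payload.
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if (si1, si2, slen) == (66, 67, 2):  # 'B', 'C', 2-byte block size
            return True
        pos += 4 + slen
    return False


# The 28-byte empty BGZF block that marks end of file.
BGZF_EOF = bytes.fromhex(
    "1f8b0804 00000000 00ff 0600 42430200 1b00 0300 00000000 00000000"
)
plain_gzip = gzip.compress(b">chr1\nACGT\n")

print(looks_block_gzipped(BGZF_EOF), looks_block_gzipped(plain_gzip))
```

In practice, passing the first few dozen bytes of the file, read in binary mode, is enough for this check.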

Pipe transformer tolerates empty partitions

The Glow pipe transformer now ignores empty partitions, so that users no longer have to coalesce the input DataFrame. We added this improvement in open source.

Packaged library versions documentation

BAM and VCF files produced by the DNASeq, RNASeq, tumor/normal, and joint genotyping pipelines now record the relevant library versions in their headers.

Duplicate marking performance

Duplicate marking during the read alignment stage of the DNASeq pipeline is now faster and requires less memory.

Other changes

The genotypeGivenAlleles and emitAllAlleles options have been removed from the joint genotyping pipeline.

Libraries

The following libraries included in Databricks Runtime 6.6 for Genomics differ from those included in Databricks Runtime 6.6.

Upgraded libraries

  • GATK: 4.0.11.0 to 4.1.4.1

Packaged libraries

Library       Version
ADAM          0.30.0
GATK          4.1.4.1
Hadoop-bam    7.9.2
Hail          0.2.40
samtools      1.9
VEP           96