DNASeq pipeline
Important
This documentation has been retired and might not be updated. The products, services, or technologies mentioned in this content are no longer supported.
The Databricks Genomics runtime has been deprecated. For open source equivalents, see repos for genomics-pipelines and Glow. Bioinformatics libraries that were part of the runtime have been released as a Docker container, which can be pulled from the ProjectGlow Dockerhub page.
For more information about the Databricks Runtime deprecation policy and schedule, see All supported Databricks Runtime releases.
Note
The following library versions are packaged in Databricks Runtime 7.0 for Genomics. For libraries included in lower versions of Databricks Runtime for Genomics, see the release notes.
The Databricks DNASeq pipeline is a GATK best practices compliant pipeline for short read alignment, variant calling, and variant annotation. It uses the following software packages, parallelized using Spark.
BWA v0.7.17
ADAM v0.32.0
GATK HaplotypeCaller v4.1.4.1
SnpEff v4.3
For more information about the pipeline implementation and expected runtimes and costs for various option combinations, see Building the Fastest DNASeq Pipeline at Scale.
Setup
The pipeline is run as a Databricks job. You can set up a cluster policy to save the configuration:
{
"num_workers": {
"type": "unlimited",
"defaultValue": 13
},
"node_type_id": {
"type": "unlimited",
"defaultValue": "c5.9xlarge"
},
"spark_env_vars.refGenomeId": {
"type": "unlimited",
"defaultValue": "grch38"
},
"spark_version": {
"type": "regex",
"pattern": ".*-hls.*",
"defaultValue": "7.4.x-hls-scala2.12"
},
"aws_attributes.ebs_volume_count": {
"type": "unlimited",
"defaultValue": 3
},
"aws_attributes.ebs_volume_size": {
"type": "unlimited",
"defaultValue": 200
}
}
The cluster configuration should use Databricks Runtime for Genomics.
The task should be the DNASeq notebook found at the bottom of this page.
For best performance, use compute optimized instances with at least 60 GB of memory. We recommend c5.9xlarge.
If you’re running base quality score recalibration, use general purpose (m5.4xlarge) instances instead since this operation requires more memory.
To reduce costs, use all spot workers with the Spot fall back to On-demand option selected.
Attach three 200 GB SSD EBS volumes.
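The `spark_version` rule in the policy can be checked locally before you create a cluster. This is a minimal sketch, assuming the policy's `regex` type must match the entire value; the helper name `matches_policy` is hypothetical and is not part of any Databricks API.

```python
import re

# Pattern copied from the spark_version entry of the cluster policy.
SPARK_VERSION_PATTERN = r".*-hls.*"


def matches_policy(spark_version: str, pattern: str = SPARK_VERSION_PATTERN) -> bool:
    """Return True if the whole version string matches the policy pattern."""
    # Assumption: the policy anchors the regex to the full value.
    return re.fullmatch(pattern, spark_version) is not None


print(matches_policy("7.4.x-hls-scala2.12"))  # a Genomics runtime version
print(matches_policy("7.4.x-scala2.12"))      # a non-Genomics runtime version
```

Only version strings that contain `-hls` pass the check, which is how the policy restricts clusters to Databricks Runtime for Genomics.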
Reference genomes
You must configure the reference genome using an environment variable. To use GRCh37, set the environment variable:
refGenomeId=grch37
To use GRCh38 instead, replace grch37 with grch38.
Custom reference genomes
To use a reference build other than GRCh37 or GRCh38, follow these steps:
Prepare the reference for use with BWA and GATK.
The reference genome directory contents should include these files:
<reference-name>.dict
<reference-name>.fa
<reference-name>.fa.amb
<reference-name>.fa.ann
<reference-name>.fa.bwt
<reference-name>.fa.fai
<reference-name>.fa.pac
<reference-name>.fa.sa
Upload the reference genome files to a directory in cloud storage or DBFS. If you upload the files to cloud storage, you must mount the directory to a location in DBFS.
In your cluster configuration, set an environment variable REF_GENOME_PATH that points to the path of the fasta file in DBFS. For example, REF_GENOME_PATH=/mnt/reference-genome/reference.fa
The path must not include a dbfs: prefix.
When you use a custom reference genome, the SnpEff annotation stage is skipped.
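Before starting a cluster, it can help to verify that the reference directory is complete and that the path is well formed. The following is a minimal sketch, not part of the pipeline: the helper names `missing_reference_files` and `is_valid_ref_genome_path` are hypothetical, and the check assumes the fasta file uses the `.fa` extension shown in the file list.

```python
from pathlib import Path

# Suffixes of the files the reference directory should contain.
REQUIRED_SUFFIXES = [
    ".dict", ".fa", ".fa.amb", ".fa.ann",
    ".fa.bwt", ".fa.fai", ".fa.pac", ".fa.sa",
]


def missing_reference_files(ref_dir: str, reference_name: str) -> list:
    """Return the required reference files that are absent from ref_dir."""
    base = Path(ref_dir)
    expected = [reference_name + suffix for suffix in REQUIRED_SUFFIXES]
    return [name for name in expected if not (base / name).is_file()]


def is_valid_ref_genome_path(path: str) -> bool:
    """Check that the path names a .fa file and has no dbfs: prefix."""
    # Assumption: the fasta file ends in .fa, as in the file list.
    return path.endswith(".fa") and not path.startswith("dbfs:")


print(is_valid_ref_genome_path("/mnt/reference-genome/reference.fa"))
print(is_valid_ref_genome_path("dbfs:/mnt/reference-genome/reference.fa"))
```

On a cluster, the directory check would be run against the `/dbfs/<reference-dir-path>` view of the same location.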
Tip
During cluster initialization, the Databricks DNASeq pipeline uses the provided BWA index files to generate an index image file. If you plan to use the same reference genome many times, you can accelerate cluster startup by building the index image file ahead of time. This process will reduce cluster startup time by about 30 seconds.
Copy the reference genome directory to the driver node of a Databricks Runtime for Genomics cluster.
%sh cp -r /dbfs/<reference-dir-path> /local_disk0/reference-genome
Generate the index image file from the BWA index files.
import org.broadinstitute.hellbender.utils.bwa._
BwaMemIndex.createIndexImageFromIndexFiles("/local_disk0/reference-genome/<reference-name>.fa", "/local_disk0/reference-genome/<reference-name>.fa.img")
Copy the index image file to the same directory as the reference fasta files.
%sh cp /local_disk0/reference-genome/<reference-name>.fa.img /dbfs/<reference-dir-path>
Delete the unneeded BWA index files (.amb, .ann, .bwt, .pac, .sa) from DBFS.
%fs rm <file>
Parameters
The pipeline accepts parameters that control its behavior. The most important and commonly changed parameters are documented here; the rest can be found in the DNASeq notebook. After importing the notebook and setting it as a job task, you can set these parameters for all runs or per-run.
Parameter | Default | Description |
---|---|---|
manifest | n/a | The manifest describing the input. |
output | n/a | The path where pipeline output should be written. |
replayMode | skip | One of: |
exportVCF | false | If true, the pipeline writes results in VCF as well as Delta Lake. |
referenceConfidenceMode | NONE | One of: |
perSampleTimeout | 12h | A timeout applied per sample. After reaching this timeout, the pipeline continues on to the next sample. The value of this parameter must include a timeout unit: ‘s’ for seconds, ‘m’ for minutes, or ‘h’ for hours. For example, ‘60m’ results in a timeout of 60 minutes. |
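The perSampleTimeout format can be validated before a run is submitted. This is a minimal sketch, assuming whole-number amounts followed directly by a single unit letter; the pipeline's own parser may accept other forms, and the function name `parse_timeout` is hypothetical.

```python
import re

# Seconds per supported unit: s (seconds), m (minutes), h (hours).
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_timeout(value: str) -> int:
    """Convert a value such as '60m' or '12h' to a number of seconds."""
    match = re.fullmatch(r"(\d+)([smh])", value.strip())
    if match is None:
        raise ValueError(
            f"timeout must be a whole number followed by s, m, or h: {value!r}"
        )
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]


print(parse_timeout("60m"))  # 3600
print(parse_timeout("12h"))  # 43200
```

A value without a unit, such as `60`, raises `ValueError`, which mirrors the requirement that the unit is mandatory.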
Tip
To optimize run time, set the spark.sql.shuffle.partitions Spark configuration to three times the number of cores of the cluster.
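As a worked example of this rule of thumb, the sketch below computes the value for 13 c5.9xlarge workers. It assumes that "cores of the cluster" means the total worker vCPUs (36 per c5.9xlarge) and excludes the driver; the function name is hypothetical.

```python
def recommended_shuffle_partitions(num_workers: int, cores_per_worker: int) -> int:
    """Return three times the total number of worker cores."""
    return 3 * num_workers * cores_per_worker


# 13 workers x 36 vCPUs = 468 cores, so 3 x 468 = 1404 partitions.
partitions = recommended_shuffle_partitions(num_workers=13, cores_per_worker=36)
print(partitions)  # 1404

# In a notebook attached to the cluster, the value could then be applied with:
# spark.conf.set("spark.sql.shuffle.partitions", str(partitions))
```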
Customization
You can customize the DNASeq pipeline by disabling read alignment, variant calling, and variant annotation. By default, all three stages are enabled.
val pipeline = new DNASeqPipeline(align = true, callVariants = true, annotate = true)
To disable variant annotation, set the pipeline as follows:
val pipeline = new DNASeqPipeline(align = true, callVariants = true, annotate = false)
The permitted stage combinations are:
Read alignment | Variant calling | Variant annotation |
---|---|---|
true | true | true |
true | true | false |
true | false | false |
false | true | true |
false | true | false |
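The permitted combinations can be encoded as a lookup so that a job wrapper rejects an invalid configuration before the pipeline starts. This is an illustrative sketch; `PERMITTED_STAGES` and `is_permitted` are hypothetical names, not part of the pipeline.

```python
# (align, callVariants, annotate) rows copied from the table of permitted combinations.
PERMITTED_STAGES = {
    (True, True, True),
    (True, True, False),
    (True, False, False),
    (False, True, True),
    (False, True, False),
}


def is_permitted(align: bool, call_variants: bool, annotate: bool) -> bool:
    """Return True if the stage combination appears in the permitted set."""
    return (align, call_variants, annotate) in PERMITTED_STAGES


print(is_permitted(True, True, False))   # annotation disabled
print(is_permitted(True, False, True))   # annotation without variant calling
```

The table reduces to two conditions: at least one of alignment or variant calling must be enabled, and annotation requires variant calling.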
Manifest format
The manifest is a CSV file or blob describing where to find the input FASTQ or BAM files. For example:
file_path,sample_id,paired_end,read_group_id
*_R1_*.fastq.bgz,HG001,1,read_group
*_R2_*.fastq.bgz,HG001,2,read_group
If your input consists of unaligned BAM files, leave the paired_end field empty:
file_path,sample_id,paired_end,read_group_id
*.bam,HG001,,read_group
Tip
If the provided manifest is a file, the file_path field in each row can be an absolute path or a path relative to the manifest file. If the provided manifest is a blob, the file_path field must be an absolute path. You can include globs (*) to match many files.
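When manifests are generated by a script, Python's standard `csv` module produces the format described above. This is a minimal sketch; the helper name `build_manifest` is hypothetical, and the rows reuse the sample values from the examples.

```python
import csv
import io

# Column order of the manifest header.
FIELDS = ["file_path", "sample_id", "paired_end", "read_group_id"]


def build_manifest(rows) -> str:
    """Render manifest rows as CSV text, leaving missing fields empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()


fastq_manifest = build_manifest([
    {"file_path": "*_R1_*.fastq.bgz", "sample_id": "HG001",
     "paired_end": 1, "read_group_id": "read_group"},
    {"file_path": "*_R2_*.fastq.bgz", "sample_id": "HG001",
     "paired_end": 2, "read_group_id": "read_group"},
])
# For BAM input, the paired_end key is simply left out of the row.
bam_manifest = build_manifest([
    {"file_path": "*.bam", "sample_id": "HG001", "read_group_id": "read_group"},
])
print(fastq_manifest)
print(bam_manifest)
```

`csv.DictWriter` fills absent keys with an empty string, which yields the empty paired_end column used for BAM input.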
Supported input formats
SAM
BAM
CRAM
Parquet
FASTQ
  bgzip: *.fastq.bgz (recommended). Bgzipped files with the *.fastq.gz extension are recognized as bgz.
  uncompressed: *.fastq
  gzip: *.fastq.gz
Important
Gzipped files are not splittable. Choose autoscaling clusters to minimize cost for these files.
To block compress a FASTQ, install htslib, which includes the bgzip executable.
Locally
gunzip -c <my-file>.gz | bgzip -c | aws s3 cp - s3://<my-s3-file-path>.bgz
From S3
aws s3 cp s3://<my-s3-file-path>.gz - | gunzip -c | bgzip -c | aws s3 cp - s3://<my-s3-file-path>.bgz
Output
The aligned reads, called variants, and annotated variants are all written out to Delta tables inside the provided output directory if the corresponding stages are enabled. Each table is partitioned by sample ID. In addition, if you configured the pipeline to export BAMs or VCFs, they’ll appear under the output directory as well.
|---alignments
    |---sampleId=HG001
        |---Parquet files
|---alignments.bam
    |---HG001.bam
|---annotations
    |---Delta files
|---annotations.vcf
    |---HG001.vcf
|---genotypes
    |---Delta files
|---genotypes.vcf
    |---HG001.vcf
When you run the pipeline on a new sample, it’ll appear as a new partition. If you run the pipeline for a sample that already appears in the output directory, that partition will be overwritten.
Since all the information is available in Delta Lake, you can easily analyze it with Spark in Python, R, Scala, or SQL. For example:
# Load the data
df = spark.read.format("delta").load("/genomics/output_dir/genotypes")
# Show all variants from chromosome 12
display(df.where("contigName == '12'").orderBy("sampleId", "start"))
-- Register the table in the catalog
CREATE TABLE genotypes
USING delta
LOCATION '/genomics/output_dir/genotypes'
Troubleshooting
Job is slow and few tasks are running
This usually indicates that the input FASTQ files are compressed with gzip instead of bgzip. Gzipped files are not splittable, so the input cannot be processed in parallel.
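Because a bgzipped file can carry a `.gz` extension, the file name alone does not show which compression was used. The sketch below inspects the first bytes of a file for the BGZF block header (a gzip header with an extra subfield identified by the letters `BC`). It is an illustrative check, not part of the pipeline; the function names are hypothetical.

```python
import struct


def is_bgzf(header: bytes) -> bool:
    """Return True if the bytes start with a BGZF (bgzip) block header."""
    # A gzip member starts with magic bytes 1f 8b and method 8 (deflate).
    if len(header) < 12 or header[:3] != b"\x1f\x8b\x08":
        return False
    # BGZF blocks always set the FEXTRA flag (bit 2 of the FLG byte).
    if not header[3] & 0x04:
        return False
    (xlen,) = struct.unpack("<H", header[10:12])
    extra = header[12:12 + xlen]
    # Walk the extra subfields looking for 'BC' with a 2-byte payload.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack("<BBH", extra[pos:pos + 4])
        if si1 == ord("B") and si2 == ord("C") and slen == 2:
            return True
        pos += 4 + slen
    return False


def file_is_bgzf(path: str) -> bool:
    """Read the start of a local file and test it for a BGZF header."""
    with open(path, "rb") as handle:
        return is_bgzf(handle.read(1024))
```

A file for which `file_is_bgzf` returns False but that begins with the gzip magic bytes is plain gzip and is a candidate for recompression with bgzip.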
Run programmatically
In addition to using the UI, you can start runs of the pipeline programmatically using the Databricks CLI (legacy).

After setting up the pipeline job in the UI, copy the job ID. You pass it to the jobs run-now CLI command.
Here’s an example bash script that you can adapt for your workflow:
# Generate a manifest file
cat <<HERE >manifest.csv
file_path,sample_id,paired_end,read_group_id
dbfs:/genomics/my_new_sample/*_R1_*.fastq.bgz,my_new_sample,1,read_group
dbfs:/genomics/my_new_sample/*_R2_*.fastq.bgz,my_new_sample,2,read_group
HERE
# Upload the file to DBFS
DBFS_PATH=dbfs:/genomics/manifests/$(date +"%Y-%m-%dT%H-%M-%S")-manifest.csv
databricks fs cp manifest.csv $DBFS_PATH
# Start a new run
databricks jobs run-now --job-id <job-id> --notebook-params "{\"manifest\": \"$DBFS_PATH\"}"
In addition to starting runs from the command line, you can use this pattern to invoke the pipeline from automated systems like Jenkins.
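When the call is made from Python rather than bash, `json.dumps` avoids hand-escaping the notebook parameters. This is a minimal sketch that only builds the argument list for the same `jobs run-now` command; the function name `run_now_command` and the manifest path are hypothetical placeholders.

```python
import json
import shlex


def run_now_command(job_id: str, manifest_path: str) -> list:
    """Build the argument list for the legacy `databricks jobs run-now` call."""
    # json.dumps handles the quoting that the bash example escapes by hand.
    notebook_params = json.dumps({"manifest": manifest_path})
    return [
        "databricks", "jobs", "run-now",
        "--job-id", job_id,
        "--notebook-params", notebook_params,
    ]


cmd = run_now_command("<job-id>", "dbfs:/genomics/manifests/manifest.csv")
print(shlex.join(cmd))

# To execute on a machine where the CLI is installed and configured:
# import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list, rather than as one shell string, keeps the JSON intact regardless of the characters in the manifest path.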