Tumor/Normal pipeline
Important
This documentation has been retired and might not be updated. The products, services, or technologies mentioned in this content are no longer supported.
The Databricks Genomics runtime has been deprecated. For open source equivalents, see repos for genomics-pipelines and Glow. Bioinformatics libraries that were part of the runtime have been released as a Docker container, which can be pulled from the ProjectGlow Dockerhub page.
For more information about the Databricks Runtime deprecation policy and schedule, see All supported Databricks Runtime releases.
The Databricks tumor/normal pipeline is a GATK best practices compliant pipeline for short read alignment and somatic variant calling using the MuTect2 variant caller.
Walkthrough
The pipeline consists of the following steps:
Normal sample alignment using BWA-MEM.
Tumor sample alignment using BWA-MEM.
Variant calling with MuTect2.
Setup
The pipeline is run as a Databricks job. You can set up a cluster policy to save the configuration:
{
  "num_workers": {
    "type": "unlimited",
    "defaultValue": 13
  },
  "node_type_id": {
    "type": "unlimited",
    "defaultValue": "c5.9xlarge"
  },
  "spark_env_vars.refGenomeId": {
    "type": "unlimited",
    "defaultValue": "grch38"
  },
  "spark_version": {
    "type": "regex",
    "pattern": ".*-hls.*",
    "defaultValue": "7.4.x-hls-scala2.12"
  },
  "aws_attributes.ebs_volume_count": {
    "type": "unlimited",
    "defaultValue": 3
  },
  "aws_attributes.ebs_volume_size": {
    "type": "unlimited",
    "defaultValue": 200
  }
}
The cluster configuration should use Databricks Runtime for Genomics.
The task should be the tumor/normal notebook found at the bottom of this page.
For best performance, use compute optimized instances with at least 60GB of memory. We recommend c5.9xlarge.
If you’re running base quality score recalibration, use general purpose (m5.4xlarge) instances instead since this operation requires more memory.
To reduce costs, use all spot workers with the Spot fall back to On-demand option selected.
Attach three 200 GB SSD EBS volumes.
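To see how the dotted keys in the cluster policy map onto nested cluster attributes, the following sketch expands each `defaultValue` into a nested dictionary and checks `regex` rules along the way. The helper `expand_defaults` is hypothetical and is not part of any Databricks API. The sketch also assumes that a `regex` pattern must match the whole value.

```python
import re

# The policy from this section. Each dotted key is an attribute path and
# "defaultValue" is the value used when the attribute is not set explicitly.
POLICY = {
    "num_workers": {"type": "unlimited", "defaultValue": 13},
    "node_type_id": {"type": "unlimited", "defaultValue": "c5.9xlarge"},
    "spark_env_vars.refGenomeId": {"type": "unlimited", "defaultValue": "grch38"},
    "spark_version": {
        "type": "regex",
        "pattern": ".*-hls.*",
        "defaultValue": "7.4.x-hls-scala2.12",
    },
    "aws_attributes.ebs_volume_count": {"type": "unlimited", "defaultValue": 3},
    "aws_attributes.ebs_volume_size": {"type": "unlimited", "defaultValue": 200},
}


def expand_defaults(policy):
    """Hypothetical helper: turn dotted policy paths into a nested dict."""
    spec = {}
    for path, rule in policy.items():
        value = rule["defaultValue"]
        # Assumption: a regex rule has to match the entire value.
        if rule["type"] == "regex" and not re.fullmatch(rule["pattern"], str(value)):
            raise ValueError(f"{path}: {value!r} does not match {rule['pattern']!r}")
        *parents, leaf = path.split(".")
        node = spec
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return spec


cluster_defaults = expand_defaults(POLICY)
print(cluster_defaults["aws_attributes"])
```

The `spark_version` rule is what restricts the policy to Genomics runtimes: any version string without `-hls` in it fails the pattern.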
Reference genomes
You must configure the reference genome using an environment variable. To use GRCh37, set the environment variable:
refGenomeId=grch37
To use GRCh38, change grch37 to grch38.
To use a custom reference genome, see instructions in Custom reference genomes.
Parameters
The pipeline accepts parameters that control its behavior. The most important and commonly changed parameters are documented here. To view all available parameters and their usage information, run the first cell of the pipeline notebook. New parameters are added regularly. After importing the notebook and setting it as a job task, you can set these parameters for all runs or per-run.
| Parameter | Default | Description |
|---|---|---|
| manifest | n/a | The manifest describing the input. |
| output | n/a | The path where pipeline output should be written. |
| replayMode | skip | |
| exportVCF | false | If true, the pipeline writes results to a VCF file as well as Delta. |
| perSampleTimeout | 12h | A timeout applied per sample. After reaching this timeout, the pipeline continues on to the next sample. The value of this parameter must include a timeout unit: ‘s’ for seconds, ‘m’ for minutes, or ‘h’ for hours. For example, ‘60m’ results in a timeout of 60 minutes. |
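The unit rule for perSampleTimeout can be checked before you submit a run. The following is a minimal sketch, not the pipeline's own parser: the function name `timeout_seconds` is hypothetical, and it assumes the value is a whole number followed by exactly one of the units s, m, or h.

```python
import re

# Seconds per supported unit, as described for perSampleTimeout.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def timeout_seconds(value):
    """Hypothetical helper: convert a value such as '60m' to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", value.strip())
    if match is None:
        raise ValueError(
            f"invalid timeout {value!r}: expected a number followed by s, m, or h"
        )
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]


print(timeout_seconds("60m"))  # 60 minutes expressed in seconds
print(timeout_seconds("12h"))  # the default timeout expressed in seconds
```

A value without a unit, such as `60`, raises `ValueError` in this sketch, which mirrors the requirement that the unit is mandatory.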
Tip
To optimize run time, set the spark.sql.shuffle.partitions Spark configuration to three times the number of cores of the cluster.
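As a worked example of this tip, the sketch below computes the value for the 13-worker c5.9xlarge configuration used elsewhere on this page. It assumes 36 vCPUs per c5.9xlarge worker and counts worker cores only. Both are assumptions, so adjust them for your cluster. The function name is hypothetical.

```python
def recommended_shuffle_partitions(num_workers, cores_per_worker):
    """Three times the total number of worker cores in the cluster."""
    return 3 * num_workers * cores_per_worker


# Assumption: 13 workers with 36 vCPUs each (c5.9xlarge).
partitions = recommended_shuffle_partitions(num_workers=13, cores_per_worker=36)
print(partitions)

# In a notebook attached to the cluster, the value could then be applied with:
# spark.conf.set("spark.sql.shuffle.partitions", partitions)
```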
Manifest format
The manifest is a CSV file or blob describing where to find the input FASTQ or BAM files. For example:
pair_id,file_path,sample_id,label,paired_end,read_group_id
HG001,*_R1_*.normal.fastq.bgz,HG001_normal,normal,1,read_group_normal
HG001,*_R2_*.normal.fastq.bgz,HG001_normal,normal,2,read_group_normal
HG001,*_R1_*.tumor.fastq.bgz,HG001_tumor,tumor,1,read_group_tumor
HG001,*_R2_*.tumor.fastq.bgz,HG001_tumor,tumor,2,read_group_tumor
If your input consists of unaligned BAM files, leave the paired_end field empty:
pair_id,file_path,sample_id,label,paired_end,read_group_id
HG001,*.normal.bam,HG001_normal,normal,,read_group_normal
HG001,*.tumor.bam,HG001_tumor,tumor,,read_group_tumor
The tumor and normal samples for a given individual are grouped by the pair_id field. Within a pair, the tumor and normal samples must have different sample names and different read group names.
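The pairing rules can be checked before a run. The following is a minimal sketch that groups manifest rows by pair_id and reports pairs whose tumor and normal rows share a sample name or read group name. The function `check_pairs` is hypothetical and is not part of the pipeline. It also assumes that label is always either tumor or normal.

```python
import csv
import io
from collections import defaultdict

MANIFEST = """\
pair_id,file_path,sample_id,label,paired_end,read_group_id
HG001,*_R1_*.normal.fastq.bgz,HG001_normal,normal,1,read_group_normal
HG001,*_R2_*.normal.fastq.bgz,HG001_normal,normal,2,read_group_normal
HG001,*_R1_*.tumor.fastq.bgz,HG001_tumor,tumor,1,read_group_tumor
HG001,*_R2_*.tumor.fastq.bgz,HG001_tumor,tumor,2,read_group_tumor
"""


def check_pairs(manifest_text):
    """Hypothetical helper: return a list of problems found in a manifest."""
    samples = defaultdict(lambda: defaultdict(set))  # pair_id -> label -> sample_ids
    groups = defaultdict(lambda: defaultdict(set))   # pair_id -> label -> read groups
    problems = []
    for row in csv.DictReader(io.StringIO(manifest_text)):
        pair_id, label = row["pair_id"], row["label"]
        if label not in ("tumor", "normal"):
            problems.append(f"{pair_id}: unexpected label {label!r}")
            continue
        samples[pair_id][label].add(row["sample_id"])
        groups[pair_id][label].add(row["read_group_id"])
    for pair_id in sorted(samples):
        for label in ("tumor", "normal"):
            if not samples[pair_id][label]:
                problems.append(f"{pair_id}: no {label} rows")
        if samples[pair_id]["tumor"] & samples[pair_id]["normal"]:
            problems.append(f"{pair_id}: tumor and normal share a sample_id")
        if groups[pair_id]["tumor"] & groups[pair_id]["normal"]:
            problems.append(f"{pair_id}: tumor and normal share a read_group_id")
    return problems


print(check_pairs(MANIFEST))
```

A check like this also catches rows where the label and paired_end values are accidentally swapped, because the label column then contains `1` or `2` instead of `tumor` or `normal`.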
Tip
If the provided manifest is a file, the file_path field in each row may be an absolute path or a path relative to the manifest file. If the provided manifest is a blob, the file_path field must be an absolute path. You can include globs (*) to match many files.
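The path rules in this tip can be sketched as follows. The helper `resolve_file_path` and the example paths are hypothetical. The sketch treats a value with a URI scheme (for example `dbfs:/...`) as absolute, which is an assumption rather than documented pipeline behavior. Globs are left untouched for the pipeline to expand.

```python
import posixpath
import re

# Assumption: anything that starts with a URI scheme counts as absolute.
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")


def resolve_file_path(file_path, manifest_path=None):
    """Hypothetical helper: resolve a manifest row's file_path.

    manifest_path is the location of the manifest file, or None when the
    manifest was passed as a blob.
    """
    if posixpath.isabs(file_path) or _SCHEME.match(file_path):
        return file_path
    if manifest_path is None:
        raise ValueError("a blob manifest requires absolute file_path values")
    # Relative paths are interpreted relative to the manifest's directory.
    return posixpath.join(posixpath.dirname(manifest_path), file_path)


print(resolve_file_path("*_R1_*.normal.fastq.bgz", "/mnt/genomics/manifest.csv"))
```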
Additional usage info and troubleshooting
The tumor/normal pipeline shares many operational details with the other Databricks pipelines. For more detailed usage information, such as output format structure, tips for running programmatically, steps for setting up custom reference genomes, and common issues, see DNASeq pipeline.
Note
The pipeline was renamed from TNSeq to MutSeq in Databricks Runtime 7.3 LTS for Genomics and above.