The Databricks tumor/normal pipeline requires Databricks Runtime HLS, which is in Beta. Interfaces and pricing are subject to change before general availability.
We recommend running the tumor/normal pipeline as a Databricks job. When run interactively, you are charged per DBU as well as per giga base pair.
The pipeline consists of the following steps:
- Normal sample alignment using BWA-MEM.
- Tumor sample alignment using BWA-MEM.
- Variant calling with MuTect2.
The pipeline is run as a Databricks job. Most likely, a Databricks solutions architect will work with you to set up the initial job. The necessary details are:
- The cluster configuration should use Databricks Runtime HLS.
- The task should be the tumor/normal pipeline notebook found at the bottom of this page.
- For best performance, use compute optimized instances with at least 60GB of memory. We
- To reduce costs, use all spot workers with the Spot fall back to On-demand option selected.
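The settings above could be expressed as a job definition along the following lines. This is a minimal sketch, assuming the request shape of the Databricks Jobs API `jobs/create` endpoint on AWS. The runtime version string, notebook path, job name, and worker count are placeholders, and `c5.9xlarge` (a compute optimized instance with 72 GiB of memory) is only one instance type that meets the memory guidance, not a value taken from this page.

```python
import json

# Placeholders (assumptions): substitute the values from your own workspace.
HLS_RUNTIME_VERSION = "<databricks-runtime-hls-version>"
PIPELINE_NOTEBOOK_PATH = "<path-to-tumor-normal-pipeline-notebook>"


def build_job_settings(num_workers):
    """Return a jobs/create style payload for the tumor/normal pipeline."""
    return {
        "name": "tumor-normal-pipeline",  # hypothetical job name
        "new_cluster": {
            "spark_version": HLS_RUNTIME_VERSION,
            "node_type_id": "c5.9xlarge",  # example instance type (assumption)
            "num_workers": num_workers,
            "aws_attributes": {
                # Only the first node (the driver) is on demand;
                # every worker is a spot instance.
                "first_on_demand": 1,
                # Fall back to on-demand if spot capacity is unavailable.
                "availability": "SPOT_WITH_FALLBACK",
            },
        },
        "notebook_task": {"notebook_path": PIPELINE_NOTEBOOK_PATH},
    }


settings = build_job_settings(num_workers=8)
print(json.dumps(settings, indent=2))
```

Keeping the driver on demand while the workers run on spot instances is a design choice in this sketch: losing a worker costs some recomputation, whereas losing the driver ends the run.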
The pipeline accepts parameters that control its behavior. The most important and commonly changed parameters are documented here. To view all available parameters and their usage information, run the first cell of the pipeline notebook. New parameters are added regularly. Parameters can be set for all runs or per-run.
|Parameter|Default|Description|
|---|---|---|
|manifest|n/a|The path of the manifest file describing the input.|
|output|n/a|The path where pipeline output is written.|
|exportVCF|false|If true, the pipeline writes results in VCF as well as Parquet.|
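To illustrate the difference between parameters set for all runs and parameters set per run, the sketch below assumes the Jobs API convention in which a notebook task's `base_parameters` act as job-level defaults and the `notebook_params` passed when triggering a run are merged on top of them. The function name and the DBFS paths are hypothetical.

```python
def effective_parameters(base_parameters, notebook_params=None):
    """Merge job-level defaults with per-run overrides; overrides win."""
    merged = dict(base_parameters)
    merged.update(notebook_params or {})
    return merged


# Defaults stored on the job (hypothetical paths). Values are strings
# because notebook parameters are passed as text.
base_parameters = {
    "manifest": "dbfs:/genomics/manifests/cohort.csv",
    "output": "dbfs:/genomics/output/cohort",
    "exportVCF": "false",
}

# A single run that should also write VCF output.
run_parameters = effective_parameters(base_parameters, {"exportVCF": "true"})
print(run_parameters)
```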
To optimize runtime, set spark.sql.shuffle.partitions in the Spark config to three times the number of cores of the cluster.
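As a worked example of this rule, the sketch below computes the setting for a cluster of a given size. It assumes that "cores of the cluster" means the total number of worker cores. The function name and the cluster size used here (8 workers with 36 cores each) are illustrative only.

```python
def recommended_shuffle_partitions(num_workers, cores_per_worker):
    """Return three times the total number of worker cores."""
    total_cores = num_workers * cores_per_worker
    return 3 * total_cores


# Example: 8 workers with 36 cores each gives 288 cores, so 864 partitions.
spark_conf = {
    "spark.sql.shuffle.partitions": str(recommended_shuffle_partitions(8, 36))
}
print(spark_conf)  # {'spark.sql.shuffle.partitions': '864'}
```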
You can configure pre-built reference genomes for human builds GRCh37 and GRCh38 using environment variables. By default, the pipeline runs against GRCh37. To use GRCh38 instead, set an environment variable like this:
The manifest is a CSV file describing where to find the input FASTQ or BAM files. An example:
pair_id,file_path,sample_id,paired_end,read_group_id
HG001,*_R1_*.normal.fastq.bgz,HG001_normal,1,read_group_normal
HG001,*_R2_*.normal.fastq.bgz,HG001_normal,2,read_group_normal
HG001,*_R1_*.tumor.fastq.bgz,HG001_tumor,1,read_group_tumor
HG001,*_R2_*.tumor.fastq.bgz,HG001_tumor,2,read_group_tumor
If your input consists of unaligned BAM files, you should omit the paired_end field:
pair_id,file_path,sample_id,paired_end,read_group_id
HG001,*.normal.bam,HG001_normal,,read_group_normal
HG001,*.tumor.bam,HG001_tumor,,read_group_tumor
The tumor and normal samples for a given individual are grouped by the pair_id field. The tumor and normal sample names and read group names must be different within a pair.
The file_path field in each row may be an absolute path or a path relative to the manifest. You can include globs (*) to match many files.
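The manifest rules above can be checked before a job is submitted. The following is a minimal sketch of such a check, not part of the pipeline itself: the function name is hypothetical, and it assumes each pair contains exactly two samples (one tumor, one normal) and that a read group name must not be shared by two samples in the same pair.

```python
import csv
import io
from collections import defaultdict

MANIFEST = """\
pair_id,file_path,sample_id,paired_end,read_group_id
HG001,*_R1_*.normal.fastq.bgz,HG001_normal,1,read_group_normal
HG001,*_R2_*.normal.fastq.bgz,HG001_normal,2,read_group_normal
HG001,*_R1_*.tumor.fastq.bgz,HG001_tumor,1,read_group_tumor
HG001,*_R2_*.tumor.fastq.bgz,HG001_tumor,2,read_group_tumor
"""


def validate_manifest(rows):
    """Return a list of problems; an empty list means the checks passed."""
    samples = defaultdict(set)       # pair_id -> set of sample_ids
    read_groups = defaultdict(dict)  # pair_id -> {read_group_id: sample_id}
    problems = []
    for row in rows:
        pair, sample = row["pair_id"], row["sample_id"]
        samples[pair].add(sample)
        # Remember which sample first used this read group name.
        owner = read_groups[pair].setdefault(row["read_group_id"], sample)
        if owner != sample:
            problems.append(
                f"pair {pair}: read group {row['read_group_id']} "
                f"is used by both {owner} and {sample}"
            )
    for pair, ids in samples.items():
        if len(ids) != 2:
            problems.append(
                f"pair {pair}: expected 2 distinct sample names, found {len(ids)}"
            )
    return problems


rows = list(csv.DictReader(io.StringIO(MANIFEST)))
problems = validate_manifest(rows)
print(problems or "manifest looks consistent")
```

The check reads the manifest text only. It does not expand the globs in file_path or confirm that the files exist.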
The tumor/normal pipeline shares many operational details with the other Databricks pipelines. For more detailed usage information, such as output format structure, tips for running programmatically, and steps for setting up custom reference genomes, see DNASeq Pipeline.