The pipeline typically consists of the following steps:
- Ingest variants into Delta Lake.
- Joint-call the cohort with GenotypeGVCFs.
During variant ingest, single-sample gVCFs are processed in batches and the rows are stored in Delta Lake to provide fault tolerance, fast querying, and incremental joint genotyping. In the joint genotyping step, the gVCF rows are ingested from Delta Lake, split into bins, and distributed to partitions. For each variant site, the relevant gVCF rows per sample are identified and used for regenotyping.
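The binning step can be pictured with a small sketch. It assumes fixed-width bins and a simplified row of sample, contig, start, and end; the pipeline's actual bin width and partitioning scheme are not described on this page, so `BIN_SIZE`, `GvcfRow`, and the helper names are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple

BIN_SIZE = 5000  # hypothetical bin width in base pairs


class GvcfRow(NamedTuple):
    sample: str
    contig: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def bins_for(row: GvcfRow, bin_size: int = BIN_SIZE) -> range:
    """Return the indices of every bin that the row overlaps."""
    return range((row.start - 1) // bin_size, (row.end - 1) // bin_size + 1)


def assign_to_bins(rows, bin_size: int = BIN_SIZE) -> dict:
    """Group rows by (contig, bin index).

    A row that spans a bin boundary is copied into each bin it overlaps,
    so every bin holds all rows relevant to the sites it covers.
    """
    binned = defaultdict(list)
    for row in rows:
        for index in bins_for(row, bin_size):
            binned[(row.contig, index)].append(row)
    return dict(binned)


rows = [
    GvcfRow("sampleA", "20", 100, 200),
    GvcfRow("sampleB", "20", 4990, 5010),  # straddles the first bin boundary
]
binned = assign_to_bins(rows)
```

In this sketch each `(contig, bin index)` key could be handed to a separate partition, which would then look up the rows for each sample at a given site within that bin.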
The pipeline runs as a Databricks job. A Databricks solutions architect will most likely work with you to set up the initial job. The necessary details are:
- The cluster configuration should use Databricks Runtime HLS.
- The task should be the joint genotyping pipeline notebook found at the bottom of this page.
- For best performance, use storage-optimized instances. We recommend
- To reduce costs, use all spot workers with the Spot fall back to On-demand option selected.
- To reduce costs, enable autoscaling with a minimum of 1 worker and a maximum of 10-50 depending on latency requirements.
- Enable autoscaling local storage to ensure that the cluster doesn’t run out of disk space.
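The settings in this list can also be expressed as a cluster specification when a job is created programmatically. The following is a minimal sketch, assuming an AWS workspace and the Clusters API field names `autoscale`, `aws_attributes`, and `enable_elastic_disk`. The runtime version key and the instance type are deliberately left as placeholders, and `build_cluster_spec` is a hypothetical helper.

```python
import json


def build_cluster_spec(spark_version: str, node_type_id: str, max_workers: int = 10) -> dict:
    """Assemble a cluster spec that mirrors the recommendations in the list above."""
    if max_workers < 1:
        raise ValueError("max_workers must be at least 1")
    return {
        "spark_version": spark_version,  # placeholder: the Databricks Runtime HLS version key
        "node_type_id": node_type_id,    # placeholder: a storage-optimized instance type
        "autoscale": {"min_workers": 1, "max_workers": max_workers},
        "aws_attributes": {
            "first_on_demand": 1,                  # driver on-demand, all workers spot
            "availability": "SPOT_WITH_FALLBACK",  # fall back to on-demand when spot is unavailable
        },
        "enable_elastic_disk": True,     # autoscaling local storage
    }


spec = build_cluster_spec("<hls-runtime-version>", "<storage-optimized-node-type>", max_workers=50)
print(json.dumps(spec, indent=2))
```

Choose `max_workers` within the 10-50 range according to how quickly the cohort needs to be processed.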
The pipeline accepts parameters that control its behavior. The most important and commonly changed parameters are documented here. To view all available parameters and their usage information, run the first cell of the pipeline notebook. New parameters are added regularly. Parameters can be set for all runs or per-run.
| Parameter | Default | Description |
|---|---|---|
| manifest | n/a | The path of the manifest file describing the input. |
| output | n/a | The path where pipeline output is written. |
| exportVCF | false | If true, the pipeline writes results in VCF as well as Delta Lake. |
| targetedRegions | n/a | Path to files containing regions to call. If omitted, calls all regions. |
| genotypeGivenAlleles | false | If true, regenotypes variant sites based on the alleles in the input gVCFs. |
| emitAllSites | false | If true, retain low quality sites in the output. |
| gvcfDeltaOutput | n/a | If specified, gVCFs are ingested to a Delta Lake table before genotyping. You should specify this parameter only if you expect to joint call the same gVCFs many times. |

How to handle malformed records, both during loading and validation.
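When parameters are set per run, for example through the `notebook_params` field of a Jobs API run request, they are passed as string key-value pairs. The following is a minimal sketch of assembling such a map from the parameters documented above; `build_notebook_params` is a hypothetical helper and the paths are illustrative only.

```python
def build_notebook_params(manifest: str, output: str, **overrides) -> dict:
    """Return string-valued parameters for one run of the pipeline notebook."""
    params = {"manifest": manifest, "output": output}
    for name, value in overrides.items():
        # Notebook parameters are strings, so booleans become "true" or "false".
        params[name] = str(value).lower() if isinstance(value, bool) else str(value)
    return params


params = build_notebook_params(
    "dbfs:/example/manifest.txt",  # hypothetical manifest path
    "dbfs:/example/joint-calls",   # hypothetical output path
    exportVCF=True,
    emitAllSites=False,
)
```

Parameters that are not listed keep the defaults shown in the table.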
To keep rare variants, set
true. This is equivalent to changing the GATK settings DISCOVERY (choose the most probable alleles) to GENOTYPE_GIVEN_ALLELES (use the alleles present in the input gVCFs), and EMIT_VARIANTS_ONLY (produces calls only at variant sites) to EMIT_ALL_SITES (produces calls at any callable site regardless of confidence).
The regenotyped variants are all written out to Delta Lake tables inside the provided output directory. In addition, if you configured the pipeline to export VCFs, they’ll appear under the output directory as well.
output
|---genotypes
    |---Delta files
|---genotypes.vcf
    |---VCF files
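Based on this layout, the regenotyped variants can be read back with the standard Delta reader. The following is a minimal sketch, assuming a Spark session with Delta Lake support (as in a Databricks notebook) and that the table lives in the `genotypes` directory shown above. Both helper names and the example path are hypothetical.

```python
import posixpath


def genotypes_table_path(output: str) -> str:
    """Return the location of the Delta table under the pipeline output directory."""
    return posixpath.join(output.rstrip("/"), "genotypes")


def load_genotypes(spark, output: str):
    """Read the regenotyped variants as a DataFrame (needs a Delta-enabled Spark session)."""
    return spark.read.format("delta").load(genotypes_table_path(output))


path = genotypes_table_path("dbfs:/example/joint-calls/")  # hypothetical output directory
```

In a notebook, `load_genotypes(spark, "dbfs:/example/joint-calls")` would return a DataFrame that can be queried or filtered like any other Delta table.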
You must configure the reference genome using environment variables. To use GRCh37, set the environment variable:
To use GRCh38, change
The manifest is a file describing where to find the input gVCF files, with one path per row.
Each row may be an absolute path or a path relative to the manifest. You can include globs (*) to match many files.
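The resolution rules for manifest rows can be sketched as follows. This illustrates the rules as described here and is not the pipeline's own parser: it assumes POSIX-style paths, treats scheme-qualified URIs such as `dbfs:/...` as absolute, and uses Python's `fnmatch` for globs, whose `*` also matches across directory separators. All function names and paths are hypothetical.

```python
import fnmatch
import posixpath
from urllib.parse import urlparse


def is_absolute(row: str) -> bool:
    """Treat rooted paths and scheme-qualified URIs as absolute."""
    return row.startswith("/") or bool(urlparse(row).scheme)


def resolve_rows(manifest_path: str, rows) -> list:
    """Resolve relative rows against the directory that contains the manifest."""
    base = posixpath.dirname(manifest_path)
    return [
        row if is_absolute(row) else posixpath.join(base, row)
        for row in (r.strip() for r in rows)
        if row  # skip blank rows
    ]


def expand(pattern: str, available) -> list:
    """Return the available paths that match a resolved row, honoring * globs."""
    return [p for p in available if fnmatch.fnmatchcase(p, pattern)]


resolved = resolve_rows(
    "/dbfs/example/manifest.txt",
    ["/dbfs/other/sampleA.g.vcf", "batch1/*.g.vcf", ""],
)
matched = expand(
    resolved[1],
    ["/dbfs/example/batch1/s1.g.vcf", "/dbfs/example/batch1/s1.bam"],
)
```

Here the second row is interpreted relative to `/dbfs/example`, and its glob selects only the `.g.vcf` file in `batch1`.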
- Job fails with an
- This error usually indicates that an input record has an incorrect number of genotype probabilities. Try setting the
The joint genotyping pipeline shares many operational details with the other Databricks pipelines. For more detailed usage information, such as output format structure, tips for running programmatically, and steps for setting up custom reference genomes, see DNASeq Pipeline.