Joint genotyping pipeline

Important

This documentation has been retired and might not be updated. The products, services, or technologies mentioned in this content are no longer supported.

The Databricks Runtime for Genomics has been deprecated. For open-source equivalents, see the genomics-pipelines and Glow repos. The bioinformatics libraries that were part of the runtime have been released as a Docker container, which can be pulled from the ProjectGlow Docker Hub page.

For more information about the Databricks Runtime deprecation policy and schedule, see All supported Databricks Runtime releases.

The Databricks joint genotyping pipeline is a GATK best practices-compliant pipeline that performs joint genotyping with GenotypeGVCFs.

Walkthrough

The pipeline typically consists of the following steps:

  1. Ingest variants into Delta Lake.

  2. Joint-call the cohort with GenotypeGVCFs.

During variant ingestion, single-sample gVCFs are processed in batches and the rows are stored in Delta Lake, which provides fault tolerance, fast querying, and incremental joint genotyping. In the joint genotyping step, the gVCF rows are read from Delta Lake, split into bins, and distributed across partitions. For each variant site, the relevant gVCF rows for each sample are identified and used for regenotyping.
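The pipeline performs both steps internally, but the ingestion step can be sketched with the open-source Glow and Delta Lake APIs. The following is a minimal illustration with placeholder paths, assuming the ambient spark session of a Databricks notebook; it is not the pipeline's internal implementation.

import glow

# Register Glow so Spark can read the "vcf" data source.
spark = glow.register(spark)

# Read a batch of single-sample gVCFs (placeholder path).
gvcf_df = (
    spark.read.format("vcf")
    .option("includeSampleIds", "true")
    .load("/data/gvcfs/*.g.vcf.bgz")
)

# Append the rows to a Delta table for fault-tolerant, incremental ingest.
(
    gvcf_df.write.format("delta")
    .mode("append")
    .save("/data/delta/gvcfs")
)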

Setup

The pipeline is run as a Databricks job. In most cases, a Databricks solutions architect will work with you to set up the initial job; a sketch of creating the job programmatically appears after the list below. The necessary details are:

{
  "autoscale.min_workers": {
    "type": "unlimited",
    "defaultValue": 1
  },
  "autoscale.max_workers": {
    "type": "unlimited",
    "defaultValue": 25
  },
  "enable_elastic_disk": {
    "type": "fixed",
    "value": true
  },
  "node_type_id": {
    "type": "unlimited",
    "defaultValue": "i3.8xlarge"
  },
  "spark_env_vars.refGenomeId": {
    "type": "unlimited",
    "defaultValue": "grch38"
  },
  "spark_version": {
    "type": "regex",
    "pattern": ".*-hls.*",
    "defaultValue": "7.4.x-hls-scala2.12"
  }
}
  • The cluster configuration should use Databricks Runtime for Genomics.

  • The task should be the joint genotyping pipeline notebook found at the bottom of this page.

  • For best performance, use storage-optimized instances. We recommend i3.8xlarge.

  • To reduce costs, use all spot workers with the Spot fall back to On-demand option selected.

  • To reduce costs, enable autoscaling with a minimum of 1 worker and a maximum of 10 to 50 workers, depending on your latency requirements.

  • Enable autoscaling local storage to ensure that the cluster doesn’t run out of disk space.
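The cluster policy and checklist above translate directly into a job definition. The following sketch creates such a job through the Databricks Jobs API; the workspace URL, token, and notebook path are placeholder assumptions, and the cluster settings mirror the defaults shown above.

import requests

# Placeholder workspace URL and personal access token.
HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

job_spec = {
    "name": "joint-genotyping-pipeline",
    "new_cluster": {
        "spark_version": "7.4.x-hls-scala2.12",  # Databricks Runtime for Genomics
        "node_type_id": "i3.8xlarge",            # storage-optimized instances
        "autoscale": {"min_workers": 1, "max_workers": 25},
        "enable_elastic_disk": True,             # autoscaling local storage
        "spark_env_vars": {"refGenomeId": "grch38"},
    },
    "notebook_task": {
        # Placeholder path to the imported pipeline notebook.
        "notebook_path": "/Pipelines/JointGenotypingPipeline"
    },
}

resp = requests.post(
    f"{HOST}/api/2.0/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json())  # contains the new job_id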

Reference genomes

You must configure the reference genome using environment variables. To use GRCh37, set the environment variable:

refGenomeId=grch37

To use GRCh38, change grch37 to grch38.
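As a quick sanity check (an illustrative snippet, not part of the pipeline), you can confirm the variable from a notebook cell on the cluster:

import os

# Prints "grch37" or "grch38", depending on the cluster's environment variables.
print(os.environ.get("refGenomeId"))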

To use a custom reference genome, see instructions in Custom reference genomes.

Parameters

The pipeline accepts parameters that control its behavior. The most important and most commonly changed parameters are documented here; to view all available parameters and their usage information, run the first cell of the pipeline notebook. New parameters are added regularly. After importing the notebook and setting it up as a job task, you can set these parameters for all runs or per run.

manifest (default: n/a)
    The manifest describing the input.

output (default: n/a)
    The path where pipeline output is written.

replayMode (default: skip)
    One of:
      • skip: stages are skipped if output already exists.
      • overwrite: existing output is deleted.

exportVCF (default: false)
    If true, the pipeline writes results in VCF as well as Delta Lake.

targetedRegions (default: n/a)
    Path to files containing regions to call. If omitted, calls all regions.

gvcfDeltaOutput (default: n/a)
    If specified, gVCFs are ingested to a Delta table before genotyping. You should specify this parameter only if you expect to joint call the same gVCFs many times.

performValidation (default: false)
    If true, the system verifies that each record contains the necessary information for joint genotyping. In particular, it checks that the correct number of genotype probabilities is present.

validationStringency (default: STRICT)
    How to handle malformed records, both during loading and validation:
      • STRICT: fail the job.
      • LENIENT: log a warning and drop the record.
      • SILENT: drop the record without a warning.

Tip

To perform joint calling from an existing Delta table, set gvcfDeltaOutput to the table path and replayMode to skip. You can also provide the manifest, which is used to define the VCF schema and sample list; otherwise, these are inferred from the Delta table. The targetedRegions and performValidation parameters are ignored in this setup.
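For example, a run against an existing gVCF Delta table could be triggered through the Jobs API run-now endpoint. This is a sketch; the job ID and paths are placeholders.

import requests

HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

run_spec = {
    "job_id": 42,  # placeholder job ID
    "notebook_params": {
        "manifest": "/data/manifests/cohort.txt",  # placeholder
        "output": "/data/joint-calls/cohort1",     # placeholder
        "gvcfDeltaOutput": "/data/delta/gvcfs",    # placeholder
        "replayMode": "skip",
        "exportVCF": "true",
    },
}

resp = requests.post(
    f"{HOST}/api/2.0/jobs/run-now",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=run_spec,
)
resp.raise_for_status()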

Output

The regenotyped variants are all written out to Delta tables inside the provided output directory. In addition, if you configured the pipeline to export VCFs, they’ll appear under the output directory as well.

output
|---genotypes
    |---Delta files
|---genotypes.vcf
    |---VCF files
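Once a run completes, the Delta output can be queried directly with Spark. A minimal sketch, assuming the output parameter was set to /data/joint-calls/cohort1 and that the genotypes table follows Glow's VCF schema (hence the contigName column):

# Load the joint-called genotypes from the Delta output (placeholder path).
genotypes = spark.read.format("delta").load("/data/joint-calls/cohort1/genotypes")

# Example query: count joint-called rows per contig.
genotypes.groupBy("contigName").count().show()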

Manifest format

The manifest is a file or blob describing where to find the input single-sample gVCF files, with one file path per row. For example:

HG00096.g.vcf.bgz
HG00097.g.vcf.bgz

Tip

If the provided manifest is a file, each row may be an absolute path or a path relative to the manifest file. If the provided manifest is a blob, each row must be an absolute path. You can include globs (*) to match many files.
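As an illustrative convenience (not part of the pipeline), a manifest can be generated from a directory listing; the paths here are placeholders:

import glob

# Collect all single-sample gVCFs under a placeholder directory.
paths = sorted(glob.glob("/data/gvcfs/*.g.vcf.bgz"))

# Write one absolute path per row.
with open("/data/manifests/cohort.txt", "w") as f:
    f.write("\n".join(paths) + "\n")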

Troubleshooting

Job fails with an ArrayIndexOutOfBoundsException

This error usually indicates that an input record has an incorrect number of genotype probabilities. Try setting the performValidation option to true and the validationStringency option to LENIENT or SILENT.

Additional usage info

The joint genotyping pipeline shares many operational details with the other Databricks pipelines. For more detailed usage information, such as output format structure, tips for running programmatically, and steps for setting up custom reference genomes, see DNASeq pipeline.