SnpEff Annotation Pipeline

Beta

The Databricks SnpEff annotation pipeline requires Databricks Runtime HLS, which is in Beta. Interfaces and pricing are subject to change before general availability.

Setup

SnpEff (v4.3) is run as a Databricks job. Most likely, a Databricks solutions architect will set up the initial job for you. The necessary details are:

  • The cluster configuration should use Databricks Runtime HLS.
  • The task should be the SnpEffAnnotationPipeline notebook that has been imported into your workspace via the link below.
  • To reduce costs, use all spot workers with the Spot fall back to On-demand option selected.
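
As a rough illustration, the settings above could be written as a job definition in the style of the Databricks Jobs API. This is a sketch under assumptions, not the official setup: the runtime version string, the notebook path, and the job name are placeholders. The instance types and worker count are borrowed from the benchmark configuration on this page. Setting `first_on_demand` to 1 is one way to keep the driver on-demand while every worker uses spot instances.

```python
import json

# Hypothetical placeholders: replace with the values for your workspace.
HLS_RUNTIME_VERSION = "<databricks-runtime-hls-version>"
NOTEBOOK_PATH = "/Shared/SnpEffAnnotationPipeline"  # assumed import location

job_settings = {
    "name": "snpeff-annotation",  # hypothetical job name
    "new_cluster": {
        "spark_version": HLS_RUNTIME_VERSION,
        "driver_node_type_id": "r4.2xlarge",
        "node_type_id": "i3.8xlarge",
        "num_workers": 7,
        "aws_attributes": {
            # Only the first node (the driver) is on-demand; workers use spot.
            "first_on_demand": 1,
            # Fall back to on-demand if spot capacity is unavailable.
            "availability": "SPOT_WITH_FALLBACK",
        },
    },
    "notebook_task": {"notebook_path": NOTEBOOK_PATH},
}

print(json.dumps(job_settings, indent=2))
```

The printed JSON is only a starting point. The reference genome environment variable and the notebook parameters described later on this page still need to be added.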

Benchmarks

The pipeline has been tested on 85.2 million variant sites from the 1000 Genomes project using the following cluster configuration:

  • Driver: r4.2xlarge
  • Workers: i3.8xlarge * 7 (224 cores)
  • Run time: 2.5 hours
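
For capacity planning, the benchmark figures can be turned into approximate throughput numbers. The short calculation below uses only the values listed above, plus the fact that an i3.8xlarge instance has 32 vCPUs. It is a back-of-the-envelope estimate and assumes nothing about how the pipeline scales on other cluster sizes.

```python
# Figures taken from the benchmark above.
variant_sites = 85.2e6
workers = 7
cores_per_worker = 32  # an i3.8xlarge instance has 32 vCPUs
runtime_hours = 2.5

total_cores = workers * cores_per_worker
core_hours = total_cores * runtime_hours
sites_per_hour = variant_sites / runtime_hours
sites_per_core_hour = variant_sites / core_hours

print(f"total cores:         {total_cores}")
print(f"core-hours:          {core_hours:.0f}")
print(f"sites per hour:      {sites_per_hour:,.0f}")
print(f"sites per core-hour: {sites_per_core_hour:,.0f}")
```

This works out to 560 core-hours for the run, or roughly 34 million sites per hour across the cluster and about 152,000 sites per core-hour.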

Parameters

The pipeline accepts a number of parameters that control its behavior. The most important and commonly changed parameters are documented here; the rest can be found in the SnpEff Annotation notebook. Each parameter can be set either once as a default that applies to all runs or individually for a single run.

Parameter              Default  Description
manifest               n/a      Path of CSV file describing the input, with file_path and sample_id headers.
output                 n/a      The path where pipeline output should be written.
exportVCF              false    If true, the pipeline writes results in VCF as well as Parquet.
exportVCFAsSingleFile  false    If true, exports VCF as single file.
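
The idea of shared defaults with per-run overrides can be sketched in a few lines of Python. The paths and the helper name `params_for_run` are hypothetical. Values are kept as strings because notebook parameters are passed as strings. Only the four parameters documented above are shown.

```python
# Defaults shared by every run (paths are hypothetical examples).
job_defaults = {
    "manifest": "dbfs:/mnt/vcf/manifest.csv",
    "output": "dbfs:/mnt/snpeff/output",
    "exportVCF": "false",
    "exportVCFAsSingleFile": "false",
}


def params_for_run(overrides=None):
    """Return a copy of the defaults with per-run overrides applied."""
    return {**job_defaults, **(overrides or {})}


# A single run that also exports VCF; the shared defaults are unchanged.
run_params = params_for_run({"exportVCF": "true"})
print(run_params)
```

The resulting dictionary is the kind of mapping that could be supplied as notebook parameters when triggering an individual run, while `job_defaults` corresponds to the values stored on the job itself.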

In addition, you must configure the reference genome using an environment variable. To use GRCh37, set the environment variable:

refGenomeId=grch37

To use GRCh38 instead, set the environment variable as follows:

refGenomeId=grch38
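
On a Databricks cluster, environment variables are part of the cluster configuration; in the Clusters API they are carried in the `spark_env_vars` field. The sketch below builds that fragment and rejects values other than the two named above. The helper name `ref_genome_env` is hypothetical, and the assumption that only these two values are accepted comes from this page alone.

```python
SUPPORTED_GENOMES = {"grch37", "grch38"}  # the two values named above


def ref_genome_env(genome_id):
    """Build the environment-variable mapping for the chosen reference genome."""
    if genome_id not in SUPPORTED_GENOMES:
        raise ValueError(f"unsupported reference genome: {genome_id!r}")
    return {"refGenomeId": genome_id}


# Fragment to merge into a cluster specification.
cluster_fragment = {"spark_env_vars": ref_genome_env("grch38")}
print(cluster_fragment)
```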

Manifest format

The manifest is a CSV file describing where to find the input VCF files. An example:

file_path,sample_id
dbfs:/mnt/vcf/HG001.vcf,HG001
dbfs:/mnt/vcf/HG002.vcf,HG002
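
For more than a handful of samples, it can be easier to generate the manifest than to write it by hand. The following minimal sketch uses Python's standard `csv` module to reproduce the example above. The helper name `build_manifest` is hypothetical, and the sample paths are the ones from the example.

```python
import csv
import io


def build_manifest(samples):
    """Render (file_path, sample_id) pairs as manifest CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["file_path", "sample_id"])  # required header row
    writer.writerows(samples)
    return buffer.getvalue()


manifest_text = build_manifest([
    ("dbfs:/mnt/vcf/HG001.vcf", "HG001"),
    ("dbfs:/mnt/vcf/HG002.vcf", "HG002"),
])
print(manifest_text)
```

The resulting text can then be saved to a location the cluster can read, and that location passed as the manifest parameter.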