VEP Annotation Pipeline

Beta

The Databricks VEP annotation pipeline requires Databricks Runtime HLS, which is in Beta. Interfaces and pricing are subject to change before general availability.

Setup

VEP (release 96) is run as a Databricks job. The necessary details are:

  • The cluster configuration should use Databricks Runtime HLS.
  • Set the task to the VEPPipeline notebook, which you can import into your workspace via the link below.
  • For best performance, set the Spark configuration spark.executor.cores 1 and use memory-optimized instances with at least 200 GB of memory. We recommend r5.8xlarge.
  • To reduce costs, use all spot workers with the Spot fall back to On-demand option selected.
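The setup points above can be collected into a single job definition. The following is a minimal sketch, not an official template: the field names follow the Databricks Jobs and Clusters REST API as we understand it (new_cluster, spark_conf, spark_env_vars, aws_attributes, notebook_task), the function name and job name are hypothetical, and the spark_version string must be replaced with the identifier of the Databricks Runtime HLS version available in your workspace. Verify the fields against the current API reference before use.

```python
def build_vep_job_settings(notebook_path, input_vcf, output, ref_genome_id,
                           spark_version, num_workers=2):
    """Assemble a Jobs-API-style settings dict for a VEP annotation run.

    All argument values are supplied by the caller; nothing here contacts
    a Databricks workspace.
    """
    return {
        "name": "vep-annotation",  # hypothetical job name
        "new_cluster": {
            # Assumption: caller passes the Databricks Runtime HLS version key.
            "spark_version": spark_version,
            "node_type_id": "r5.8xlarge",  # recommended memory-optimized instance
            "num_workers": num_workers,
            "spark_conf": {"spark.executor.cores": "1"},
            # Reference genome and transcripts are set by environment variable.
            "spark_env_vars": {"refGenomeId": ref_genome_id},
            "aws_attributes": {
                # Assumption: first_on_demand=1 keeps only the driver on-demand,
                # so every worker is a spot instance.
                "first_on_demand": 1,
                # API counterpart of "Spot fall back to On-demand".
                "availability": "SPOT_WITH_FALLBACK",
            },
        },
        "notebook_task": {
            "notebook_path": notebook_path,
            "base_parameters": {"inputVcf": input_vcf, "output": output},
        },
    }
```

The resulting dictionary can be serialized with json.dumps and submitted through whichever client you use for job creation; the paths passed in are yours to choose.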

Parameters

The pipeline accepts a number of parameters that control its behavior. All parameters can be set for all runs or per run.

Parameter        Default                                          Description
inputVcf         n/a                                              Path of the VCF file to annotate with VEP.
output           n/a                                              Path where pipeline output should be written.
replayMode       skip                                             One of: skip (stages are skipped if output already exists) or overwrite (existing output is deleted).
exportVCF        false                                            If true, pipeline writes results as both VCF and Delta Lake.
extraVepOptions  --everything --minimal --allele_number --fork 4  Additional command line options to pass to VEP. Some options are set by the pipeline and cannot be overridden: --assembly, --cache, --dir_cache, --fasta, --format, --merged, --no_stats, --offline, --output_file, --refseq, --vcf. See all possible options on the VEP site.
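Because some VEP options are reserved by the pipeline, it can help to check a custom extraVepOptions value before starting a run. The sketch below is a local pre-flight check only; the function and constant names are hypothetical, and how the pipeline itself reacts to a reserved option is not described here.

```python
import shlex

# Options set by the pipeline that cannot be overridden (from the table above).
RESERVED_VEP_OPTIONS = frozenset({
    "--assembly", "--cache", "--dir_cache", "--fasta", "--format", "--merged",
    "--no_stats", "--offline", "--output_file", "--refseq", "--vcf",
})

# Default value of the extraVepOptions parameter.
DEFAULT_EXTRA_VEP_OPTIONS = "--everything --minimal --allele_number --fork 4"


def find_reserved_options(extra_vep_options):
    """Return reserved flags found in an extraVepOptions string, in order."""
    found = []
    for token in shlex.split(extra_vep_options):
        # Treat "--flag=value" the same as "--flag value".
        flag = token.split("=", 1)[0]
        if flag in RESERVED_VEP_OPTIONS and flag not in found:
            found.append(flag)
    return found
```

For example, find_reserved_options("--everything --cache --fork 8") reports --cache, which should be removed from the value before it is passed to the job.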

In addition, you must configure the reference genome and transcripts using environment variables. To use GRCh37 with merged Ensembl and RefSeq transcripts, set the environment variable:

refGenomeId=grch37_merged_vep_96

The refGenomeId values for all combinations of reference genome and transcript set are listed below:

          grch37                grch38
Ensembl   grch37_vep_96         grch38_vep_96
RefSeq    grch37_refseq_vep_96  grch38_refseq_vep_96
Merged    grch37_merged_vep_96  grch38_merged_vep_96
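If jobs are generated programmatically, the table above can be encoded as a lookup so that an unsupported combination fails early. This is an illustrative sketch; the function and constant names are hypothetical and the values are copied from the table, not derived from any Databricks API.

```python
# refGenomeId values for VEP release 96, keyed by (genome, transcript set).
REF_GENOME_IDS = {
    ("grch37", "ensembl"): "grch37_vep_96",
    ("grch38", "ensembl"): "grch38_vep_96",
    ("grch37", "refseq"): "grch37_refseq_vep_96",
    ("grch38", "refseq"): "grch38_refseq_vep_96",
    ("grch37", "merged"): "grch37_merged_vep_96",
    ("grch38", "merged"): "grch38_merged_vep_96",
}


def ref_genome_id(genome, transcripts):
    """Look up the refGenomeId for a genome and transcript set (case-insensitive)."""
    key = (genome.lower(), transcripts.lower())
    if key not in REF_GENOME_IDS:
        raise ValueError(
            f"Unsupported combination: genome={genome!r}, transcripts={transcripts!r}"
        )
    return REF_GENOME_IDS[key]
```

For example, ref_genome_id("GRCh37", "Merged") returns grch37_merged_vep_96, the value used in the environment variable example above.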

VEP annotation pipeline