Pre-packaged VEP annotation pipeline

Setup

Run VEP (release 96) as a Databricks job. The necessary details are:

  • Cluster configuration
    • Databricks Runtime for Genomics
    • For best performance, set the Spark configuration spark.executor.cores to 1 and use memory-optimized instances with at least 200 GB of memory. We recommend r5.8xlarge.
    • To reduce costs, use all spot workers with the Spot fall back to On-demand option selected.
  • Set the task to the VEPPipeline notebook imported into your workspace.
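The cluster settings above can be captured in a Jobs API cluster specification. The sketch below is an assumption-laden example, not an official template: the spark_version string is a placeholder you must replace with your Databricks Runtime for Genomics version, num_workers is illustrative, and refGenomeId is shown for GRCh37 with merged transcripts (see Parameters below).

```shell
#!/bin/bash
# Sketch of the new_cluster portion of a Databricks Jobs API request.
# <genomics-runtime-version> is a placeholder; num_workers is illustrative.
cat > cluster.json <<'EOF'
{
  "new_cluster": {
    "spark_version": "<genomics-runtime-version>",
    "node_type_id": "r5.8xlarge",
    "num_workers": 8,
    "spark_conf": { "spark.executor.cores": "1" },
    "spark_env_vars": { "refGenomeId": "grch37_merged_vep_96" },
    "aws_attributes": {
      "first_on_demand": 1,
      "availability": "SPOT_WITH_FALLBACK"
    }
  }
}
EOF
```

Here first_on_demand keeps the driver on-demand while all workers are spot instances that fall back to on-demand, matching the cost recommendation above.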

Parameters

The pipeline accepts a number of parameters that control its behavior. Each parameter can be set once for all runs or varied per run.

Parameter        Default   Description
inputVcf         n/a       Path of the VCF file to annotate with VEP.
output           n/a       Path where the pipeline output should be written.
replayMode       skip      One of:
                             • skip: if output already exists, stages are skipped.
                             • overwrite: existing output is deleted.
exportVCF        false     If true, the pipeline writes results as both VCF and
                           Delta Lake.
extraVepOptions  --everything --minimal --allele_number --fork 4
                           Additional command-line options to pass to VEP. Some
                           options are set by the pipeline and cannot be
                           overridden: --assembly, --cache, --dir_cache, --fasta,
                           --format, --merged, --no_stats, --offline,
                           --output_file, --refseq, --vcf. See all possible
                           options on the VEP site.

In addition, you must configure the reference genome and transcripts using an environment variable. For example, to use GRCh37 with merged Ensembl and RefSeq transcripts, set the environment variable:

refGenomeId=grch37_merged_vep_96

The refGenomeId values for all pairs of reference genome and transcript set are listed below:

          grch37                  grch38
Ensembl   grch37_vep_96           grch38_vep_96
RefSeq    grch37_refseq_vep_96    grch38_refseq_vep_96
Merged    grch37_merged_vep_96    grch38_merged_vep_96
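Every id in the table follows the same pattern: the build, an optional transcript set, then the VEP release. As a sketch, with BUILD, TRANSCRIPTS, and VEP_RELEASE as illustrative variables (they are not parameters the pipeline itself reads; only the resulting refGenomeId matters):

```shell
# Illustrative variables; only the final refGenomeId value is meaningful.
BUILD=grch37        # grch37 or grch38
TRANSCRIPTS=merged  # empty for Ensembl-only, or refseq, or merged
VEP_RELEASE=96

if [ -z "$TRANSCRIPTS" ]; then
  refGenomeId="${BUILD}_vep_${VEP_RELEASE}"
else
  refGenomeId="${BUILD}_${TRANSCRIPTS}_vep_${VEP_RELEASE}"
fi
echo "refGenomeId=$refGenomeId"
```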

LOFTEE

You can run VEP with plugins to extend, filter, or manipulate its output. Set up LOFTEE with the following instructions for your desired reference genome.

grch37

Create a LOFTEE cluster using an init script.

#!/bin/bash

# Download LOFTEE and add it to the Perl library path
DIR_VEP_PLUGINS=/opt/vep/Plugins
mkdir -p $DIR_VEP_PLUGINS
cd $DIR_VEP_PLUGINS
echo export PERL5LIB=$PERL5LIB:$DIR_VEP_PLUGINS/loftee >> /databricks/spark/conf/spark-env.sh
git clone --depth 1 --branch master https://github.com/konradjk/loftee.git

We recommend creating a mount point to store any additional files in cloud storage; these files can then be accessed through the FUSE mount. Replace <mount-point> in the following scripts with your mount point path.

If desired, save the ancestral sequence at the mount point.

cd <mount-point>
wget https://s3.amazonaws.com/bcbio_nextgen/human_ancestor.fa.gz
wget https://s3.amazonaws.com/bcbio_nextgen/human_ancestor.fa.gz.fai
wget https://s3.amazonaws.com/bcbio_nextgen/human_ancestor.fa.gz.gzi

If desired, save the PhyloCSF database at the mount point.

cd <mount-point>
wget https://personal.broadinstitute.org/konradk/loftee_data/GRCh37/phylocsf_gerp.sql.gz
gunzip phylocsf_gerp.sql.gz

When running the VEP pipeline, pass the corresponding extra options via the extraVepOptions parameter.

--dir_plugins /opt/vep/Plugins --plugin LoF,loftee_path:/opt/vep/Plugins/loftee,human_ancestor_fa:<mount-point>/human_ancestor.fa.gz,conservation_file:<mount-point>/phylocsf_gerp.sql
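For reference, the option string above can be assembled in shell with the mount point factored out. MOUNT_POINT below is an example value, not a path provided by the pipeline; substitute your own FUSE mount path.

```shell
# MOUNT_POINT is an example value; substitute your own FUSE mount path.
MOUNT_POINT=/dbfs/mnt/loftee-data
LOFTEE_PATH=/opt/vep/Plugins/loftee

EXTRA_VEP_OPTIONS="--dir_plugins /opt/vep/Plugins \
--plugin LoF,loftee_path:${LOFTEE_PATH},\
human_ancestor_fa:${MOUNT_POINT}/human_ancestor.fa.gz,\
conservation_file:${MOUNT_POINT}/phylocsf_gerp.sql"

echo "$EXTRA_VEP_OPTIONS"
```

Omit the human_ancestor_fa and conservation_file entries if you skipped the corresponding downloads.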

grch38

Using an init script, create a LOFTEE cluster that can parse BigWig files.

#!/bin/bash

# Download LOFTEE
DIR_VEP_PLUGINS=/opt/vep/Plugins
mkdir -p $DIR_VEP_PLUGINS
cd $DIR_VEP_PLUGINS
echo export PERL5LIB=$PERL5LIB:$DIR_VEP_PLUGINS/loftee >> /databricks/spark/conf/spark-env.sh
git clone --depth 1 --branch grch38 https://github.com/konradjk/loftee.git

# Download Kent source tree
mkdir -p /tmp/bigfile
cd /tmp/bigfile
wget https://github.com/ucscGenomeBrowser/kent/archive/v335_base.tar.gz
tar xzf v335_base.tar.gz

# Build Kent source
export KENT_SRC=$PWD/kent-335_base/src
export MACHTYPE=$(uname -m)
export CFLAGS="-fPIC"
export MYSQLINC=$(mysql_config --include | sed -e 's/^-I//g')
export MYSQLLIBS=$(mysql_config --libs)
cd $KENT_SRC/lib
echo 'CFLAGS="-fPIC"' > ../inc/localEnvironment.mk
make clean
make
cd ../jkOwnLib
make clean
make

# Install Bio::DB::BigFile
cpanm Bio::Perl
cpanm Bio::DB::BigFile

We recommend creating a mount point to store any additional files in cloud storage; these files can then be accessed through the FUSE mount. Replace <mount-point> in the following scripts with your mount point path.

Save the GERP scores BigWig at the mount point.

cd <mount-point>
wget https://personal.broadinstitute.org/konradk/loftee_data/GRCh38/gerp_conservation_scores.homo_sapiens.GRCh38.bw

If desired, save the ancestral sequence at the mount point.

cd <mount-point>
wget https://personal.broadinstitute.org/konradk/loftee_data/GRCh38/human_ancestor.fa.gz
wget https://personal.broadinstitute.org/konradk/loftee_data/GRCh38/human_ancestor.fa.gz.fai
wget https://personal.broadinstitute.org/konradk/loftee_data/GRCh38/human_ancestor.fa.gz.gzi

If desired, save the PhyloCSF database at the mount point.

cd <mount-point>
wget https://personal.broadinstitute.org/konradk/loftee_data/GRCh38/loftee.sql.gz
gunzip loftee.sql.gz

When running the VEP pipeline, pass the corresponding extra options via the extraVepOptions parameter.

--dir_plugins /opt/vep/Plugins --plugin LoF,loftee_path:/opt/vep/Plugins/loftee,gerp_bigwig:<mount-point>/gerp_conservation_scores.homo_sapiens.GRCh38.bw,human_ancestor_fa:<mount-point>/human_ancestor.fa.gz,conservation_file:<mount-point>/loftee.sql
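Before launching the job, a quick pre-flight check can confirm the LOFTEE data files are actually visible at the mount point. This is a sketch: MOUNT_POINT is an example value, and the human_ancestor files are only needed if you pass human_ancestor_fa to the plugin.

```shell
# Pre-flight check for the grch38 LOFTEE data files. MOUNT_POINT is an
# example value; the human_ancestor files are only required if you pass
# human_ancestor_fa to the LoF plugin.
MOUNT_POINT=${MOUNT_POINT:-/dbfs/mnt/loftee-data}

missing=0
for f in gerp_conservation_scores.homo_sapiens.GRCh38.bw \
         human_ancestor.fa.gz human_ancestor.fa.gz.fai human_ancestor.fa.gz.gzi \
         loftee.sql; do
  if [ ! -f "$MOUNT_POINT/$f" ]; then
    echo "missing: $MOUNT_POINT/$f"
    missing=1
  fi
done

if [ "$missing" -eq 0 ]; then
  echo "all LOFTEE data files present"
else
  echo "some LOFTEE data files are missing"
fi
```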