Pre-packaged VEP annotation pipeline

Important

This documentation has been retired and might not be updated. The products, services, or technologies mentioned in this content are no longer supported.

The Databricks Genomics runtime has been deprecated. For open source equivalents, see the genomics-pipelines and Glow repositories. Bioinformatics libraries that were part of the runtime have been released as a Docker container, which can be pulled from the ProjectGlow Docker Hub page.

For more information about the Databricks Runtime deprecation policy and schedule, see All supported Databricks Runtime releases.

Setup

Run the Ensembl Variant Effect Predictor (VEP, release 96) as a Databricks job.

Reference genomes

You must configure the reference genome and transcripts using environment variables. To use GRCh37 with merged Ensembl and RefSeq transcripts, set the environment variable:

refGenomeId=grch37_merged_vep_96

The `refGenomeId` values for each pair of reference genome and transcript set are listed below:

| Transcripts | GRCh37 | GRCh38 |
| --- | --- | --- |
| Ensembl | `grch37_vep_96` | `grch38_vep_96` |
| RefSeq | `grch37_refseq_vep_96` | `grch38_refseq_vep_96` |
| Merged | `grch37_merged_vep_96` | `grch38_merged_vep_96` |
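If you define the cluster programmatically rather than through the UI, the same variable goes in the cluster specification's `spark_env_vars` field. A minimal sketch, assuming the Clusters API 2.0 JSON schema; the fragment is meant to be merged into your full cluster spec, and the file name is a placeholder:

# Write a partial cluster spec fragment (sketch; merge into your full
# Clusters/Jobs API JSON before submitting)
cat > refgenome-env.json <<'EOF'
{
  "spark_env_vars": {
    "refGenomeId": "grch37_merged_vep_96"
  }
}
EOF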

Parameters

The pipeline accepts a number of parameters that control its behavior. After importing the notebook and setting it up as a job task, you can set these parameters for all runs or per run.

| Parameter | Default | Description |
| --- | --- | --- |
| `inputVcf` | n/a | Path of the VCF file to annotate with VEP. |
| `output` | n/a | Path where pipeline output should be written. |
| `replayMode` | `skip` | One of:<br>• `skip`: if output already exists, stages are skipped.<br>• `overwrite`: existing output is deleted. |
| `exportVCF` | `false` | If `true`, the pipeline writes results as both VCF and Delta Lake. |
| `extraVepOptions` | `--everything --minimal --allele_number --fork 4` | Additional command line options to pass to VEP. Some options are set by the pipeline and cannot be overridden: `--assembly`, `--cache`, `--dir_cache`, `--fasta`, `--format`, `--merged`, `--no_stats`, `--offline`, `--output_file`, `--refseq`, `--vcf`. See all possible options on the VEP site. |
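One way to supply the per-run parameters is as notebook parameters when triggering the job. A minimal sketch, assuming the legacy Databricks CLI; the job ID and paths are placeholders:

# Trigger one annotation run with per-run parameters (sketch)
databricks jobs run-now --job-id 123 \
  --notebook-params '{
    "inputVcf": "dbfs:/genomics/input/sample.vcf.gz",
    "output": "dbfs:/genomics/output/sample_annotated",
    "exportVCF": "true"
  }'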

LOFTEE

You can run VEP with plugins that extend, filter, or manipulate its output. Set up the LOFTEE plugin with the following instructions, according to the desired reference genome.

grch37

Create a LOFTEE cluster using an init script.

#!/bin/bash

# Download LOFTEE and add it to the Perl library path used by VEP
DIR_VEP_PLUGINS=/opt/vep/Plugins
mkdir -p $DIR_VEP_PLUGINS
cd $DIR_VEP_PLUGINS
echo export PERL5LIB=$PERL5LIB:$DIR_VEP_PLUGINS/loftee >> /databricks/spark/conf/spark-env.sh
git clone --depth 1 --branch master https://github.com/konradjk/loftee.git
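To attach the script to the cluster, store it somewhere the cluster can read at startup and reference it as a cluster-scoped init script. A minimal sketch, assuming the legacy Databricks CLI and a DBFS destination; the file names and paths are placeholders:

# Upload the init script to DBFS (sketch; paths are placeholders)
databricks fs cp loftee-init.sh dbfs:/databricks/scripts/loftee-init.sh

# Then reference it in the cluster spec, for example:
#   "init_scripts": [{"dbfs": {"destination": "dbfs:/databricks/scripts/loftee-init.sh"}}]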

Create a mount point to store additional files in cloud storage. See What is the Databricks File System (DBFS)?. In the scripts below, replace <mount-point> with your mount point.

If desired, save the ancestral sequence at the mount point.

cd <mount-point>
wget https://s3.amazonaws.com/bcbio_nextgen/human_ancestor.fa.gz
wget https://s3.amazonaws.com/bcbio_nextgen/human_ancestor.fa.gz.fai
wget https://s3.amazonaws.com/bcbio_nextgen/human_ancestor.fa.gz.gzi

If desired, save the PhyloCSF database at the mount point.

cd <mount-point>
wget https://personal.broadinstitute.org/konradk/loftee_data/GRCh37/phylocsf_gerp.sql.gz
gunzip phylocsf_gerp.sql.gz

When running the VEP pipeline, provide the corresponding extra options.

--dir_plugins /opt/vep/Plugins --plugin LoF,loftee_path:/opt/vep/Plugins/loftee,human_ancestor_fa:<mount-point>/human_ancestor.fa.gz,conservation_file:<mount-point>/phylocsf_gerp.sql
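These flags are passed through the pipeline's `extraVepOptions` parameter. A minimal sketch, assuming the legacy Databricks CLI, a `/mnt/loftee` mount point, and that you want to keep the default VEP options alongside the plugin flags; the job ID and all paths are placeholders. The same pattern applies to the GRCh38 options shown below.

# Run with LOFTEE enabled via extraVepOptions (sketch; all values are placeholders)
databricks jobs run-now --job-id 123 \
  --notebook-params '{
    "inputVcf": "dbfs:/genomics/input/sample.vcf.gz",
    "output": "dbfs:/genomics/output/sample_annotated",
    "extraVepOptions": "--everything --minimal --allele_number --fork 4 --dir_plugins /opt/vep/Plugins --plugin LoF,loftee_path:/opt/vep/Plugins/loftee,human_ancestor_fa:/mnt/loftee/human_ancestor.fa.gz,conservation_file:/mnt/loftee/phylocsf_gerp.sql"
  }'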

grch38

Create a LOFTEE cluster that can parse BigWig files using an init script.

#!/bin/bash

# Download LOFTEE
DIR_VEP_PLUGINS=/opt/vep/Plugins
mkdir -p $DIR_VEP_PLUGINS
cd $DIR_VEP_PLUGINS
echo export PERL5LIB=$PERL5LIB:$DIR_VEP_PLUGINS/loftee >> /databricks/spark/conf/spark-env.sh
git clone --depth 1 --branch grch38 https://github.com/konradjk/loftee.git

# Download Kent source tree
mkdir -p /tmp/bigfile
cd /tmp/bigfile
wget https://github.com/ucscGenomeBrowser/kent/archive/v335_base.tar.gz
tar xzf v335_base.tar.gz

# Build Kent source
export KENT_SRC=$PWD/kent-335_base/src
export MACHTYPE=$(uname -m)
export CFLAGS="-fPIC"
export MYSQLINC=`mysql_config --include | sed -e 's/^-I//g'`
export MYSQLLIBS=`mysql_config --libs`
cd $KENT_SRC/lib
echo 'CFLAGS="-fPIC"' > ../inc/localEnvironment.mk
make clean
make
cd ../jkOwnLib
make clean
make

# Install Bio::DB::BigFile
cpanm --notest Bio::Perl
cpanm --notest Bio::DB::BigFile
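After the cluster starts, you can sanity-check that the Perl module built against the Kent libraries actually loads, for example from a %sh notebook cell. A one-line sketch:

# Loads Bio::DB::BigFile and exits 0 on success
perl -MBio::DB::BigFile -e 1 && echo "Bio::DB::BigFile OK"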

Create a mount point to store any additional files in cloud storage. See What is the Databricks File System (DBFS)?. In the scripts below, replace <mount-point> with your mount point.

Save the GERP scores BigWig at the mount point.

cd <mount-point>
wget https://personal.broadinstitute.org/konradk/loftee_data/GRCh38/gerp_conservation_scores.homo_sapiens.GRCh38.bw

If desired, save the ancestral sequence at the mount point.

cd <mount-point>
wget https://personal.broadinstitute.org/konradk/loftee_data/GRCh38/human_ancestor.fa.gz
wget https://personal.broadinstitute.org/konradk/loftee_data/GRCh38/human_ancestor.fa.gz.fai
wget https://personal.broadinstitute.org/konradk/loftee_data/GRCh38/human_ancestor.fa.gz.gzi

If desired, save the PhyloCSF database at the mount point.

cd <mount-point>
wget https://personal.broadinstitute.org/konradk/loftee_data/GRCh38/loftee.sql.gz
gunzip loftee.sql.gz

When running the VEP pipeline, provide the corresponding extra options.

--dir_plugins /opt/vep/Plugins --plugin LoF,loftee_path:/opt/vep/Plugins/loftee,gerp_bigwig:<mount-point>/gerp_conservation_scores.homo_sapiens.GRCh38.bw,human_ancestor_fa:<mount-point>/human_ancestor.fa.gz,conservation_file:<mount-point>/loftee.sql