Pre-packaged SnpEff annotation pipeline


Run SnpEff (v4.3) as a Databricks job. Most likely, a Databricks solutions architect will set up the initial job for you. The necessary details are:


The pipeline has been tested on 85.2 million variant sites from the 1000 Genomes project using the following cluster configuration:

  • Driver: r4.2xlarge
  • Workers: i3.8xlarge * 7 (224 cores)
  • Total run time: 2.5 hours
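
As a rough planning aid, the benchmark figures above can be turned into a per-core throughput and scaled to other dataset sizes. The sketch below assumes 32 cores per worker (224 cores across 7 workers). It also assumes that run time scales linearly with the number of variant sites and inversely with the number of cores. That scaling is an assumption, not something the benchmark demonstrates, and the function names are illustrative only.

```python
# Benchmark figures taken from the tested configuration above.
SITES_TESTED = 85_200_000
WORKERS = 7
CORES_PER_WORKER = 32  # 224 cores / 7 workers
RUNTIME_HOURS = 2.5


def sites_per_core_hour() -> float:
    """Variant sites annotated per core per hour in the benchmark run."""
    total_cores = WORKERS * CORES_PER_WORKER
    return SITES_TESTED / (total_cores * RUNTIME_HOURS)


def estimate_runtime_hours(n_sites: int, total_cores: int) -> float:
    """Naive run time estimate, ASSUMING linear scaling in sites and cores.

    Fixed costs (cluster startup, reference download, VCF export) are ignored,
    so treat the result as a rough guide, not a prediction.
    """
    if n_sites < 0:
        raise ValueError("n_sites must be non-negative")
    if total_cores <= 0:
        raise ValueError("total_cores must be positive")
    return n_sites / (sites_per_core_hour() * total_cores)


if __name__ == "__main__":
    # Example: the benchmark dataset on a cluster with twice as many cores.
    print(f"{estimate_runtime_hours(SITES_TESTED, 448):.2f} hours")
```

Under these assumptions, doubling the worker count halves the estimate. Real runs rarely scale this cleanly, so a small trial run on a subset of your data is a better guide.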


The pipeline accepts a number of parameters that control its behavior. The most important and commonly changed parameters are documented here; the rest can be found in the SnpEff Annotation pipeline notebook. All parameters can be set for all runs or per-run.

Parameter              Default  Description
inputVariants          n/a      Path of input variants (VCF or Delta Lake).
output                 n/a      The path where pipeline output should be written.
exportVCF              false    If true, the pipeline writes results in VCF as well as Delta Lake.
exportVCFAsSingleFile  false    If true, exports the VCF as a single file.
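
To illustrate per-run parameters, the sketch below builds a set of parameter values from the names and defaults listed above and wraps them in a request body of the shape used by the Databricks Jobs API `run-now` endpoint (`job_id` plus `notebook_params`). It assumes the job runs the pipeline as a notebook task and that parameter values are passed as strings. The job ID and paths are hypothetical placeholders, and `build_notebook_params` is an illustrative helper, not part of the pipeline.

```python
import json

# Optional parameters and their defaults, as documented in the table above.
DEFAULTS = {"exportVCF": False, "exportVCFAsSingleFile": False}


def build_notebook_params(input_variants: str, output: str, **overrides: bool) -> dict:
    """Return per-run parameters as a dict of strings.

    ASSUMPTION: the job is a notebook task, so values are passed as strings.
    """
    if not input_variants or not output:
        raise ValueError("inputVariants and output have no default and must be set")

    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown parameters: {sorted(unknown)}")

    params = {"inputVariants": input_variants, "output": output}
    for name, default in DEFAULTS.items():
        value = overrides.get(name, default)
        if not isinstance(value, bool):
            raise TypeError(f"{name} must be True or False")
        params[name] = str(value).lower()  # True -> "true"
    return params


if __name__ == "__main__":
    payload = {
        "job_id": 123,  # hypothetical job ID
        "notebook_params": build_notebook_params(
            "dbfs:/tmp/example/input.vcf",  # hypothetical input path
            "dbfs:/tmp/example/annotated",  # hypothetical output path
            exportVCF=True,
        ),
    }
    print(json.dumps(payload, indent=2))
```

Validating names against the documented set catches typos such as `exportVcf` before a long-running job is submitted. Parameters documented only in the notebook would need to be added to `DEFAULTS`.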

In addition, you must configure the reference genome using environment variables. To use GRCh37, set the environment variable:


To use GRCh38 instead, set an environment variable like this:



The annotated variants are written out to Delta tables inside the provided output directory. If you configured the pipeline to export to VCF, they’ll appear under the output directory as well.

    |---Delta files

SnpEff annotation pipeline notebook
