SnpEff (v4.3) is run as a Databricks job. Most likely, a Databricks solutions architect will set up the initial job for you. The necessary details are:
- The cluster configuration should use Databricks Runtime HLS.
- The task should be the SnpEffAnnotationPipeline notebook that has been imported into your workspace via the link below.
- To reduce costs, use all spot workers with the **Spot fall back to On-demand** option selected.
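The job settings above can be sketched as a Jobs API `new_cluster` spec. This is a hedged example: the `spark_version` string is an assumption (look up the exact Databricks Runtime HLS label available in your workspace), and the instance types mirror the tested configuration described below.

```python
# Sketch of a Jobs API new_cluster spec matching the guidance above.
# The spark_version label is an assumption; check your workspace for the
# exact Databricks Runtime HLS version string.
cluster_spec = {
    "new_cluster": {
        "spark_version": "6.4.x-hls-scala2.11",  # assumed HLS runtime label
        "driver_node_type_id": "r4.2xlarge",
        "node_type_id": "i3.8xlarge",
        "num_workers": 7,
        "aws_attributes": {
            # All spot workers, falling back to on-demand when spot
            # capacity is unavailable ("Spot fall back to On-demand").
            "availability": "SPOT_WITH_FALLBACK",
            "first_on_demand": 1,  # keep the driver on-demand
        },
    }
}
```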
The pipeline has been tested on 85.2 million variant sites from the 1000 Genomes project using the following cluster configuration:
- Driver: r4.2xlarge
- Workers: i3.8xlarge * 7 (224 cores)
- Runtime: 2.5 hours
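The benchmark above gives a rough throughput figure you can use to size a cluster for your own data, assuming roughly linear scaling with core count (an assumption, not a guarantee):

```python
# Back-of-the-envelope throughput implied by the benchmark:
# 85.2M variant sites in 2.5 hours on 224 cores.
variants = 85.2e6
runtime_s = 2.5 * 3600
cores = 224

per_core_per_s = variants / (runtime_s * cores)  # ~42 variants/core/second
cluster_per_hour = variants / 2.5                # ~34M variants/hour
print(round(per_core_per_s), round(cluster_per_hour / 1e6))
```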
The pipeline accepts a number of parameters that control its behavior. The most important and commonly changed parameters are documented here; the rest can be found in the SnpEff Annotation notebook. Each parameter can be set once for all runs or overridden per run.
| Parameter | Default | Description |
| --- | --- | --- |
| manifest | n/a | Path of a CSV file describing the input, with `file_path` and `sample_id` headers |
| output | n/a | The path where pipeline output should be written |
| exportVCF | false | If true, the pipeline writes results as VCF as well as Delta Lake |
| exportVCFAsSingleFile | false | If true, exports the VCF as a single file |
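When the job is defined through the Jobs API, these parameters are passed to the notebook task as `base_parameters`. The notebook path and DBFS paths below are illustrative, not values from this pipeline:

```python
# Hypothetical notebook_task block; parameter names match the table above,
# while the notebook path and DBFS paths are illustrative placeholders.
notebook_task = {
    "notebook_path": "/Users/you@example.com/SnpEffAnnotationPipeline",
    "base_parameters": {
        "manifest": "dbfs:/mnt/vcf/manifest.csv",
        "output": "dbfs:/mnt/annotated",
        "exportVCF": "true",   # widget values are passed as strings
        "exportVCFAsSingleFile": "false",
    },
}
```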
In addition, you must configure the reference genome using environment variables. To use GRCh37, set the environment variable:
To use GRCh38 instead, set an environment variable like this:
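On Databricks, cluster environment variables can be supplied through the cluster's `spark_env_vars` setting. The variable name and values below are placeholders, not confirmed by this document; substitute the exact name the SnpEff Annotation notebook reads:

```python
# Hypothetical spark_env_vars fragments for the cluster configuration.
# "refGenomeId" and its values are assumed placeholders; use the variable
# name documented in the SnpEff Annotation notebook.
grch37_env = {"spark_env_vars": {"refGenomeId": "grch37"}}
grch38_env = {"spark_env_vars": {"refGenomeId": "grch38"}}
```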
The manifest is a CSV file describing where to find the input VCF files. An example:
```
file_path,sample_id
dbfs:/mnt/vcf/HG001.vcf,HG001
dbfs:/mnt/vcf/HG002.vcf,HG002
```
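For more than a handful of samples, the manifest can be generated programmatically. A minimal sketch using the standard library, reusing the example paths above:

```python
import csv
import io

# Build the manifest CSV in memory from (file_path, sample_id) pairs;
# in practice you would write it out to a path the cluster can read.
samples = [
    ("dbfs:/mnt/vcf/HG001.vcf", "HG001"),
    ("dbfs:/mnt/vcf/HG002.vcf", "HG002"),
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["file_path", "sample_id"])  # required headers
writer.writerows(samples)
manifest_csv = buf.getvalue()
print(manifest_csv)
```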