VEP (release 96) is run as a Databricks job. The necessary details are:
- The cluster configuration should use Databricks Runtime HLS.
- Set the task as the VEPPipeline notebook, which you can import into your workspace via the link below.
- For best performance, set the Spark configuration spark.executor.cores 1 and use memory-optimized instances with at least 200 GB of memory. We recommend
- To reduce costs, use all spot workers with the Spot fall back to On-demand option selected.
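The cluster settings above can be sketched as a Python dictionary shaped like the `new_cluster` object of the Databricks Clusters API. This is only an illustration under stated assumptions: the helper name `build_vep_cluster_spec`, the placeholder runtime version, the placeholder node type, and the worker count are hypothetical, and the `aws_attributes` block assumes an AWS workspace (other clouds use different attribute names).

```python
import json


def build_vep_cluster_spec(spark_version, node_type_id, num_workers):
    """Build a cluster spec dict for a VEP job (hypothetical helper).

    spark_version and node_type_id are passed in by the caller because the
    exact runtime and instance type depend on the workspace.
    """
    return {
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "num_workers": num_workers,
        # One core per executor, as recommended for best performance.
        "spark_conf": {"spark.executor.cores": "1"},
        # Assumption: AWS workspace. Keep only the driver on-demand and let
        # spot workers fall back to on-demand when spot capacity runs out.
        "aws_attributes": {
            "first_on_demand": 1,
            "availability": "SPOT_WITH_FALLBACK",
        },
    }


# Placeholder values: substitute a Databricks Runtime HLS version and a
# memory-optimized instance type with at least 200 GB of memory.
spec = build_vep_cluster_spec(
    "<hls-runtime-version>", "<memory-optimized-node-type>", 4
)
print(json.dumps(spec, indent=2))
```

Building the spec in code rather than through the UI makes it easier to keep the executor-core and spot settings consistent across jobs.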
The pipeline accepts a number of parameters that control its behavior. All parameters can be set for all runs or per run.
| Parameter | Default | Description |
|---|---|---|
| inputVcf | n/a | Path of the VCF file to annotate with VEP. |
| output | n/a | Path where pipeline output should be written. |
| exportVCF | false | If true, pipeline writes results as both VCF and Delta Lake. |
| | | Additional command line options to pass to VEP. Some options are set by the pipeline and cannot be overridden: |
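As a rough illustration of setting parameters per run, the sketch below assembles the named parameters into a string-to-string map of the kind notebook jobs accept. The helper name `build_vep_params` and the example paths are hypothetical, and only the three parameters whose names appear above are covered.

```python
def build_vep_params(input_vcf, output, export_vcf=False):
    """Assemble per-run notebook parameters for the VEP pipeline.

    Hypothetical helper: it covers only inputVcf, output, and exportVCF.
    Values are returned as strings because notebook parameters are passed
    as text.
    """
    if not input_vcf:
        raise ValueError("inputVcf has no default and must be provided")
    if not output:
        raise ValueError("output has no default and must be provided")
    return {
        "inputVcf": input_vcf,
        "output": output,
        # Render the boolean as "true"/"false" to match the table's default.
        "exportVCF": str(bool(export_vcf)).lower(),
    }


# Example paths are placeholders, not locations defined by the pipeline.
params = build_vep_params(
    "dbfs:/tmp/example/input.vcf", "dbfs:/tmp/example/vep-output", True
)
print(params)
```

Validating the two required paths before submitting a run avoids starting a cluster only to fail on a missing parameter.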
In addition, you must configure the reference genome and transcripts using environment variables. To use GRCh37 with merged Ensembl and RefSeq transcripts, set the environment variable:
The refGenomeId values for all pairs of reference genomes and transcripts are listed below: