Variant Annotation using Pipe Transformer

Important

This documentation has been retired and might not be updated. The products, services, or technologies mentioned in this content are no longer supported.

The Databricks Genomics runtime has been deprecated. For open source equivalents, see the genomics-pipelines and Glow repositories. Bioinformatics libraries that were part of the runtime have been released as a Docker container, which can be pulled from the ProjectGlow Docker Hub page.

For more information about the Databricks Runtime deprecation policy and schedule, see All supported Databricks Runtime releases.

Glow’s Pipe Transformer lets you apply any command-line annotation tool to variant data.
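The transformer takes the command to run as a JSON-encoded array of argument strings and streams each partition of the DataFrame through that command. A minimal sketch of the pattern, using `cat` as an identity command to illustrate the plumbing (the `glow.transform` call is shown as a comment because it requires a live Spark session with Glow; `df` is an assumed VCF DataFrame):

```python
import json

# The pipe command is a JSON-encoded list of the executable and its
# arguments. `cat` simply echoes its input, making it a convenient way
# to verify the pipe plumbing before wiring in a real annotation tool.
cmd = json.dumps(["cat", "-"])

# On a cluster with Glow registered (assumption: `spark` exists and
# `df` is a DataFrame read with the "vcf" data source), the call is:
#
#   import glow
#   piped = glow.transform(
#       "pipe", df, cmd=cmd,
#       input_formatter="vcf", in_vcf_header="infer",
#       output_formatter="vcf")
```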

For example, to annotate with the Ensembl Variant Effect Predictor (VEP), download the annotation data sources (the cache) to each node in the cluster and invoke the VEP command-line script through the Pipe Transformer, as in the following cell.

import glow
import json

# Read the 1000 Genomes sample VCF into a DataFrame.
input_vcf = "/databricks-datasets/hail/data-001/1kg_sample.vcf.bgz"
input_df = spark.read.format("vcf").load(input_vcf)

# The pipe command is a JSON-encoded array of the executable and its
# arguments. VEP reads VCF records on stdin and writes annotated VCF
# records to stdout, using the pre-downloaded offline cache.
cmd = json.dumps([
  "/opt/vep/src/ensembl-vep/vep",
  "--dir_cache", "/mnt/dbnucleus/dbgenomics/grch37_merged_vep_96",
  "--fasta", "/mnt/dbnucleus/dbgenomics/grch37_merged_vep_96/data/human_g1k_v37.fa",
  "--assembly", "GRCh37",
  "--format", "vcf",
  "--output_file", "STDOUT",  # emit annotated records on stdout for the pipe
  "--no_stats",
  "--cache",
  "--offline",
  "--vcf",
  "--merged"])

# Stream each partition through VEP as VCF text and parse the annotated
# output back into a DataFrame.
output_df = glow.transform(
    "pipe",
    input_df,
    cmd=cmd,
    input_formatter="vcf",
    in_vcf_header=input_vcf,
    output_formatter="vcf")

# Save the annotated variants as a Delta table.
output_df.write.format("delta").save("dbfs:/mnt/vep-pipe")
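After the pipe runs, VEP places its annotations in the INFO CSQ field as a pipe-delimited string whose subfield order is declared in the `##INFO=<ID=CSQ,...>` line of the output VCF header. A minimal sketch of unpacking one such value, assuming the hypothetical five-subfield format below (a real run's header typically declares many more subfields, in an order set by VEP's options):

```python
# Hypothetical CSQ subfield order; the authoritative order comes from
# the '##INFO=<ID=CSQ,...>' header line of the annotated VCF.
csq_fields = ["Allele", "Consequence", "IMPACT", "SYMBOL", "Gene"]

# One CSQ value as VEP emits it: subfields separated by '|'.
csq_value = "A|missense_variant|MODERATE|BRCA1|ENSG00000012048"

# Pair each subfield name with its value.
record = dict(zip(csq_fields, csq_value.split("|")))
```

In the Delta output, Glow's VCF schema surfaces INFO fields as columns, so the same splitting logic can be applied per row to the CSQ column.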