Variant Annotation using Pipe Transformer

Important

This documentation has been retired and might not be updated. The product, service, or technology mentioned in this content is no longer supported.

The Databricks Genomics runtime has been deprecated. For open source equivalents, see repos for genomics-pipelines and Glow. Bioinformatics libraries that were part of the runtime have been released as a Docker container, which can be pulled from the ProjectGlow Dockerhub page.

For more information about the Databricks Runtime deprecation policy and schedule, see Supported Databricks runtime releases and support schedule.

Any annotation method can be used on variant data using Glow’s Pipe Transformer.

For example, VEP annotation is performed by downloading annotation data sources (the cache) to each node in a cluster and calling the VEP command line script with the Pipe Transformer using a script similar to the following cell.

import glow
import json

input_vcf = "/databricks-datasets/hail/data-001/1kg_sample.vcf.bgz"
input_df = spark.read.format("vcf").load(input_vcf)
cmd = json.dumps([
  "/opt/vep/src/ensembl-vep/vep",
  "--dir_cache", "/mnt/dbnucleus/dbgenomics/grch37_merged_vep_96",
  "--fasta", "/mnt/dbnucleus/dbgenomics/grch37_merged_vep_96/data/human_g1k_v37.fa",
  "--assembly", "GRCh37",
  "--format", "vcf",
  "--output_file", "STDOUT",
  "--no_stats",
  "--cache",
  "--offline",
  "--vcf",
  "--merged"])
output_df = glow.transform("pipe", input_df, cmd=cmd, input_formatter='vcf', in_vcf_header=input_vcf, output_formatter='vcf')
output_df.write.format("delta").save("dbfs:/mnt/vep-pipe")