Variant Annotation using Pipe Transformer


Databricks Runtime for Genomics is deprecated. Databricks is no longer building new Databricks Runtime for Genomics releases and will remove support for Databricks Runtime for Genomics on September 24, 2022, when Databricks Runtime for Genomics 7.3 LTS support ends. At that point Databricks Runtime for Genomics will no longer be available for selection when you create a cluster. For more information about the Databricks Runtime deprecation policy and schedule, see Supported Databricks runtime releases and support schedule. Bioinformatics libraries that were part of the runtime have been released as Docker Containers, which you can find on the ProjectGlow Dockerhub page.

Any annotation method can be used on variant data using Glow’s Pipe Transformer.

For example, VEP annotation is performed by downloading annotation data sources (the cache) to each node in a cluster and calling the VEP command line script with the Pipe Transformer using a script similar to the following cell.

import glow
import json

input_vcf = "/databricks-datasets/hail/data-001/1kg_sample.vcf.bgz"
input_df ="vcf").load(input_vcf)
cmd = json.dumps([
  "--dir_cache", "/mnt/dbnucleus/dbgenomics/grch37_merged_vep_96",
  "--fasta", "/mnt/dbnucleus/dbgenomics/grch37_merged_vep_96/data/human_g1k_v37.fa",
  "--assembly", "GRCh37",
  "--format", "vcf",
  "--output_file", "STDOUT",
output_df = glow.transform("pipe", input_df, cmd=cmd, input_formatter='vcf', in_vcf_header=input_vcf, output_formatter='vcf')