Create a Genomics Delta Lake Table


The Databricks Spark SQL VCF reader requires Databricks Runtime HLS, which is in Beta. Sign up for access.

Genomics data is usually stored in specialized flat-file formats such as VCF or BGEN.

The notebook below shows how to convert a VCF file into a Genomics Delta Lake table using Python and Databricks Runtime HLS. R, Scala, and SQL are also supported.
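
As a rough outline of the conversion itself, the following sketch reads a VCF file with the Spark SQL VCF reader and writes the result in Delta format. It is a minimal illustration and not the notebook's contents. The data source name `com.databricks.vcf` is assumed to be the one registered by Databricks Runtime HLS, the function name `vcf_to_delta` is hypothetical, and the paths in the usage comment are placeholders. The choice of `overwrite` mode is also an assumption made for repeatable runs.

```python
def vcf_to_delta(spark, vcf_path, delta_path):
    """Read a VCF file and persist it as a Delta Lake table.

    `spark` is the SparkSession that a Databricks notebook provides.
    The format name "com.databricks.vcf" is an assumption about the
    reader shipped with Databricks Runtime HLS.
    """
    # Each row of the resulting DataFrame corresponds to one VCF record.
    df = spark.read.format("com.databricks.vcf").load(vcf_path)

    # Write in Delta format; "overwrite" replaces any existing table
    # at delta_path (an illustrative choice, not a requirement).
    df.write.format("delta").mode("overwrite").save(delta_path)
    return df


# Hypothetical usage inside a notebook (placeholder paths):
# vcf_to_delta(spark, "/path/to/input.vcf", "/path/to/genomics_delta")
```

Once written, the table can be read back with `spark.read.format("delta").load(delta_path)`.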

Delta Lake tables support queries with latencies on the order of seconds, performant range joins (similar to the single-node bioinformatics tool bedtools intersect), aggregate analyses such as calculating summary statistics, and machine learning or deep learning.
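
To make the range-join use case concrete, the sketch below shows the overlap condition that such a join evaluates, written as plain single-node Python. It illustrates the semantics only and is not how Spark or bedtools executes the join. The `Interval` type, the function names, and the half-open coordinate convention (0-based start, exclusive end) are assumptions made for this example.

```python
from typing import List, NamedTuple, Tuple


class Interval(NamedTuple):
    contig: str
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # exclusive


def overlaps(a: Interval, b: Interval) -> bool:
    """True when both intervals lie on the same contig and share a base."""
    return a.contig == b.contig and a.start < b.end and b.start < a.end


def range_join(left: List[Interval],
               right: List[Interval]) -> List[Tuple[Interval, Interval]]:
    """Naive O(n * m) join: every (left, right) pair that overlaps."""
    return [(l, r) for l in left for r in right if overlaps(l, r)]


variants = [Interval("chr1", 100, 101), Interval("chr1", 500, 501)]
regions = [Interval("chr1", 50, 200)]
pairs = range_join(variants, regions)
```

A distributed engine avoids the nested loop by partitioning or binning the intervals first, which is what makes the join practical at scale.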


We recommend ingesting VCF files into Delta Lake tables once data volumes exceed 1,000 samples, 10 billion genotypes, or 1 terabyte.
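
The thresholds above can be expressed as a small check. This is a sketch under stated assumptions: the genotype count is estimated as samples times variant records (one genotype call per sample per record), one terabyte is taken as 10^12 bytes, and the function and constant names are hypothetical.

```python
SAMPLE_THRESHOLD = 1_000
GENOTYPE_THRESHOLD = 10_000_000_000  # 10 billion
SIZE_THRESHOLD_BYTES = 10**12        # 1 TB, assuming decimal units


def should_ingest_to_delta(n_samples: int, n_variants: int,
                           size_bytes: int) -> bool:
    """Return True if any of the recommended thresholds is exceeded."""
    # Assumption: one genotype call per sample per variant record.
    n_genotypes = n_samples * n_variants
    return (
        n_samples > SAMPLE_THRESHOLD
        or n_genotypes > GENOTYPE_THRESHOLD
        or size_bytes > SIZE_THRESHOLD_BYTES
    )


# Example: 500 samples x 30 million variants is 15 billion genotypes,
# which exceeds the genotype threshold even though the cohort is small.
large_by_genotypes = should_ingest_to_delta(500, 30_000_000, 0)
```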