Beginning with version 5.2, Databricks Runtime HLS includes a variety of tools for reading, writing, and manipulating variant data.
This topic uses the terms “variant” and “variant data” to refer to single nucleotide variants and short indels.
You can use Spark to read VCF files just like any other file format that Spark supports through the DataFrame API.
df = spark.read.format("com.databricks.vcf").load(path)
The returned DataFrame has a schema that mirrors a single row of a VCF. Information that applies to an entire variant (SNV or indel), like the contig name, start and end positions, and INFO attributes, is contained in columns of the DataFrame. The genotypes, which correspond to the GT FORMAT fields in a VCF, are contained in an array with one entry per sample. Each entry is a struct with fields that are described in the VCF header.
The path that you provide can be the location of a single file, a directory that contains VCF files, or a Hadoop glob pattern that identifies a group of files. Sample IDs are not included by default. See the parameters table below for instructions on how to include them.
You can control the behavior of the VCF reader with the following parameters. All parameter names are case-insensitive.
|asADAMVariantContext||boolean||false||If true, rows are emitted in the VariantContext schema from the ADAM project.|
|includeSampleIds||boolean||false||If true, each genotype includes the ID of the sample it belongs to. Including sample IDs increases the size of each row, both in memory and on storage.|
|splitToBiallelic||boolean||false||If true, multiallelic variants are split into two or more biallelic variants.|
|flattenInfoFields||boolean||false||If true, each INFO field in the input VCF is converted into a column in the output DataFrame, typed as specified in the VCF header. If false, all INFO fields are contained in a single column that holds a string -> string map of INFO keys to values.|
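As a minimal sketch of how these parameters can be passed, the hypothetical helpers below collect them in a dictionary and hand them to the reader through the standard `DataFrameReader.options` call. The helper names `vcf_reader_options` and `read_vcf` and their keyword arguments are illustrative assumptions; only the format name and the option names come from this topic.

```python
def vcf_reader_options(include_sample_ids=False,
                       split_to_biallelic=False,
                       flatten_info_fields=False):
    """Build the option dictionary for the VCF reader (hypothetical helper).

    The defaults mirror the defaults listed in the parameters table.
    """
    return {
        "includeSampleIds": include_sample_ids,
        "splitToBiallelic": split_to_biallelic,
        "flattenInfoFields": flatten_info_fields,
    }


def read_vcf(spark, path, **kwargs):
    """Read VCF data from a single file, a directory, or a glob pattern.

    `spark` is an existing SparkSession; `kwargs` are forwarded to
    vcf_reader_options above.
    """
    options = vcf_reader_options(**kwargs)
    return spark.read.format("com.databricks.vcf").options(**options).load(path)
```

For example, `read_vcf(spark, "/data/cohort/*.vcf", include_sample_ids=True)` would read every file that matches the glob and attach a sample ID to each genotype. The path here is a placeholder.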
You can use the DataFrameWriter API to save a VCF file, which you can then read with other tools.
Each partition of the DataFrame is written to a separate VCF file. If you want the entire DataFrame in a single file, repartition the DataFrame to a single partition before saving.
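A short sketch of the single-file case follows. It assumes that the writer accepts the same `com.databricks.vcf` format name as the reader, which this topic does not state explicitly. The function name `write_single_vcf` and any path passed to it are placeholders.

```python
def write_single_vcf(df, path):
    """Write `df` as one VCF file by collapsing it to a single partition.

    Assumption: the writer uses the same format name as the reader.
    All rows pass through a single task, so this suits small outputs only.
    """
    single_partition_df = df.repartition(1)
    single_partition_df.write.format("com.databricks.vcf").save(path)
```

For large variant sets, keeping multiple partitions and accepting multiple output files is likely the better trade-off, because a single partition removes all write parallelism.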
To control the behavior of the VCF writer, you can provide the following option:
|compression||string||n/a||A compression codec to use for the output VCF file. The value can be the full name of a compression codec class (e.g.,
Databricks Runtime HLS also provides the ability to read BGEN files, including those distributed by the UK Biobank project.
df = spark.read.format("com.databricks.bgen").load(path)
As with the VCF reader, the provided path can be a file, directory, or glob pattern. If
index files are located in the same directory as the data files, the reader uses the indexes to
more efficiently traverse the data files. Data files can be processed even if indexes do not exist.
The schema of the resulting DataFrame matches that of the VCF reader.
|useBgenIndex||boolean||true||If true, use
You can calculate quality control statistics on your variant data using Spark SQL functions built into the HLS runtime.
||A struct with two elements: the expected heterozygous frequency according to Hardy-Weinberg equilibrium and the associated p-value.|
A struct containing the following summary stats:
||A struct containing the min, max, mean, and sample standard deviation for genotype depth (DP in VCF v4.2 specification) across all samples|
||A struct containing the min, max, mean, and sample standard deviation for genotype quality (GQ in VCF v4.2 specification) across all samples|
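To make the statistics described above concrete, the plain-Python sketch below computes two of them for a single variant, outside of Spark. It is not the runtime's implementation, and the function names and dictionary keys are hypothetical. It assumes a biallelic variant, uses the textbook Hardy-Weinberg expectation 2pq for the heterozygous frequency, and omits the p-value.

```python
import statistics


def expected_het_frequency(n_hom_ref, n_het, n_hom_alt):
    """Expected heterozygous frequency 2pq under Hardy-Weinberg equilibrium.

    Arguments are genotype counts for a biallelic variant.
    """
    n_samples = n_hom_ref + n_het + n_hom_alt
    if n_samples == 0:
        raise ValueError("at least one called genotype is required")
    # Reference allele frequency: each sample carries two alleles.
    p = (2 * n_hom_ref + n_het) / (2 * n_samples)
    return 2 * p * (1 - p)


def summary_stats(values):
    """Min, max, mean, and sample standard deviation (n - 1 denominator)."""
    if len(values) < 2:
        raise ValueError("sample standard deviation needs at least two values")
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "stdDev": statistics.stdev(values),
    }
```

With 50 homozygous-reference, 40 heterozygous, and 10 homozygous-alternate samples, the reference allele frequency is 0.7 and the expected heterozygous frequency is about 0.42. Calling `summary_stats` on per-sample DP or GQ values yields the same four quantities that the depth and quality structs describe.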