Working with Variant Data

Beginning with version 5.2, Databricks Runtime HLS includes a variety of tools for reading, writing, and manipulating variant data.

VCF

You can use Spark to read VCF files just like any other file format that Spark supports through the DataFrame API.

df = spark.read.format("com.databricks.vcf").load(path)

The returned DataFrame has a schema that mirrors a single row of a VCF. The path that you provide can be the location of a single file, a directory that contains VCF files, or a Hadoop glob pattern that identifies a group of files. Sample IDs are not included by default. See the parameters table below for instructions on how to include them.
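As a minimal sketch of the three path forms described above, the helper below wraps the reader call. The helper name `load_vcf` and every path in `EXAMPLE_PATHS` are hypothetical placeholders, and `spark` is assumed to be the active SparkSession that a Databricks notebook provides.

```python
# Hypothetical locations; substitute paths that exist in your workspace.
EXAMPLE_PATHS = {
    "single file": "/data/genomics/sample.vcf",
    "directory": "/data/genomics/vcfs/",
    "glob pattern": "/data/genomics/vcfs/chr*.vcf",
}


def load_vcf(spark, path):
    """Load VCF data from a file, a directory, or a Hadoop glob pattern.

    `spark` is assumed to be an active SparkSession. The same call is used
    for all three path forms; only the path string changes.
    """
    return spark.read.format("com.databricks.vcf").load(path)
```

In a notebook, this could be called as `df = load_vcf(spark, EXAMPLE_PATHS["glob pattern"])`.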

You can control the behavior of the VCF reader with a few parameters. All parameter names are case-insensitive.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| asADAMVariantContext | boolean | false | If true, rows are emitted in the VariantContext schema from the ADAM project. |
| includeSampleIds | boolean | false | If true, each genotype includes the ID of the sample it belongs to. This information can be useful, but it also increases the size of each row, both in memory and on storage. |
| splitToBiallelic | boolean | false | If true, multiallelic variants are split into two or more biallelic variants. |
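The parameters above are passed to the reader as ordinary Spark reader options. The following is a sketch only: `read_vcf` is a hypothetical helper, and `spark` is assumed to be an active SparkSession such as the one a Databricks notebook provides. The option names are the ones listed in the table.

```python
def read_vcf(spark, path, include_sample_ids=False, split_to_biallelic=False):
    """Read VCF data, setting two of the reader parameters explicitly.

    `spark` is assumed to be an active SparkSession. The keyword defaults
    mirror the defaults listed in the parameters table.
    """
    reader = spark.read.format("com.databricks.vcf")
    # Attach the sample ID to every genotype (larger rows when enabled).
    reader = reader.option("includeSampleIds", include_sample_ids)
    # Split multiallelic variants into biallelic variants when enabled.
    reader = reader.option("splitToBiallelic", split_to_biallelic)
    return reader.load(path)
```

For example, `df = read_vcf(spark, path, include_sample_ids=True)` would request sample IDs while leaving multiallelic variants unsplit.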

After performing some transformations, you can use the DataFrame writer API to save a VCF file.

df.write.format("com.databricks.vcf").save(path)

Each partition of the DataFrame is written to a separate VCF file. If you want the entire DataFrame in a single file, repartition the DataFrame to a single partition before saving.

df.repartition(1).write.format("com.databricks.vcf").save(path)

BGEN

Databricks Runtime HLS also provides the ability to read BGEN files, including those distributed by the UK Biobank project.

df = spark.read.format("com.databricks.bgen").load(path)

As with the VCF reader, the provided path can be a file, directory, or glob pattern. If .bgi index files are located in the same directory as the data files, the reader uses the indexes to more efficiently traverse the data files. Data files can be processed even if indexes do not exist. The schema of the resulting DataFrame matches that of the VCF reader.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| useBgenIndex | boolean | true | If true, use .bgi index files. |
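As with the VCF reader, this parameter can be set through a Spark reader option. The sketch below assumes `spark` is an active SparkSession; `read_bgen` is a hypothetical helper name, not part of the runtime.

```python
def read_bgen(spark, path, use_index=True):
    """Read BGEN data, optionally ignoring any .bgi index files.

    `spark` is assumed to be an active SparkSession. The default for
    `use_index` mirrors the documented default of `useBgenIndex`.
    """
    reader = spark.read.format("com.databricks.bgen")
    # Set to False to process the data files without consulting indexes.
    reader = reader.option("useBgenIndex", use_index)
    return reader.load(path)
```

Calling `read_bgen(spark, path, use_index=False)` would read the data files directly, even if .bgi files are present in the same directory.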
