Beginning with version 5.2, Databricks Runtime HLS includes a variety of tools for reading, writing, and manipulating variant data.
You can use Spark to read VCF files just like any other file format that Spark supports through the DataFrame API.
df = spark.read.format("com.databricks.vcf").load(path)
The returned DataFrame has a schema that mirrors a single row of a VCF. The path that you provide can be the location of a single file, a directory that contains VCF files, or a Hadoop glob pattern that identifies a group of files. Sample IDs are not included by default. See the parameters table below for instructions on how to include them.
You can control the behavior of the VCF reader with a few parameters. All parameters are case insensitive.
| Parameter | Type | Default | Description |
|---|---|---|---|
| asADAMVariantContext | boolean | false | If true, rows are emitted using the VariantContext schema from the ADAM project. |
| includeSampleIds | boolean | false | If true, each genotype includes the sample ID it belongs to. This information can be useful, but it also increases the size of each row, both in memory and on storage. |
| splitToBiallelic | boolean | false | If true, multiallelic variants are split into two or more biallelic variants. |
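For example, the reader options above can be combined in a single read. This is a sketch that requires a Databricks Runtime HLS cluster with a live `spark` session; the input path is illustrative:

```python
# Read a directory of VCF files, attaching sample IDs to each genotype
# and splitting multiallelic variants into biallelic rows.
# The path below is a placeholder for your own data location.
df = (
    spark.read.format("com.databricks.vcf")
    .option("includeSampleIds", True)
    .option("splitToBiallelic", True)
    .load("/databricks-datasets/genomics/my-cohort/")
)
```

Because options are case insensitive, `includeSampleIds` and `includesampleids` are equivalent.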
After performing some transformations, you can use the DataFrame writer API to save a VCF file.
Each partition of the DataFrame is written to a separate VCF file. If you want the entire DataFrame in a single file, repartition the DataFrame before saving.
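A minimal write sketch, assuming the writer uses the same `com.databricks.vcf` format name as the reader (output path illustrative); `repartition(1)` collapses the DataFrame to one partition so the result is a single VCF file:

```python
# Coalesce to one partition so the output is a single VCF file,
# then save with the DataFrame writer API.
(
    df.repartition(1)
    .write.format("com.databricks.vcf")
    .save("/tmp/output.vcf")
)
```

Skip the `repartition(1)` call for large datasets; writing one file per partition preserves parallelism.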
Databricks Runtime HLS also provides the ability to read BGEN files, including those distributed by the UK Biobank project.
df = spark.read.format("com.databricks.bgen").load(path)
As with the VCF reader, the provided path can be a file, directory, or glob pattern. If
index files are located in the same directory as the data files, the reader uses the indexes to
more efficiently traverse the data files. Data files can be processed even if indexes do not exist.
The schema of the resulting DataFrame matches that of the VCF reader.
| Parameter | Type | Default | Description |
|---|---|---|---|
| useBgenIndex | boolean | true | If true, use index files (when present in the same directory as the data files) to traverse the data files more efficiently. |