Sample Quality Control

You can calculate quality control statistics on your sample data using Apache Spark SQL functions, which can be expressed in Python, R, Scala, or SQL.

Each of these functions returns a map from sample ID to a struct containing metrics for that sample. The functions assume that the same samples appear in the same order in each row.
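The ordering assumption matters because per-sample statistics are gathered by position across rows. The following is a minimal pure-Python sketch of that idea, not the library's implementation: the helper name per_sample_depth_stats and the input layout (one list of depth values per variant row) are hypothetical, and the use of the sample standard deviation is an assumption.

```python
import statistics


def per_sample_depth_stats(rows):
    """Summarize depth per sample across variant rows.

    rows: one list per variant; position i in every list must refer to
    the same sample (hypothetical layout used only for illustration).
    """
    n_samples = len(rows[0])
    if any(len(row) != n_samples for row in rows):
        raise ValueError("every row must contain the same samples in the same order")

    stats = []
    # zip(*rows) regroups the values so that each tuple holds one sample's
    # depths across all variants.
    for depths in zip(*rows):
        present = [d for d in depths if d is not None]  # skip missing depths
        if not present:
            stats.append({"min": None, "max": None, "mean": None, "stddev": None})
            continue
        stats.append({
            "min": min(present),
            "max": max(present),
            "mean": statistics.mean(present),
            # Assumption: sample standard deviation; undefined for one value.
            "stddev": statistics.stdev(present) if len(present) > 1 else None,
        })
    return stats


# Three variants, two samples; the second sample has no depth at variant 2.
rows = [[10, 30], [20, None], [30, 50]]
stats = per_sample_depth_stats(rows)
```

If the samples were shuffled between rows, the values in each position would mix different samples and the resulting statistics would be meaningless, which is why the check on row length alone is necessary but not sufficient.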

Function: sample_call_summary_stats
Arguments: referenceAllele string, alternateAlleles array of strings, genotypes array with a calls field
Returns: A struct containing the following summary stats:

  • callRate: The fraction of variants where this sample has a called genotype. Equivalent to nCalled / (nCalled + nUncalled).
  • nCalled: The number of variants where this sample has a called genotype.
  • nUncalled: The number of variants where this sample does not have a called genotype.
  • nHomRef: The number of variants where this sample is homozygous reference.
  • nHet: The number of variants where this sample is heterozygous.
  • nHomVar: The number of variants where this sample is homozygous non-reference.
  • nSnv: The number of calls where this sample has a single nucleotide variant. This value is the sum of nTransition and nTransversion.
  • nInsertion: Insertion variant count.
  • nDeletion: Deletion variant count.
  • nTransition: Transition count.
  • nTransversion: Transversion count.
  • nSpanningDeletion: The number of calls where this sample has a spanning deletion.
  • rTiTv: Ratio of transitions to transversions (nTransition / nTransversion).
  • rInsertionDeletion: Ratio of insertions to deletions (nInsertion / nDeletion).
  • rHetHomVar: Ratio of heterozygous to homozygous variant calls (nHet / nHomVar).

Function: sample_dp_summary_stats
Arguments: genotypes array with a depth field
Returns: A struct with min, max, mean, and stddev.

Function: sample_gq_summary_stats
Arguments: genotypes array with a conditionalQuality field
Returns: A struct with min, max, mean, and stddev.
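The derived fields of the call summary (callRate, nSnv, and the three ratios) follow directly from the raw counts. The sketch below recomputes them in plain Python to make the formulas concrete. The helper names safe_ratio and derive_call_metrics are hypothetical, the counts are made-up illustration values, and returning None for a zero denominator is an assumption; the library may represent that case differently.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero.

    Assumption: None marks an undefined ratio in this sketch only.
    """
    return numerator / denominator if denominator else None


def derive_call_metrics(counts):
    """Compute the derived call summary fields from raw per-sample counts."""
    return {
        "callRate": safe_ratio(counts["nCalled"],
                               counts["nCalled"] + counts["nUncalled"]),
        # nSnv is defined as the sum of transitions and transversions.
        "nSnv": counts["nTransition"] + counts["nTransversion"],
        "rTiTv": safe_ratio(counts["nTransition"], counts["nTransversion"]),
        "rInsertionDeletion": safe_ratio(counts["nInsertion"], counts["nDeletion"]),
        "rHetHomVar": safe_ratio(counts["nHet"], counts["nHomVar"]),
    }


# Made-up counts for a single sample, for illustration only.
counts = {
    "nCalled": 95, "nUncalled": 5,
    "nHet": 30, "nHomVar": 15,
    "nTransition": 40, "nTransversion": 20,
    "nInsertion": 6, "nDeletion": 3,
}
metrics = derive_call_metrics(counts)
```

Recomputing these values from the counts can serve as a sanity check on the struct returned for a sample, for example when choosing callRate or rTiTv thresholds for filtering.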

Sample quality control notebook