The following library versions are packaged in Databricks Runtime 7.0 for Genomics. For libraries included in lower versions of Databricks Runtime for Genomics, see the release notes.
The pipeline is run as a Databricks job. Most likely, a Databricks solutions architect will set up the initial job for you. The necessary details are:
- The task should be the RNASeq notebook provided at the bottom of this page.
- For best performance, use compute-optimized instances with at least 60 GB of memory. We recommend c5.9xlarge.
- To reduce costs, use all spot workers with the Spot fall back to On-demand option selected.
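As a rough illustration of the settings listed above, the sketch below assembles a job definition as a Python dictionary in the shape used by the Databricks Jobs API (`new_cluster`, `aws_attributes`, `notebook_task`). The notebook path, runtime version string, worker count, and job name are placeholders, not values from this page. Setting `first_on_demand` to 1 is an assumption about how "all spot workers" maps to the API: it keeps the driver on-demand and leaves every worker on spot.

```python
import json


def build_rnaseq_job(notebook_path, spark_version, num_workers):
    """Return a Jobs API style job definition for the RNASeq notebook.

    All arguments are supplied by the caller; nothing here is looked up
    from a workspace.
    """
    return {
        "name": "rnaseq-pipeline",  # hypothetical job name
        "new_cluster": {
            "spark_version": spark_version,  # a Genomics runtime version string
            "node_type_id": "c5.9xlarge",    # recommended compute-optimized type
            "num_workers": num_workers,
            "aws_attributes": {
                # Assumption: 1 keeps only the driver on-demand,
                # so all workers run on spot instances.
                "first_on_demand": 1,
                # Spot instances that fall back to on-demand when unavailable.
                "availability": "SPOT_WITH_FALLBACK",
            },
        },
        "notebook_task": {"notebook_path": notebook_path},
    }


if __name__ == "__main__":
    job = build_rnaseq_job(
        notebook_path="/Shared/genomics/RNASeq",       # placeholder path
        spark_version="<genomics-runtime-version>",    # placeholder, fill in
        num_workers=4,                                  # placeholder size
    )
    print(json.dumps(job, indent=2))
```

The resulting dictionary can be serialized to JSON and submitted through whichever Jobs API client or CLI you already use.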
The pipeline accepts a number of parameters that control its behavior. The most important and commonly changed parameters are documented here; the rest can be found in the RNASeq notebook. All parameters can be set for all runs or per-run.
| Parameter | Default | Description |
| --- | --- | --- |
| manifest | n/a | The manifest describing the input. |
| output | n/a | The path where pipeline output should be written. |
| perSampleTimeout | 12h | A timeout applied per sample. After reaching this timeout, the pipeline continues on to the next sample. The value of this parameter must include a timeout unit: ‘s’ for seconds, ‘m’ for minutes, or ‘h’ for hours. For example, ‘60m’ will result in a timeout of 60 minutes. |
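To make the `perSampleTimeout` format concrete, here is a minimal sketch that validates a value and converts it to seconds before a run is submitted. It is not the pipeline's own parser: the helper name `parse_timeout` is hypothetical, and accepting only whole numbers is an assumption.

```python
import re

# Seconds per unit for the three documented suffixes.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_timeout(value):
    """Convert a string such as '60m' or '12h' to a number of seconds.

    Raises ValueError when the unit suffix is missing or unknown.
    Assumption: only whole-number amounts are accepted.
    """
    match = re.fullmatch(r"(\d+)([smh])", value.strip())
    if match is None:
        raise ValueError(
            f"invalid timeout {value!r}: expected a number followed by s, m, or h"
        )
    amount, unit = match.groups()
    return int(amount) * _UNIT_SECONDS[unit]


if __name__ == "__main__":
    print(parse_timeout("60m"))  # 3600
    print(parse_timeout("12h"))  # 43200
```

Checking the value up front catches a missing unit (for example `60`) before the job starts rather than partway through a run.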
In addition, you must configure the reference genome using environment variables. To use GRCh37, set the environment variable:
To use GRCh38 instead, set an environment variable like this:
The pipeline consists of two steps:
- Alignment: Map each short read to the reference genome using the STAR aligner.
- Quantification: Count how many reads correspond to each reference transcript.
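The quantification step can be pictured with a toy example. The sketch below is not how the pipeline is implemented; it only illustrates the idea of counting aligned reads per reference transcript. The input shape (pairs of read ID and transcript ID) and the identifiers are made up for illustration, and real quantification tools also have to handle cases such as reads that map to several transcripts.

```python
from collections import Counter


def count_reads_per_transcript(alignments):
    """Count reads assigned to each transcript.

    alignments: iterable of (read_id, transcript_id) pairs, where
    transcript_id is None for a read that did not align.
    """
    counts = Counter()
    for _read_id, transcript_id in alignments:
        if transcript_id is not None:  # skip unaligned reads
            counts[transcript_id] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical alignment results for four short reads.
    toy_alignments = [
        ("read1", "txA"),
        ("read2", "txB"),
        ("read3", "txA"),
        ("read4", None),
    ]
    for transcript, n in sorted(count_reads_per_transcript(toy_alignments).items()):
        print(transcript, n)
```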
The operational aspects of the RNASeq pipeline are very similar to the DNASeq pipeline. For more information about manifest format, output structure, programmatic usage, and common issues, see DNASeq pipeline.