The pipeline is run as a Databricks job. Most likely, a Databricks solutions architect will set up the initial job for you. The necessary details are:
- The task should be the RNASeq notebook provided at the bottom of this page.
- For best performance, use compute-optimized instances with at least 60 GB of memory. We recommend
- To reduce costs, use all spot workers with the Spot fall back to On-demand option selected.
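The cluster settings above can be sketched as a job cluster specification. This is a minimal illustration that assumes an AWS workspace and the `aws_attributes` field names from the Databricks Clusters API; the instance type `c5.9xlarge`, the worker count, and the runtime version placeholder are hypothetical values chosen for the example, not the documented recommendation.

```python
import json


def build_cluster_spec(node_type_id, num_workers, spark_version):
    """Build a job cluster spec (AWS field names assumed) with spot workers."""
    return {
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        "num_workers": num_workers,
        "aws_attributes": {
            # Only the first node (the driver) is on-demand; workers are spot.
            "first_on_demand": 1,
            # Fall back to on-demand capacity when spot instances are unavailable.
            "availability": "SPOT_WITH_FALLBACK",
        },
    }


# Hypothetical values: substitute the instance type and runtime you actually use.
spec = build_cluster_spec("c5.9xlarge", 8, "<genomics-runtime-version>")
print(json.dumps(spec, indent=2))
```

On other clouds the attribute block has a different name and fields, so treat the dictionary keys as assumptions to verify against your workspace.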
The pipeline accepts a number of parameters that control its behavior. The most important and commonly changed parameters are documented here; the rest can be found in the RNASeq notebook. All parameters can be set for all runs or per-run.
|Parameter|Default|Description|
|---|---|---|
|manifest|n/a|The path of the manifest file describing the input.|
|output|n/a|The path where pipeline output should be written.|
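As an illustration of setting parameters per run, the sketch below builds a request body in the shape accepted by the Databricks Jobs API `run-now` endpoint, which passes notebook parameters through `notebook_params`. The job ID and both paths are hypothetical, and the helper `build_run_payload` is not part of the pipeline.

```python
import json


def build_run_payload(job_id, manifest, output, extra_params=None):
    """Build a run-now request body carrying the two required parameters."""
    if not manifest or not output:
        raise ValueError("both 'manifest' and 'output' must be provided")
    params = {"manifest": manifest, "output": output}
    # Any other notebook parameters can be overridden for this run only.
    params.update(extra_params or {})
    return {"job_id": job_id, "notebook_params": params}


# Hypothetical job ID and paths, for illustration only.
payload = build_run_payload(
    123,
    "dbfs:/tmp/rnaseq/manifest.csv",
    "dbfs:/tmp/rnaseq/output",
)
print(json.dumps(payload, indent=2))
```

Parameters that should apply to every run are better set once in the job definition, leaving only the run-specific ones in the payload.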
In addition, you must configure the reference genome using environment variables. To use GRCh37, set the environment variable:
To use GRCh38 instead, set an environment variable like this:
The pipeline consists of two steps:
- Alignment: Map each short read to the reference genome using the STAR aligner.
- Quantification: Count how many reads correspond to each reference transcript.
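To make the quantification step concrete, here is a toy sketch of counting reads per transcript from a list of alignments. It is not the pipeline's implementation: real quantifiers also handle multi-mapped reads, strandedness, and transcript length, and the read and transcript identifiers below are made up.

```python
from collections import Counter


def count_reads_per_transcript(alignments):
    """Count aligned reads per transcript; unaligned reads (None) are skipped."""
    counts = Counter()
    for _read_id, transcript_id in alignments:
        if transcript_id is not None:
            counts[transcript_id] += 1
    return counts


# Each tuple is (read ID, transcript the read aligned to, or None if unmapped).
alignments = [
    ("r1", "tx_A"),
    ("r2", "tx_A"),
    ("r3", "tx_B"),
    ("r4", None),
]
counts = count_reads_per_transcript(alignments)
print(dict(counts))  # {'tx_A': 2, 'tx_B': 1}
```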