Running a Workflow
How to configure and run a workflow, based on the Bioconductor-Workflow-DESeq2 workspace. The workflow starts with FASTQ files and transforms them, using salmon, into the inputs required for a Bioconductor DESeq2 differential expression analysis.
Learning Objectives
This week we'll configure and run a workflow, based on the Bioconductor-Workflow-DESeq2 workspace (access is via course registration or by email to mtmorgan.bioc@gmail.com). The workspace supports a complete bulk RNASeq differential expression analysis. The workflow starts with FASTQ files and transforms them, using salmon, into the inputs required for a Bioconductor DESeq2 differential expression analysis. Notebooks describe how the workspace was set up (so that it can be tailored to individual analyses) and also how the outputs of a successful workflow can be marshaled as inputs to an interactive DESeq2 analysis.
Key Resources
- Visit https://anvil.terra.bio to use the AnVIL platform.
- The salmon website and excellent DESeq2 vignette form the foundation for the workspace.
- The completed workflow runs in the cloud, but components are often developed locally using Cromwell. Workflows are written in the Workflow Description Language (WDL).
- Workflow tasks often make use of docker to create containers with the software necessary for the task.
Review
Previously...
- The course schedule contains links and videos of previous sessions.
Essential steps
- Login
- Workspaces
- Billing accounts
- (R-based) Jupyter notebooks or RStudio for interactive analysis
- Workflows for large-scale data processing
Cloud computing environment
- Runtime and persistent disk.
- Workspace DATA and buckets.
- AnVIL package for interaction with workspace components (a brief sketch follows this list).
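A minimal sketch of these interactions, using functions from the Bioconductor AnVIL package; the table and file names are illustrative only:

```
suppressPackageStartupMessages(library(AnVIL))

avworkspace()            # current workspace as "namespace/name"
avbucket()               # gs:// URI of the workspace bucket
avtable("participant")   # a DATA table, returned as a tibble

## copy a file from the workspace bucket to the runtime's persistent disk
## ("salmon.wdl" is an illustrative file name)
gsutil_cp(paste0(avbucket(), "/salmon.wdl"), "~/salmon.wdl")
```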
FAQ answers
- We upload workflows through GitHub / Dockstore, but also the Broad Methods Repository (YouTube); see also the WDL Puzzles workspace.
- Default name and namespace -- the runtime starts in a particular workspace, and so knows the default namespace and name. By default, I had

```
> avworkspace()
[1] "deeppilots-bioconductor-may3/Bioconductor-Workshop-PopUp-mtmorgan"
```
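If the default is not the workspace you want to operate on, the AnVIL package can (to the best of my knowledge) be pointed at another workspace explicitly; the namespace / name below are placeholders:

```
library(AnVIL)

## components of the default can be queried individually...
avworkspace_namespace()
avworkspace_name()

## ...or the session can be pointed at another workspace (placeholder values)
avworkspace("my-billing-project/My-Workspace-Name")
```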
- gsutil_cp() may fail with

```
CommandException: Downloading this composite object requires integrity checking with CRC32c, but your crcmod installation isn’t using...
```

This is a bug that will be fixed in the underlying image for the runtime.
Workshop Activities
Setup & Tour
Setup
- Log in to AnVIL using the email address you used to register for the course and navigate (via the 'hamburger' menu) to Workspaces.
- Clone the Bioconductor-Workflow-DESeq2 workspace
- Unique workspace name
- Billing project: deeppilots-bioconductor-may10
Workspace tour
- DATA
- participant, participant_set TABLES
- Files in the Google bucket contain only the notebooks
- NOTEBOOKS
- How to use all aspects of the workspace for bulk RNASeq differential expression analysis.
- WORKFLOW
Running a Workflow
salmon quantification
- Inputs
- Transcriptome FASTA file
- Per-sample paired-end FASTQ files
- Outputs
- Per-sample counts of reads aligned to known transcripts
- 'Ultra fast' aligner -- should take about 20 minutes for the largest FASTQ file
Launch
- SELECT DATA from participant_set
- Connect workflow INPUTS to columns in the participant table (FASTQ files), workspace bucket (transcriptome FASTA file), or direct entry (transcriptome name)
- Use default OUTPUTS
- SAVE
- RUN ANALYSIS
Monitor
Runtime
- In a new browser tab... some preliminary work is necessary until the runtime image is updated, expected in the next 2 months.
- Start a Jupyter Notebook interactive environment
- When the runtime is ready, launch an interactive shell to update packages
- An interactive shell is 'better' for updating packages because progress and errors are visible; the Jupyter notebook hides these.
- Start R, update installed packages, and install the current version of the AnVIL package
```
root@...> R

options(Ncpus = 2)                  # faster installation, even if runtime 'oversubscribed'
BiocManager::install(ask = FALSE)   # update installed packages
pkgs <- c("Bioconductor/AnVIL", "GenomicFeatures", "tximport", "DESeq2")
BiocManager::install(pkgs)          # latest AnVIL package
```
Workflow Components
WDL
- Available through Dockstore, from salmon.wdl on GitHub
task
- Logical collection of commands applied to a homogeneous input, e.g., 'align the FASTQ files of one sample'
runtime
- e.g., the docker image; could also specify memory, CPU, disk size, etc.
workflow
- Collection of tasks into an overall execution sequence
scatter
- Distribute a vector of operations (e.g., FASTQ file pairs from each sample) across compute nodes
Customizing analysis
- Make your FASTQ files and transcriptome FASTA files available in the cloud
- Update the participant and participant_set DATA TABLEs with relevant information about your experiment, including links to the FASTQ files of each sample. Use the AnVIL package to help accomplish this (see the sketch after this list).
- Run the workflow on the new data!
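A sketch of how the AnVIL package might help; the sample identifiers, column names, and file paths are invented for illustration, and the columns must match the layout of the existing participant table:

```
library(AnVIL)
library(tibble)

## copy local FASTQ files to the workspace bucket (illustrative paths)
gsutil_cp("~/fastq/sampleA_1.fastq.gz", paste0(avbucket(), "/fastq/"))
gsutil_cp("~/fastq/sampleA_2.fastq.gz", paste0(avbucket(), "/fastq/"))

## describe the sample; the *_id column identifies the entity / table
participant <- tibble(
    participant_id = "sampleA",
    fastq_1 = paste0(avbucket(), "/fastq/sampleA_1.fastq.gz"),
    fastq_2 = paste0(avbucket(), "/fastq/sampleA_2.fastq.gz")
)

## upload (or update) rows of the participant DATA table
avtable_import(participant)
```

A corresponding participant_set can, I believe, be created with avtable_import_set(); consult the AnVIL package documentation for the exact arguments.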
Developing workflows
- Local development
- Use Cromwell (a Java application) to run workflows locally.
- Develop individual tasks on small subsets of data.
- Transition to the cloud for large scale testing / full workflow.
- Dockstore / GitHub
Interactive Analysis
Setup
- Notebooks A, B, and C describe how the workspace was set up; review at your leisure. This material may be useful when running on your own data.
Notebook D_ManagingWorkflowOutput
- Extracts relevant files from the workflow output to the local disk (the sketch below illustrates the key steps)
- Open in 'EDIT' mode
- Enter each evaluation cell and press return
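In rough outline, the notebook's work amounts to something like the following; the submission directory is a placeholder, and the actual paths come from the workflow's job history:

```
library(AnVIL)

## workflow outputs are written under the workspace bucket
bucket <- avbucket()
gsutil_ls(bucket)        # browse for the submission directory of interest

## copy, e.g., the salmon quantification files to the local persistent disk
## ("<submission-id>" is a placeholder for the actual submission directory)
dir.create("~/salmon-output", showWarnings = FALSE)
gsutil_cp(paste0(bucket, "/<submission-id>/**/quant.sf"), "~/salmon-output/")
```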
Notebook E_DESeq2Analysis
- Creates an object suitable for use in the DESeq2 vignette, as sketched below.
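In outline, and assuming the per-sample quant.sf files and a transcript-to-gene map are already on the local disk (file names and the 'condition' column are illustrative), the object is built roughly like this:

```
library(tximport)
library(DESeq2)

## per-sample salmon output (illustrative paths) and sample metadata
files <- c(sampleA = "salmon-output/sampleA/quant.sf",
           sampleB = "salmon-output/sampleB/quant.sf",
           sampleC = "salmon-output/sampleC/quant.sf",
           sampleD = "salmon-output/sampleD/quant.sf")
coldata <- data.frame(
    condition = factor(c("control", "control", "treated", "treated")),
    row.names = names(files)
)

## tx2gene: two columns mapping transcript to gene identifiers, e.g.,
## derived from the transcriptome annotation with GenomicFeatures
tx2gene <- read.csv("tx2gene.csv")

## summarize transcript-level counts to genes, then build the DESeq2 object
txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
dds <- DESeqDataSetFromTximport(txi, colData = coldata, design = ~ condition)

## 'dds' is the starting point for the DESeq2 vignette, e.g.,
dds <- DESeq(dds)
results(dds)
```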
Summary
What You've Accomplished
Running a Workflow
- Bulk RNASeq analysis from FASTQ files to 'top table' for interactive analysis.
- Relationship between DATA TABLEs, workspace bucket, and workflow inputs and outputs.
- Launch and monitor workflow progress.
Workflow Components
- Brief introduction to WDL task and workflow components, scatter operation, and runtime environments.
- Steps to developing your own workflow using local execution on small example data.
Interactive Analysis
- Launch and use Jupyter notebooks.
- Workflow output retrieval from workspace bucket to local disk.
- Management of data for input to DESeq2.
Next Steps
- Follow the instructions at 'Set up billing with $300 Google credits to explore Terra' to enable billing for your own projects.
Frequently Asked Questions
- (From Liz) The article 'Understanding the Terra ecosystem and how your files live in it' discusses the differences between your Cloud Environment VM and the one created when you run workflows; it links to the article 'What happens when you launch a workflow?'. Both articles also discuss where your files are stored.
- (From a participant) "Got error at step E: Error in system(paste(which, shQuote(names[i])), intern = TRUE, ignore.stderr = TRUE): cannot popen ‘/usr/bin/which ‘gcloud’ 2>/dev/null’, probable reason ‘Cannot allocate memory’. Anyway to fix it? Thanks."
Superficially, this sounds like you ran out of memory on your runtime. A first step might be to restart the 'kernel' in the Jupyter notebook via one of the menu items; this means you'd have to re-do all computations in the notebook that the cell you're about to evaluate depends on. Another solution might be to restart the 'Cloud Environment' with more than the default, very modest, 3.75 GB of memory. But this seems like an unusual error, and it would be great to understand what perhaps unusual steps you took to get into this situation.