The R / Bioconductor AnVIL Package
Martin Morgan, Nitesh Turaga
An exploration of how workspaces provide a framework for managing data and large-scale analyses, using the HCA Optimus Pipeline and 1000G-high-coverage-2019 workspaces as examples, and of how to work with them from R using the AnVIL package.
- Visit the course schedule for links to the recorded session, and to other workshops in the series.
- The material below requires a billing account. We provide a billing account during the workshop, but if you're following along on your own see 'Next Steps' for how to create a billing account.
- Access to the workspaces we use may require registration; please sign up with your AnVIL email address.
This week we'll explore how workspaces provide a framework for managing data and large-scale analyses. We use the HCA Optimus Pipeline and 1000G-high-coverage-2019 workspaces as examples, and access them from R using the AnVIL package.
- Visit https://anvil.terra.bio to use the AnVIL platform.
- We use week-2-demo.R to guide us through this workshop.
- We use the HCA Optimus Pipeline and 1000G-high-coverage-2019 workspaces as examples.
- Review the Introduction to the AnVIL package vignette.
- Notes and recorded session: Using R / Bioconductor in AnVIL
- Billing accounts
- Cloud environment -- (R-based) Jupyter notebooks or RStudio
Cloud Computing Environment
- Runtime and persistent disk
- A 'personal' cloud computing environment
- Not shared with others
- Persistent disk mounted at
- R / Jupyter:
- R / Jupyter:
- Startup script or custom docker file for 'sudo'-like access, and for complete reproducibility
- Log in to AnVIL using the email address you used to register for the course and navigate (via the HAMBURGER) to Workspaces.
- If you cloned the Bioconductor-Workshop-Popup workspace last week, delete it now.
- Clone the Bioconductor-Workshop-Popup workspace.
- Start an RStudio cloud environment.
- Launch the cloud environment.
- Copy the week-2-demo.R script into a file on your cloud environment.
In a new browser tab/window, navigate (via the HAMBURGER) to the HCA Optimus Pipeline workspace. This workspace demonstrates how scRNA-seq fastq files can be transformed to a 'count matrix' for interactive analysis.
Overall orientation: DATA TABLES serve as input to WORKFLOWS (scalable 'big data' computation).
Workflows transform big data using 'Workflow Description Language' scripts producing outputs (logs, results). For this workflow:
- Single-cell RNA seq analysis.
- Inputs are fastq files from individual samples.
- Scripts perform alignment, UMI processing, creation of a gene x cell (sample) 'count' matrix of expression values, etc.
- Primary output of interest is a 'loom' file summarizing the count matrix.
Workspace bucket / Files store workflow outputs (each workflow run has a unique identifier; logs and results are located under the identifier). Buckets also provide a location for storing and sharing interactive analysis results.
The AnVIL Package
    hca = "featured-workspaces-hca/HCA_Optimus_Pipeline"
    thousand_genomes = "anvil-datastorage/1000G-high-coverage-2019"

    library(AnVIL)
    avworkspace()     # current workspace
    avworkspace(hca)  # set to HCA workspace
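Workspace identifiers such as the two above have the form "namespace/name". As a minimal sketch of that structure, the helper below splits an identifier into its two parts using base R only. The function name split_workspace() is hypothetical and not part of the AnVIL package.

```r
## Hypothetical helper (NOT part of AnVIL): split a "namespace/name"
## workspace identifier into its two components.
split_workspace <- function(workspace) {
    parts <- strsplit(workspace, "/", fixed = TRUE)[[1]]
    ## a well-formed identifier has exactly one "/" separator
    stopifnot(length(parts) == 2L)
    list(namespace = parts[[1]], name = parts[[2]])
}

hca <- "featured-workspaces-hca/HCA_Optimus_Pipeline"
ws <- split_workspace(hca)
ws$namespace  # billing namespace that owns the workspace
ws$name       # workspace name within that namespace
```

Inside AnVIL, avworkspace() reports the same "namespace/name" string for the current workspace, so a helper like this is only useful for illustrating the format.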
DATA TABLE Access
    avtables()
    tbl = avtable("sample")
    tbl
    tbl %>% count(participant)
    ## tbl %>% avtable_import()

    avworkspace(thousand_genomes)
    avtables()
    participant = avtable("participant")
    participant
    participant %>% count(POPULATION, sort = TRUE)
    avtable("pedigree") %>%
        count(Population, Sex) %>%
        tidyr::pivot_wider(names_from = "Sex", values_from = "n")

    ## switch back to this workspace
    avworkspace(hca)
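The count() and pivot_wider() steps applied to the pedigree table can be tried without AnVIL access. The sketch below runs the same pipeline on a small made-up table. The populations, sex labels, and counts are invented for illustration and are not taken from the 1000G-high-coverage-2019 workspace. It assumes the dplyr and tidyr packages are installed.

```r
library(dplyr)
library(tidyr)

## Made-up stand-in for avtable("pedigree"); the values are illustrative only
pedigree <- tibble(
    Population = c("GBR", "GBR", "GBR", "YRI", "YRI", "YRI", "YRI"),
    Sex = c("female", "male", "male", "female", "female", "female", "male")
)

## One row per Population / Sex combination, with the number of records in 'n'
long <- pedigree %>% count(Population, Sex)

## One row per Population, one column per Sex, cells filled with the counts
wide <- long %>%
    pivot_wider(names_from = "Sex", values_from = "n")
wide
```

The 'long' form is convenient for further dplyr summaries, while the 'wide' form is easier to read as a cross-tabulation.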
    ## Copy files from google buckets to persistent disk
    tbl = avtable("sample_set")
    tbl
    dir.create("~/loom")
    gsutil_cp(tbl$loom_output_file, "~/loom/")  # see also gsutil_rsync()
    dir("~/loom")

    ## Workspace Bucket -- 'backup' or share persistent disk to workspace bucket
    avbucket()                                     # bucket associated with this workspace
    gsutil_ls(avbucket())
    avfiles_backup("~/scripts", recursive = TRUE)  # see also avfiles_restore()
    gsutil_ls(avbucket(), recursive = TRUE)
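When a table column holds many bucket paths, it can be useful to work out which files are not yet on the persistent disk before copying. The sketch below does this with base R. The gs:// paths are made up and stand in for a column such as tbl$loom_output_file, and the gsutil_cp() call is left commented out because it needs an AnVIL cloud environment.

```r
## Made-up bucket paths standing in for tbl$loom_output_file
uris <- c(
    "gs://example-bucket/run-1/a.loom",
    "gs://example-bucket/run-2/b.loom"
)

## Local destination; tempdir() is used here so the sketch runs anywhere,
## inside AnVIL this would be a directory on the persistent disk
local_dir <- file.path(tempdir(), "loom")
dir.create(local_dir, showWarnings = FALSE)

## Map each bucket path to the local file it would be copied to
local_paths <- file.path(local_dir, basename(uris))

## Keep only the bucket paths whose local copy does not exist yet
needed <- uris[!file.exists(local_paths)]
needed

## Inside AnVIL, the missing files could then be copied with:
## gsutil_cp(needed, local_dir)
```

This matches files by base name only, so it assumes the bucket paths do not share file names across directories.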
Fast Binary Package Installation
    ## do NOT update out-of-date packages yet
    BiocManager::install("Bioconductor/AnVIL")
    ## RESTART R

    AnVIL::repositories()  # binary Bioconductor and CRAN package installation

    ## install and use LoomExperiment
    AnVIL::install("LoomExperiment")  # about 40 seconds, rather than 10's of minutes
    sce = LoomExperiment::import("~/loom/pbmc_human_v3.loom")
Access AnVIL from Outside AnVIL
- Requires gcloud SDK installed on your computer.
- Use the SDK to register your Gmail account and Google billing project.
Access the AnVIL 'API'
    leo = Leonardo()
    leo
    leo$listDisks()

    terra = Terra()
    tags(terra, "Workspaces")
    wkspc = terra$listWorkspaces() %>%
        flatten() %>%
        select(-starts_with("workspace.attributes"))
    wkspc
What You've Accomplished
- Clone a workspace, launch an RStudio cloud environment
- Navigate between workspaces
- Elements of workflow structure -- DATA TABLE inputs, scripts, File outputs
- Selecting workspaces
- Managing DATA TABLEs
- Moving data to and from google buckets
- Fast binary package installation (in the 'devel' version of the package)
- Advanced features, e.g., local use, API access
- To enable billing for your own projects, follow the instructions at 'Set up billing with $300 Google credits to explore Terra'.
Frequently Asked Questions
- Uploading workflows -- through GitHub / Dockstore, but also the Broad Methods Repository (YouTube); see also the WDL Puzzles workspace.
- Default name and namespace -- the runtime starts in a particular workspace, and the runtime knows the default namespace and name. So by default, I had
    > avworkspace()
    "deeppilots-bioconductor-may3/Bioconductor-Workshop-PopUp-mtmorgan"
- gsutil_cp() fails with 'CommandException: Downloading this composite object requires integrity checking with CRC32c, but your crcmod installation isn't using...' -- this is a bug that should be fixed in the underlying image for the runtime.