AnVIL Portal

Getting Started with AnVIL

The AnVIL platform is an NHGRI-supported data commons running on the Google Cloud Platform (GCP). AnVIL enables researchers to analyze high-value open and controlled-access genomic datasets with popular analysis tools in a secure cloud computing environment.

AnVIL uses Terra as its analysis platform, AnVIL Data Explorer for data search and artificial cohort creation, and Dockstore as a repository for Docker-based genomic analysis tools and workflows.

In addition to Docker-based analysis workflows, AnVIL supports popular interactive analysis tools such as Jupyter notebooks, Bioconductor, RStudio, and Galaxy.

By operating in the cloud, AnVIL users can scale analyses from a single computer to thousands and securely share data, workflows, and reproducible results with collaborators and colleagues.

About AnVIL’s Documentation

AnVIL’s training materials curate and augment existing component and tool documentation and show how to use AnVIL’s parts together to accomplish the goals of AnVIL’s different user personas.

To complement this onboarding and introductory section, the AnVIL team is developing persona-specific guides and tutorials. For example, see the guides for data analysts, investigators, developers, instructors, and data contributors.

New User Onboarding

The following is a guided walk-through of the AnVIL / Terra documentation with a focus on onboarding and preparing new users to run genomic analyses in the cloud.

This section covers:

  1. Setting up and linking user accounts.
  2. Obtaining access to AnVIL data.
  3. An overview of Terra workspaces.
  4. An overview of cloud compute costs and setting up billing.

Setting Up and Linking User Accounts

All you need is a Google account to register with Terra and browse AnVIL’s publicly accessible workspaces.

Likewise, with a Google account, you can register with the AnVIL Data Explorer and browse publicly accessible datasets or register with Dockstore and browse tools and workflows.

To send artificial cohorts to Terra for analysis, you must use the same Google ID for your Data Explorer and Terra accounts.

To allow your dbGaP data request approvals to flow through to Terra and the Data Explorer, you will need to link your eRA Commons ID with both platforms.

Obtaining Access to AnVIL Data

AnVIL holds genomic data for hundreds of thousands of study participants. Much of this data is controlled access.

To obtain access to controlled-access data sets, you must either be a member of a data-generating consortium with a data-sharing agreement among consortium members or have been granted access to a study through the dbGaP Data Access Request process.

Once you have been granted access, and assuming you have linked your eRA Commons ID with Terra, you will be able to see your new studies in the Data Explorer and new data-oriented workspaces in Terra.

AnVIL’s open-access datasets, such as 1000 Genomes High Coverage 2019, can be accessed in Terra or the Data Explorer immediately after account creation.

For a detailed listing of available datasets searchable by disease, data type, consent type, and consortia, see AnVIL’s Dataset Catalog.

Analyzing Data in Terra Workspaces

In Terra, you use workspaces to configure and run analyses and share results. Terra workspaces typically hold genomic data and subject-level phenotypic and sample processing data and are configured with analysis tools such as notebooks and Docker images. Workspaces can also save the output generated by running an analysis with a workspace’s associated “cloud environment.”

Terra workspaces support interactive analysis with Jupyter Notebooks, Bioconductor, and Galaxy. Terra workspaces can also run Docker-containerized workflows written in WDL.

In general, to perform an analysis in a workspace, you set up the data and workflows you require and then launch a cloud environment to execute the analysis over the data and write out results to the workspace storage bucket.

You may start with a blank workspace, but typically, you will start by cloning a workspace containing the data or the analysis you require.

Workspace Types

There are several types of workspaces to consider when thinking about cloning a workspace to start your project.

Data-Oriented Workspaces - These workspaces hold data for AnVIL open or controlled-access data sets or cohorts exported from the Data Explorer. They may contain documentation in the dashboard about the study that generated the data set and data tables holding sample and subject phenotypic metadata with links to the genomic data files.

Analysis-Oriented Workspaces - Analysis-oriented workspaces showcase a specific analysis or tool such as Hail or Bioconductor.

Example Workspaces - The example workspaces, also referred to as “Featured” workspaces, are educational tutorial workspaces demonstrating collections of best practices in analysis and reproducible science. For an example, see Reproducing the paper: Variant analysis of Tetralogy of Fallot.

Workspace Composition

A Terra workspace consists of the following:

  1. A Dashboard - for holding markdown documentation about the workspace.
  2. A Cloud Storage Bucket - for holding data files, notebooks, and analysis output. Typically, this bucket is configured as “requester pays,” meaning that users downloading from the bucket pay cloud egress fees.
  3. Data Tables - for holding participant or sequencing metadata. For example, it is common to have a set of “Participant” tables and a set of “Sample” tables. Participant tables hold one row per participant with phenotypic data, e.g., gender, age, and relevant diseases. Sample tables hold one row per sample, typically with information about the sample sequencing process and metadata. Sample tables also commonly link to the genomic data derived from the sample.
  4. Reference Data - for holding links to a reference genome, such as hg38, or other reference data.
  5. Workspace Data - for holding additional key-value data pairs used for configuring the workspace.
  6. Cloud Environments - for executing the workspace’s interactive analysis or workflows. Cloud environments may consist of a single machine or cluster of machines and be configured with various amounts of RAM and persistent disk. Cloud environments may be in a running or stopped state. Note that even in the stopped state, cloud environments may continue to incur charges, for example, for persistent disk space allocated.
  7. A Terra Billing Project - for specifying the Google Cloud Billing Account charged for GCP cloud compute costs incurred by the workspace. When Terra Billing Projects are created, they are linked to a Google Cloud Billing Account. When a workspace is created, it is linked with a Terra Billing Project and thereby, to a Google Cloud Billing Account.
  8. Permissions - for controlling who can view, clone, update, or share a workspace and who can launch cloud environments in the workspace.
  9. Authorization Domains - for controlling who can access a workspace’s data. When a workspace is created, it can be associated with zero or more authorization domains. Once a workspace is created, its authorization domains cannot be modified. When a workspace is cloned, the clone inherits all of the authorization domains of the original workspace. At the time of cloning, it is possible to add authorization domains but not to remove them. Users wishing to access the workspace’s data must be members of all of the workspace’s authorization domains.
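
The authorization-domain rules in item 9 can be sketched as simple set operations. This is a minimal illustration in Python, not Terra’s implementation; the function names and group names below are hypothetical.

```python
def can_access_data(user_groups, workspace_domains):
    """A user may access workspace data only if they belong to every authorization domain."""
    return set(workspace_domains) <= set(user_groups)


def clone_domains(original_domains, added_domains=()):
    """A clone inherits all original domains; more can be added, none removed."""
    return frozenset(original_domains) | frozenset(added_domains)


# Hypothetical group names, for illustration only.
original = frozenset({"consortium-a", "dbgap-study-x"})
clone = clone_domains(original, added_domains={"lab-internal"})

member = {"consortium-a", "dbgap-study-x"}
print(can_access_data(member, original))  # True: member of both domains
print(can_access_data(member, clone))     # False: not in the added domain
```

Because a clone can only gain domains, anyone who can access a clone’s data could also access the data in the workspace it was cloned from.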

Workspace Actions

Basic actions that can be performed on workspaces are:

Create - Members of Terra Billing Projects can create their own workspace from scratch and associate their Terra Billing Project with the workspace.

Clone - Terra Billing Project Members can also clone an existing workspace. Cloning a workspace copies its data and notebooks while possibly changing its Terra Billing Project or adding authorization domains.

Launch - Users with “can-compute” permissions on a workspace can configure and launch cloud environments in the workspace to analyze the workspace’s data. Cloud costs for the launched environments will be passed through to the Google Cloud Billing Account associated with the workspace’s Terra Billing Project.

Share - Users with “can-share” permissions on a workspace can share the workspace and allow others to read and potentially update, launch, and share it.

Workspace Permissions

If you are an Owner, Writer, or Reader of a workspace, Terra displays the workspace in your “Workspaces List.”

You may also have can-share or can-compute permissions depending on your role and the permissions you were granted when the workspace was shared with you. The possible workspace permissions are listed below by role.

Role      Can Read   Can Modify   Can Compute        Can Share
Owner     Yes        Yes          Yes                Yes
Writer    Yes        Yes          Set when shared.   Set when shared.
Reader    Yes        No           No                 Set when shared.

Owner - If you created a workspace, you are the workspace’s Owner and can read, modify, share, and execute the workspace. When sharing workspaces with Readers, you can allow them to share with other readers. When sharing workspaces with Writers, you can enable them to execute or share with other writers and readers. Workspace owners can also change the workspace’s Terra Billing Project.

Writer - If you have “Writer” access to a workspace, you can read and modify the workspace. The person who shared the workspace with you may also have allowed you to execute the workspace by giving you can-compute privileges or to share the workspace by giving you can-share privileges.

Reader - If you have “Reader” access to a workspace, you can see the workspace in your workspace list and view the workspace’s dashboard. The person who shared the workspace with you may also have allowed you to share the workspace with other readers by giving you can-share privileges.

Can-compute - Writers may be given “can-compute” privileges allowing them to launch cloud environments.

Can-share - Readers or Writers may be given “can-share” privileges, allowing them to share the workspace with others.

In general, if you can share a workspace, you can give the new user the same permissions you have or fewer.
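
The role and permission rules in this section can be expressed as a small lookup. The following is a hedged sketch in Python that mirrors the descriptions above. It is not Terra’s API, and the names `ROLE_DEFAULTS`, `effective_permissions`, and `may_grant` are hypothetical.

```python
# Capabilities per role. None means "set when shared".
ROLE_DEFAULTS = {
    "Owner":  {"read": True, "modify": True,  "compute": True,  "share": True},
    "Writer": {"read": True, "modify": True,  "compute": None,  "share": None},
    "Reader": {"read": True, "modify": False, "compute": False, "share": None},
}


def effective_permissions(role, can_compute=False, can_share=False):
    """Resolve a user's permissions from their role plus the flags chosen at share time."""
    granted = {"compute": can_compute, "share": can_share}
    resolved = {}
    for capability, default in ROLE_DEFAULTS[role].items():
        # Fixed entries ignore the share-time flag; None entries take it.
        resolved[capability] = granted[capability] if default is None else default
    return resolved


def may_grant(sharer, recipient):
    """A sharer can pass on the same permissions they hold, or fewer."""
    return sharer["share"] and all(sharer[c] for c, wanted in recipient.items() if wanted)


writer = effective_permissions("Writer", can_compute=True, can_share=True)
reader = effective_permissions("Reader", can_compute=True)  # compute stays False for Readers

print(reader["compute"])          # False
print(may_grant(writer, reader))  # True: a sharing Writer can add Readers
print(may_grant(reader, writer))  # False: this Reader cannot share at all
```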

Workspaces and Cloud Costs

AnVIL and all of its components are free to use; however, as Terra runs on the Google Cloud Platform (GCP), certain workspace activities, such as running an analysis, storing analysis results, or downloading data, incur GCP fees.

Performing the following workspace activities will incur costs on GCP that will be passed through to the workspace’s Terra Billing Project’s Google Cloud Billing Account:

  1. Uploading data to the workspace bucket - the upload network transfer or ingress is free; however, there will be a GCP fee for storing the data in the bucket over time.
  2. Launching a Cloud Environment - The charges will depend on the type of machine and number of processors selected as well as any disk or RAM space used. This is also referred to as “Launching a Workspace.”
  3. Storage for persistent disk associated with any running or stopped cloud environments.
  4. Storage for notebooks as these are saved in the workspace’s Cloud Storage bucket.
  5. Downloading data from the workspace’s Cloud Storage bucket unless this bucket is configured to be “requester pays”. For requester pays buckets, users must select their own Terra Billing Project to pay for the GCP egress fees.
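
To make the flow of charges concrete, the sketch below models which Google Cloud Billing Account pays for an activity. It is an illustrative Python model of the rules listed above, not a Terra or GCP API; the class names, the `account_charged` function, and the account identifiers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BillingProject:
    name: str
    google_billing_account: str  # the linked Google Cloud Billing Account


@dataclass
class Workspace:
    name: str
    billing_project: BillingProject
    requester_pays: bool = False


def account_charged(workspace, activity, requester_project=None):
    """Return the Google Cloud Billing Account that pays for an activity."""
    if activity == "egress" and workspace.requester_pays:
        if requester_project is None:
            raise ValueError("requester-pays bucket: select a Terra Billing Project")
        return requester_project.google_billing_account
    # Storage, compute, and egress from ordinary buckets fall to the workspace.
    return workspace.billing_project.google_billing_account


# Hypothetical projects and account identifiers.
lab = BillingProject("lab-project", "billing-account-lab")
mine = BillingProject("my-project", "billing-account-mine")
ws = Workspace("shared-analysis", lab, requester_pays=True)

print(account_charged(ws, "compute"))                         # billing-account-lab
print(account_charged(ws, "egress", requester_project=mine))  # billing-account-mine
```

Note that compute and storage are always charged through the workspace’s Terra Billing Project, regardless of which user launched the environment or uploaded the data.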

Setting Up Cloud Billing

Setting up Billing as an Individual

Setting up GCP billing as an individual is a good way for all users to get started with the platform, as Google funds new accounts with $300 in free cloud credits.

To set up GCP billing as an individual, the general process is as follows:

  1. Create a Google Cloud Account and set up a payment method. Be sure to create the Google Cloud Account using the same Google ID (email address) you use for your Terra account.
  2. Create a Google Billing Account and link it to Terra by adding terra-billing@terra.bio as a Billing Account User to the account.
  3. Set up a GCP Billing Account Budget and appropriate alerts.
  4. In Terra, create a Terra Billing Project and use it to create or clone workspaces and pay for any compute, storage, or egress fees.

If you plan to share your Terra Billing Project or a workspace with others, be sure you (and they) have a basic understanding of cloud costs and how cloud costs flow through to the workspace’s (and not the user’s) Terra Billing Project.

Setting up Billing for a Lab

Setting up cloud cost billing for a lab is similar, except that you will need to plan your account setup so that expenses can be assigned to the appropriate funding sources and so that cloud cost reporting, budgets, and alerts work at the appropriate granularity.

Budgets and alerts are set at the Terra Billing Project level, so you may end up having a Terra Billing Project per lab member and per shared workspace.

You will also want to be deliberate in your planning about who can share Terra Billing Projects and who can share Terra workspaces with can-compute permissions. For example, you may assign a lab manager who creates workspaces for users and allows them to execute but not share the workspace.

Getting Help

See Getting Help for more information on how to obtain support for AnVIL’s components and tools.

