One of the most widely used resources for studying the relationship between genetic variation and gene expression is the Genotype-Tissue Expression (GTEx) project. The project was established by the NIH Common Fund in 2010, and its recent V8 dataset represents the largest atlas of human gene expression and corresponding quantitative trait loci to date (dbGaP accession: phs000424.v8.p2).
This dataset contains genotype data from 838 postmortem donors and 17,382 RNA-seq samples across 54 tissue sites and 2 cell lines. The GTEx Portal provides uniformly processed gene expression data, a QTL Browser, and a mechanism for requesting available biospecimens, so that researchers can study the impact of genetic variation on complex traits and diseases.
Controlled access to raw and protected data that may identify donors is provided through the AnVIL Project. Established in 2018, the NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL) provides a cloud-based environment that colocates an extensive collection of high-value datasets with commonly used bioinformatics tools in a secure computing environment.
In support of the evolving nature of the NHGRI mission, we are pleased to announce that researchers are now able to download controlled-access GTEx V8 data to local compute infrastructure without incurring egress fees. Researchers will still need to apply for approval from dbGaP to access these data and must maintain data security on their institutional clusters, but this new capability will save them substantial expense, especially those who wish to integrate GTEx with their own clinical research data. Download instructions can be found at https://anvilproject.org/learn/reference/gtex-v8-free-egress-instructions.
We foresee that in the long term, more users will choose to analyze GTEx and other large datasets directly within AnVIL’s cloud environment. AnVIL offers an elastic, shared computing resource with active threat detection and monitoring, an increasingly attractive alternative to redundantly downloading data across siloed compute infrastructure.
AnVIL is built on a set of established components that have been used in a number of flagship scientific projects. The Terra platform provides a compute environment with secure data and analysis sharing capabilities. Dockstore provides standards-based sharing of containerized tools and workflows. R/Bioconductor and Galaxy provide environments for users at different skill levels to construct and execute analyses. The Gen3 data commons framework provides data and metadata ingest, querying, and organization.
Taken together, these components allow researchers on the AnVIL cloud platform to build novel cohorts out of GTEx and other major NHGRI datasets, such as CCDG, CMG, and eMERGE, and to compute on them in place with scalable, easy-to-use tools.