AnVIL hosts high-value datasets relevant to human health and disease.

AnVIL data sets include NHGRI-funded data sets from the Centers for Common Disease Genomics (CCDG), the Centers for Mendelian Genomics (CMG), and the Genotype-Tissue Expression (GTEx) project. Additional data sets include the high-coverage 1000 Genomes whole genome sequencing data and VCFs, with more data sets to be added over time.

Researchers can gain access to AnVIL-hosted CCDG, CMG, or GTEx data by submitting a data access request through dbGaP. All 1000 Genomes data is publicly accessible.

Users can upload their own data to AnVIL and use access-controlled sharing mechanisms to manage who can work with it within secure environments. These access controls can also be used to protect pre-release datasets while collaborators work in the cloud.

AnVIL aims to host data sets of high value to the biomedical research community, serving both basic and clinical research. AnVIL is open to hosting additional data sets, including public-access and restricted-access data. See the AnVIL data ingestion guide for further information (coming soon).

Data Sets


CCDG

The Centers for Common Disease Genomics (CCDG) is a large-scale genome sequencing effort to comprehensively identify rare risk and protective variants contributing to multiple common disease phenotypes.


CMG

The Centers for Mendelian Genomics (CMG) is a multi-center collaboration aimed at identifying the genes responsible for Mendelian phenotypes by whole exome and whole genome sequencing.


GTEx

The Genotype-Tissue Expression (GTEx) project is an ongoing effort to build a comprehensive public resource to study tissue-specific gene expression and regulation. Samples were collected from 54 non-diseased tissue sites across nearly 1000 individuals, primarily for molecular assays including WGS, WES, and RNA-Seq.

1000 Genomes

The 1000 Genomes Project, launched in January 2008, is an international research effort to establish variation profiles across the human population. This open access data set continues to be a valuable resource to geneticists.


eMERGE

The Electronic Medical Records and Genomics (eMERGE) project is a national network organized and funded by the NHGRI that combines DNA biorepositories with electronic medical record (EMR) systems for large-scale, high-throughput genetic research in support of implementing genomic medicine.
