AnVIL Portal

Step 2 - Set Up a Data Model


As you get started, we recommend reviewing the AnVIL Data Model Data Dictionary here to learn about its general structure, which is described in more detail below. You’ll coordinate with the AnVIL Data Ingest Team to facilitate submission activities.

If your dataset has been accepted by AnVIL and does not easily fit into an existing template, please reach out to the AnVIL Team at help@lists.anvilproject.org.

You’ll end this step with a better understanding of what your data dictionary will include for your data model. Step 3 will instruct you on preparing your data according to your data model for submission to AnVIL.

2.1 - First Steps

Coordinate with the AnVIL Data Ingest Team

To collaborate with AnVIL on uploading data, you’ll notify AnVIL of completed dbGaP registration in one of two ways.

It is useful to coordinate with AnVIL before you start to set up your data model, in case you run into questions or problems.

2.2 - Create Your AnVIL Data Model

A data model explicitly determines the structure of data. It organizes data elements and standardizes how the data elements relate to one another.

We encourage you to read the README and AnVIL Table Overview tabs in the AnVIL Data Model data dictionary as you begin creating your AnVIL Data Model.

These tabs include helpful information such as

  • The overall purpose of the data dictionary
  • Information on the AnVIL Data Model (findability subset)
  • Details on how to expand the data model to fit your needs
  • Expectations on how to submit your schema

Data Model Requirements

The linked Data Dictionary provides context for mapping data to the AnVIL core findability model. It includes information on the tables to be included, which concepts should be included in each table, and suggested coding systems for each. The Data Dictionary is a learning resource to assist users with creating their own AnVIL Data Model for submission.

Please read the descriptions for each table outlined in the Data Dictionary carefully, and reach out to your AnVIL team contact with questions not addressed in the Data Dictionary document. These data model requirements help ensure that AnVIL datasets are not only useful to the researchers who created them but can also be analyzed collectively across studies by others in the AnVIL Terra platform.

Start by taking stock of the data you have and how it is currently organized. Then consider whether it already fits the requirements or needs to be reorganized to fit them.

Core Tables for All Studies

At a high level, there are three core tables (“entities”) in the AnVIL Data Model: BioSample, Donor, and File.

The BioSample table is the only required table, although Donor and File tables are strongly encouraged to improve the usability and findability of the data.

The specifications for each table, including required fields, strongly recommended fields, field names, and field descriptions, are included in the Data Dictionary. Brief descriptions of these core tables, with notable requirements, are below.

  • BioSample (required): Contains information about the sample(s) included in the study. Example data types include the anatomical site from which the biosample was taken and the cell type.
    • The biosample_id (first column) is required.
    • If a File table will not be submitted, the BioSample table must contain a column indicating which samples correspond to which files.
  • Donor (strongly recommended): Contains demographic and phenotypic information about the donor. Example data types include phenotypic sex, reported ethnicity, and genetic ancestry.
    • The donor_id (first column) is required.
  • File (strongly recommended): Contains information for files associated with the study. Example data includes DRS ID, filename, and file type.
    • The file_id (first column) is required.
    • It is strongly recommended that the table includes a BioSample.biosample_id column to link the biosample id between tables.
    • AnVIL will add DRS URIs for object files as part of the ingest process.
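
The core-table requirements above can be checked before submission. The following is a minimal sketch, not an official AnVIL tool. It assumes the tables are available as tab-separated text. The sample rows and the anatomical_site and file_type column names are made up for illustration; only biosample_id, file_id, and the BioSample.biosample_id link column come from the descriptions above.

```python
import csv
import io

# Hypothetical example tables (tab-separated); the rows are invented.
BIOSAMPLE_TSV = "biosample_id\tanatomical_site\nBS1\tliver\nBS2\tblood\n"
FILE_TSV = (
    "file_id\tBioSample.biosample_id\tfile_type\n"
    "F1\tBS1\tcram\n"
    "F2\tBS3\tcram\n"
)


def read_table(text):
    """Parse tab-separated text into (header, rows)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = list(reader)
    return reader.fieldnames, rows


def check_id_column(header, id_column):
    """Return True if the required ID column is the first column."""
    return bool(header) and header[0] == id_column


def unlinked_files(file_rows, biosample_rows):
    """Return file_ids whose BioSample.biosample_id is missing from BioSample."""
    known = {row["biosample_id"] for row in biosample_rows}
    return [
        row["file_id"]
        for row in file_rows
        if row["BioSample.biosample_id"] not in known
    ]


bs_header, bs_rows = read_table(BIOSAMPLE_TSV)
file_header, file_rows = read_table(FILE_TSV)
print(check_id_column(bs_header, "biosample_id"))  # True
print(unlinked_files(file_rows, bs_rows))          # ['F2']
```

In this made-up example, file F2 points at biosample BS3, which does not appear in the BioSample table, so it is reported as unlinked.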

Optional Tables

The AnVIL Data Model Data Dictionary includes other optional tables you can use to contain data about conditions, activity, and your project. Brief descriptions of these optional tables, with notable requirements, are below.

  • Condition Table: Contains information about condition(s) and phenotypes associated with a donor.
    • The condition_id (first column) is required.
    • It is strongly recommended that the table includes a donor_id column to link the donor id between tables.
  • Activity Table: Contains details on different types of activities used to generate or process data. Example data includes sequencing method, reference assembly, and assay type.
    • The activity_id (first column) is required.
    • The table should include a file_id column that references the associated file.
  • Project Table: Contains information about the project the study is a part of. Example data includes funding and principal investigator. It is strongly recommended that the table includes a title field.
    • The project_id (first column) is required.
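
As a rough illustration, the optional-table rules above can be written down as a small lookup and applied to a table's column header. This is a sketch under stated assumptions: the OPTIONAL_TABLES layout and the review_header function are hypothetical, and the authoritative field list remains the Data Dictionary.

```python
# Required first column and recommended columns for each optional table,
# taken from the list above. The dictionary layout itself is illustrative.
OPTIONAL_TABLES = {
    "Condition": ("condition_id", ["donor_id"]),
    "Activity": ("activity_id", ["file_id"]),
    "Project": ("project_id", ["title"]),
}


def review_header(table, header):
    """Return a list of notes about a table's column header."""
    id_column, recommended = OPTIONAL_TABLES[table]
    notes = []
    if not header or header[0] != id_column:
        notes.append(f"{id_column} must be the first column")
    for column in recommended:
        if column not in header:
            notes.append(f"consider adding a {column} column")
    return notes


print(review_header("Condition", ["condition_id", "donor_id"]))  # []
print(review_header("Project", ["project_id"]))
# ['consider adding a title column']
```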

Non-standard Data Models

If your dataset has been accepted by AnVIL and has needs not described here, please refer to the README section on “Expanding the data model”. Additional tables must be associated with one of the core tables: BioSample, Donor, or File. If you need further assistance, please reach out to the AnVIL Team at help@lists.anvilproject.org. You can work directly with the AnVIL Phenotype WG and Data Ingest WG to integrate your data into one of these existing data models.

2.3 - Generate Your Data Dictionary

Every AnVIL study must submit a Data Dictionary (a spreadsheet file) that defines its complete data model. The spreadsheet has a separate tab for each table in the data model, and each tab lists field names, field descriptions, field types, examples, enumeration values (where applicable), and multi-value delimiter symbols (where applicable).

For a template Data Dictionary with all required and suggested tables, click here. To download the AnVILDataSubmissionFindabilitySubsetSchema.template.xlsx file, click on the three-dot icon at the top right and then click Download. 
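
To make the Data Dictionary attributes concrete, here is a minimal sketch of how one field definition might be represented and used to check a data value against its enumeration and multi-value delimiter. The FieldDefinition class, its attribute names, and the example field (including its enumeration values and the "|" delimiter) are all hypothetical; the template spreadsheet defines the actual layout.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FieldDefinition:
    """One row of a data dictionary tab (attribute names are illustrative)."""
    name: str
    description: str
    field_type: str
    example: str
    enumeration: List[str] = field(default_factory=list)  # empty: any value
    delimiter: Optional[str] = None  # set only for multi-value fields


def invalid_values(definition, cell):
    """Return the values in a cell that are not in the field's enumeration."""
    if not definition.enumeration:
        return []
    if definition.delimiter:
        values = cell.split(definition.delimiter)
    else:
        values = [cell]
    return [v.strip() for v in values if v.strip() not in definition.enumeration]


# Hypothetical field; the enumeration values and delimiter are made up.
file_type = FieldDefinition(
    name="file_type",
    description="Format of the file",
    field_type="string",
    example="cram|crai",
    enumeration=["cram", "crai", "vcf"],
    delimiter="|",
)

print(invalid_values(file_type, "cram|bam"))  # ['bam']
```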

Suggested Proprietary Coding Systems

Table | Concept(s) | Coding systems
Donor | Organism type label, Organism type coding system | NCI Organismal Classification
Donor | Donor type | Donor type values
Donor | Phenotype sex code | Phenotype sex coding system
Donor | Reported ethnicity label, Genetic ancestry label | Suggested Ancestry Values
Donor | Reported ethnicity coding system, Genetic ancestry coding system | Genetic ancestry coding system
BioSample | Anatomical site label | Anatomical Site Allowed Values
BioSample | Anatomical site coding system | Anatomical site coding system
BioSample | A priori cell type label | A priori cell type suggested values
BioSample | A priori cell type coding system | A priori cell type
BioSample | BioSample type label | BioSample type values
BioSample | BioSample type coding system | BioCore Terms
BioSample | Primary condition label | Suggested primary condition label values
BioSample | Primary condition coding system |
BioSample | Primary condition affected status | Boolean true/false
Activity | Activity type label | Activity type values
Activity | Activity type coding system | BioCore Values

Additional Resources

Contact information

  • AnVIL Data Ingest Team: anvil-data@broadinstitute.org
  • AnVIL Help Team: help@lists.anvilproject.org

