Step 2 - Set Up a Data Model

After your dataset has been approved by the AnVIL Data Ingestion Committee, you will need to set up and submit your data model, specifying what data you have and how data are connected.

The AnVIL Data Model is intended to:

Standardize the data submitted to the AnVIL in order to accept a broad range of high-quality data across consortia
Maximize data findability and usefulness, and facilitate cross-study analysis

The first goal requires a very flexible data model, and the second requires some constraints on the data model. The guidelines below are intended to help meet those two goals.

As you get started, we recommend you review the AnVIL Data Model Data Dictionary here to learn more about the general structure (which we will describe in more detail below). You’ll coordinate with the AnVIL data ingest team to facilitate submission activities.

If your dataset has been accepted by AnVIL and does not easily fit into an existing template, please reach out to the AnVIL Team at help@lists.anvilproject.org.

You’ll end this step with a better understanding of what your data dictionary will include for your data model. Step 3 will instruct you on preparing your data according to your data model for submission to AnVIL.

2.1 - First Steps

Definitions

Data Model: An implementable definition of a data representation. A data model should describe the entities, attributes, and relationships, and how these are represented, including format and content. Successful models facilitate clear communication of the data and cover the majority of the data in a well-specified domain.

Data Dictionary: An unambiguous description of a specific table of data. A data dictionary should define what the fields are, what values they can have, and any constraints. Data dictionaries must allow for data to be validated for conformance. Data dictionaries:

Must include descriptions for each column for humans to understand the data
Must include data types and coded terms for computers to understand the data
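
As a rough illustration of these two requirements, the sketch below pairs a human-readable description with a machine-checkable type and optional coded terms for each field, then validates one row against it. The field names (`sample_id`, `sample_type`, `age_at_collection`), the `FieldSpec` class, and the `validate_row` helper are hypothetical and are not part of the AnVIL Data Model; this is one possible shape under those assumptions, not an AnVIL tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldSpec:
    """One column of a data dictionary (hypothetical structure)."""
    name: str
    description: str                      # for humans
    dtype: type                           # for computers
    allowed: Optional[frozenset] = None   # coded terms, if any
    required: bool = False


# Hypothetical dictionary for a single table; not taken from AnVIL.
EXAMPLE_DICTIONARY = [
    FieldSpec("sample_id", "Unique identifier for the sample", str, required=True),
    FieldSpec("sample_type", "Kind of material collected", str,
              allowed=frozenset({"blood", "saliva", "tissue"})),
    FieldSpec("age_at_collection", "Age in years at collection", int),
]


def validate_row(row, dictionary):
    """Return a list of conformance errors for one row (empty if it conforms)."""
    errors = []
    known = {spec.name for spec in dictionary}
    for spec in dictionary:
        value = row.get(spec.name)
        if value is None:
            if spec.required:
                errors.append(f"{spec.name}: missing required value")
            continue
        if not isinstance(value, spec.dtype):
            errors.append(f"{spec.name}: expected {spec.dtype.__name__}")
            continue
        if spec.allowed is not None and value not in spec.allowed:
            errors.append(f"{spec.name}: {value!r} is not an allowed value")
    # Columns that the dictionary does not describe are also flagged.
    for name in row:
        if name not in known:
            errors.append(f"{name}: not defined in the data dictionary")
    return errors
```

Calling `validate_row` on each row of a table gives a simple conformance report; a real submission would rely on the field definitions in the AnVIL Data Dictionary instead of this toy dictionary.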

Why a Data Model?

Shared data are often inconsistent in the ontologies used, formatting, and data types. Established data models provide clear, well-specified standards, formats, terminologies, and ontologies, which makes data easier to exchange and use. The purpose of using a data model when submitting to AnVIL is to maximize the usability of the data. Through using a data model, AnVIL aims to have research data that are Findable, Accessible, Interoperable, and Reusable (FAIR).

Coordinate with the AnVIL Data Ingest Team

To collaborate with AnVIL on uploading data, notify AnVIL that your dbGaP registration is complete in one of two ways:

Through an already open AnVIL Zendesk ticket
By contacting AnVIL support at anvil-data@broadinstitute.org

It is useful to coordinate with AnVIL before you start to set up your data model, in case you run into questions or problems.

2.2 - Create Your AnVIL Data Model

A data model explicitly determines the structure of data. It organizes data elements and standardizes how the data elements relate to one another. Data models should describe data with rich metadata that can be registered or indexed.

We encourage you to read the README and AnVIL Table Overview tabs in the AnVIL Data Model data dictionary as you begin creating your AnVIL Data Model.

These tabs include helpful information such as

The overall purpose of the data dictionary
Information on the AnVIL Data Model (findability subset)
Details on how to expand the data model to fit your needs
Expectations on how to submit your schema

Data Model Requirements

The linked Data Dictionary provides context for mapping data to the AnVIL core findability model. It includes information on the tables to be included, which concepts should be included in each table, and suggested coding systems for each. The Data Dictionary is a learning resource to assist users with creating their own AnVIL Data Model for submission.

Please read the descriptions for each table outlined in the Data Dictionary carefully, and reach out to your AnVIL team contact with questions not addressed in the data dictionary document. These data model requirements help ensure that AnVIL datasets are useful not only to the researchers who created them but also to others who analyze data collectively across studies in the AnVIL Terra platform.

Start by considering what data you have and how you have already organized them. Then consider whether they fit the requirements as they are or need to be reorganized to fit.

Core Tables for All Studies

At a high level, there are three core tables (“entities”) in the AnVIL Data Model: BioSample, Donor, and File.

The BioSample table is the only required table, although Donor and File tables are strongly encouraged to improve the usability and findability of the data.

The specifications for each table, including required fields, strongly recommended fields, field names, and field descriptions, are included in the Data Dictionary. Brief descriptions of these core tables, with notable requirements, are below.

Note that data models for submitted datasets are not restricted to what is included in the AnVIL Data Model.

BioSample (required): Contains information about the sample(s) included in the study. Example data types include the anatomical site from which the biosample was taken and the cell type.
- The biosample_id (first column) is required.
- If a File table will not be submitted, the BioSample table must contain a column indicating which samples correspond to which files.
Donor (strongly recommended): Contains demographic and phenotypic information about the donor. Example data types include phenotypic sex, reported ethnicity, and genetic ancestry.
- The donor_id (first column) is required.
File (strongly recommended): Contains information for files associated with the study. Example data includes DRS ID, filename, and file type.
- The file_id (first column) is required.
- It is strongly recommended that the table includes a BioSample.biosample_id column to link the biosample id between tables.
- AnVIL will add DRS URIs for object files as part of the ingest process.
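
The link between the File and BioSample tables described above can be checked before submission. The following is a minimal sketch, assuming each table has been loaded as a list of dictionaries keyed by column name; the `unresolved_links` helper and the sample IDs are hypothetical, while the column names `biosample_id`, `file_id`, and `BioSample.biosample_id` come from the descriptions above.

```python
def unresolved_links(child_rows, link_column, parent_rows, parent_id_column):
    """Return child rows whose link value is missing or matches no parent ID."""
    parent_ids = {row[parent_id_column] for row in parent_rows}
    return [row for row in child_rows if row.get(link_column) not in parent_ids]


# Hypothetical example rows; real tables would be read from your own files.
biosample_rows = [{"biosample_id": "BS1"}, {"biosample_id": "BS2"}]
file_rows = [
    {"file_id": "F1", "BioSample.biosample_id": "BS1"},
    {"file_id": "F2", "BioSample.biosample_id": "BS9"},  # no such biosample
    {"file_id": "F3"},                                   # link column missing
]

orphans = unresolved_links(
    file_rows, "BioSample.biosample_id", biosample_rows, "biosample_id"
)
print([row["file_id"] for row in orphans])  # prints ['F2', 'F3']
```

The same helper could be pointed at any other pair of linked tables by changing the column names passed in.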

Optional Tables

The AnVIL Data Model Data Dictionary includes other optional tables you can use to contain data about conditions, activity, and your project. Brief descriptions of these optional tables, with notable requirements, are below.

Condition Table: Contains information about condition(s) and phenotypes associated with a donor.
- The condition_id (first column) is required.
- It is strongly recommended that the table includes a donor_id column to link the donor id between tables.
Activity Table: Contains details on different types of activities used to generate or process data. Example data includes sequencing method, reference assembly, and assay type.
- The activity_id (first column) is required.
- Should include a file_id column that references the associated file.
Project Table: Contains information about the project the study is a part of. Example data includes funding and principal investigator. It is strongly recommended that the table includes a title field.
- The project_id (first column) is required.
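
Every table described above requires its ID field as the first column. The sketch below shows one way to check that rule, assuming (this is an assumption, not an AnVIL requirement stated here) that each table is stored as tab-separated text with a header row. The `check_id_column` helper and the `ID_COLUMNS` mapping are hypothetical names; the ID column names themselves are the ones listed above.

```python
import csv
import io

# ID column expected first in each table, per the descriptions above.
ID_COLUMNS = {
    "BioSample": "biosample_id",
    "Donor": "donor_id",
    "File": "file_id",
    "Condition": "condition_id",
    "Activity": "activity_id",
    "Project": "project_id",
}


def check_id_column(table_name, tsv_text):
    """Return a list of problems with the ID column of one table."""
    expected = ID_COLUMNS[table_name]
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader, [])
    if not header or header[0] != expected:
        return [f"first column must be {expected}"]
    problems = []
    # Data rows start on line 2, after the header.
    for line_number, row in enumerate(reader, start=2):
        if not row or not row[0].strip():
            problems.append(f"row {line_number}: empty {expected}")
    return problems
```

For example, `check_id_column("Condition", "condition_id\tdonor_id\nC1\tD1\n")` returns an empty list, while a table whose first column is `donor_id` would be reported.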

Non-standard Data Models

If your dataset has been accepted by AnVIL and has needs not described here, please refer to the README on “Expanding the data model”. Additional tables must be associated with one of the core tables: BioSample, Donor, or File. If you need further assistance, please reach out to the AnVIL Team at help@lists.anvilproject.org. You can work directly with the AnVIL Phenotype WG and Data Ingest WG to integrate into one of these existing data models.

2.3 - Generate Your Data Dictionary

All AnVIL studies must submit a Data Dictionary table (spreadsheet file) that defines the complete data model. For each table in the data model, in a separate tab per table, it includes:

Field names
Field descriptions
Field types
Examples
Enumeration values (where applicable)
Multi-value delimiter symbols used (where applicable)
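
To make the role of enumeration values and multi-value delimiters concrete, here is a small sketch of how one dictionary entry might drive the parsing of a cell. The `DictionaryEntry` class, the `parse_cell` helper, the `file_format` field, and the `|` delimiter are all hypothetical choices for illustration; the AnVIL template defines the actual columns and values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DictionaryEntry:
    """One row of a data dictionary tab (hypothetical structure)."""
    field_name: str
    description: str
    field_type: str
    example: str
    enumeration: tuple = ()          # allowed values; empty means unrestricted
    multi_value_delimiter: str = ""  # empty means the field is single-valued


# Hypothetical entry; not taken from the AnVIL template.
EXAMPLE_ENTRY = DictionaryEntry(
    field_name="file_format",
    description="Format(s) of the files derived from the sample",
    field_type="string",
    example="bam|bai",
    enumeration=("bam", "bai", "vcf"),
    multi_value_delimiter="|",
)


def parse_cell(entry, raw):
    """Split a raw cell into values and check them against the enumeration."""
    if entry.multi_value_delimiter:
        parts = raw.split(entry.multi_value_delimiter)
    else:
        parts = [raw]
    values = [part.strip() for part in parts if part.strip()]
    if entry.enumeration:
        unknown = [value for value in values if value not in entry.enumeration]
        if unknown:
            raise ValueError(f"{entry.field_name}: unknown value(s) {unknown}")
    return values
```

With this entry, `parse_cell(EXAMPLE_ENTRY, "bam| bai")` yields the two values `bam` and `bai`, and a value outside the enumeration raises a `ValueError`. Recording the delimiter in the dictionary is what lets a reader or a script split such cells unambiguously.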

To maximize usability, data models should anchor to the listed coding systems.

For a template Data Dictionary with all required and suggested tables, click here. To download the AnVILDataSubmissionFindabilitySubsetSchema.template.xlsx file, click on the three-dot icon at the top right and then click Download.

Suggested Proprietary Coding Systems

Table	Concept(s)	Coding systems
Donor	Organism type label, Organism type coding system	NCI Organismal Classification
Donor	Donor type	Donor type values
Donor	Phenotype sex code	Phenotype sex coding system
Donor	Reported ethnicity label, Genetic ancestry label	Suggested Ancestry Values
Donor	Reported ethnicity coding system, Genetic ancestry coding system	Genetic ancestry coding system
BioSample	Anatomical site label	Anatomical Site Allowed Values
BioSample	Anatomical site coding system	Anatomical site coding system
BioSample	A priori cell type label	A priori cell type suggested values
BioSample	A priori cell type coding system	A priori cell type
BioSample	BioSample type label	BioSample type values
BioSample	BioSample type coding system	BioCore Terms
BioSample	Primary condition label	Suggested primary condition label values
BioSample	Primary condition coding system
BioSample	Primary condition affected status	Boolean true/false
Activity	Activity type label	Activity type values
Activity	Activity type coding system	BioCore Values

Additional Resources

Contact information

AnVIL Data Ingest Team anvil-data@broadinstitute.org
AnVIL Help Team help@lists.anvilproject.org
