Step 2 - Set Up a Data Model
After your dataset has been approved by the AnVIL data ingestion committee, you will need to set up and submit your data model, specifying what data you have and how data are connected.
An AnVIL data model is intended to:
- Accept/store as much data as possible
- Maximize data findability and usefulness, and facilitate cross-study analysis
The first goal requires a very flexible data model. The second requires some constraints on the data model. The guidelines below are intended to help meet those two goals.
You can choose to start with one of two template data models and adjust to meet your needs. You’ll coordinate with the AnVIL data ingest team to facilitate this. If your dataset has been accepted by AnVIL and does not easily fit into an existing template, reach out to the AnVIL Team at firstname.lastname@example.org.
You’ll end this step by completing an intake form to send the data model (in a data dictionary spreadsheet) and all the information the AnVIL team needs to set up your data workspace on the AnVIL.
2.1 - First Steps
Coordinate with the AnVIL Data Ingest Team
Email email@example.com to arrange an AnVIL kickoff meeting to discuss your data, data model, and ingest timeline.
Register for a Terra Account
AnVIL data are stored and organized in Terra data-oriented workspaces. You will need a Terra Account to upload data into AnVIL. If you do not already have an account on Terra, you will find step-by-step instructions to register at Creating a Terra Account.
Registering for a Terra account is free, and AnVIL pays all costs associated with uploading and storing your data. Note that in order to complete an analysis, you will need to connect a Google Billing Account to your Terra account. See Overview of Billing Concepts and Creating a Google Cloud Billing Account for more information.
2.2 - Create Your Data Model
Nodes in the AnVIL Data Model (e.g., “Program” or “Subject” in the diagram below) hold different types of data (called “properties”). Each node is a table in a data workspace in AnVIL. Nodes are connected to each other by unique IDs.
Data submitters will submit data and metadata from the Biospecimen, Clinical, and Data File nodes in spreadsheet-like files that will be displayed in the data workspace as integrated tables. Each row is a distinct “entity” and each column is a different property (type of data).
A data model consists of these components:
- Entities: the primary object the table contains, with a unique key (e.g., a “subject” entity for phenotypic data or a “sample” entity for genomic data). Each row in the table is a distinct entity identified by an ID key.
- Attributes/properties: the columns in a database table (e.g., phenotypic data such as demographics or lab results, or genomic data metadata)
- Associations: the unique identifiers that link data between tables (e.g., a subject_id column in the sample table that links samples with the subject)
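As a rough illustration of how entities, attributes, and associations fit together, the sketch below models a subject table and a sample table as lists of rows and follows the subject_id association from samples back to subjects. The ID values and the `sex` and `tissue` columns are hypothetical and are not part of any AnVIL template.

```python
# Each dict is one entity (row); each key is an attribute (column).
# All values and the "sex"/"tissue" columns are made up for illustration.
subjects = [
    {"subject_id": "SUBJ_001", "sex": "female"},
    {"subject_id": "SUBJ_002", "sex": "male"},
]
samples = [
    {"sample_id": "SAMP_A", "subject_id": "SUBJ_001", "tissue": "blood"},
    {"sample_id": "SAMP_B", "subject_id": "SUBJ_001", "tissue": "saliva"},
    {"sample_id": "SAMP_C", "subject_id": "SUBJ_002", "tissue": "blood"},
]


def samples_by_subject(subjects, samples):
    """Group sample IDs under the subject they are associated with."""
    index = {subject["subject_id"]: [] for subject in subjects}
    for sample in samples:
        # The subject_id column in the sample table is the association.
        index[sample["subject_id"]].append(sample["sample_id"])
    return index


linked = samples_by_subject(subjects, samples)
```

Here `linked` maps each subject_id to the list of sample_id values that reference it, which is the kind of lookup the association column makes possible.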
Data Model Requirements
Please read the descriptions below carefully, and reach out to your AnVIL team contact with any questions. The requirements below help ensure AnVIL datasets are useful not only to the researchers who created them but also to others who want to analyze data collectively across studies.
Start by thinking of what data you have and how you have already organized it. Note that to accommodate the most data, AnVIL data models allow as many attribute columns as you need. The requirements help structure all AnVIL data similarly, and make it compatible with analysis in the AnVIL Terra platform.
Required Tables for All Studies (csv, tsv, txt, json format)
All studies must submit the following tables (scroll down for details and template tables):
- Data Dictionary Table: Specifies the entire data model. For each separate table in the data model, it includes field names, field descriptions, field types, enumeration values (where applicable), and the multi-value delimiter symbol used (where applicable).
- Subject Table: Includes required information about the subjects and (usually) associated phenotypic data. The subject_id (first column) is the key field for that table. This key is typically used in other tables to link additional data (e.g., genomic, sequencing, family) to the subject.
- Sample Table: Links samples to subjects. The sample_id (first column) is the key field for that table.
- Sequencing Table: Includes required information about, and links to, the sequence data associated with the sample_id, where the filename is the key field for the table.
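Before submitting, it can help to check the key fields and links described above. The following is a minimal sketch, assuming each table has already been read into a header list and a list of rows; the function names and example values are hypothetical, and the checks are not an official AnVIL validation.

```python
def check_keys(header, rows, key):
    """Report problems with a table whose key field should be the first column."""
    if not header or header[0] != key:
        return [f"{key} must be the first column"]
    problems = []
    seen = set()
    for row in rows:
        value = row[0]
        if not value:
            problems.append(f"empty {key}")
        elif value in seen:
            problems.append(f"duplicate {key}: {value}")
        seen.add(value)
    return problems


def check_links(header, rows, column, valid_ids):
    """Report values in a linking column that do not exist in the linked table."""
    idx = header.index(column)
    return [f"unknown {column}: {row[idx]}" for row in rows if row[idx] not in valid_ids]


# Hypothetical example tables.
subject_header = ["subject_id", "sex"]
subject_rows = [["SUBJ_001", "female"], ["SUBJ_002", "male"]]
sample_header = ["sample_id", "subject_id"]
sample_rows = [["SAMP_A", "SUBJ_001"], ["SAMP_B", "SUBJ_003"]]

subject_ids = {row[0] for row in subject_rows}
key_problems = check_keys(sample_header, sample_rows, "sample_id")
link_problems = check_links(sample_header, sample_rows, "subject_id", subject_ids)
```

In this example `key_problems` is empty, while `link_problems` flags `SUBJ_003` because no subject with that ID exists in the subject table.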
Example Additional Tables (CSV format)
- Family Table: Includes information about a particular family with the family_id (first column) as the key field for the table. Data can include pedigrees or any other family-level information.
- Discovery Table: Includes information about variants of interest that are linked to the subject_id (must include a
Template Data Models
To enable cross-study analysis within AnVIL, data submitted for hosting by AnVIL should, when possible, be consistent with data models already in AnVIL or in the process of being ingested into AnVIL. To ensure this, we recommend you adopt or modify one of the Data Model templates below. Note that these are read-only copies, so make your own copy to modify.
Non-standard Data Models
If your dataset has been accepted by AnVIL and has needs not described here, please reach out to the AnVIL Team at firstname.lastname@example.org. You can work directly with the AnVIL Phenotype WG and Data Processing WG to integrate into one of these existing data models.
2.3 - Generate Your Data Dictionary
All AnVIL studies must submit a Data Dictionary table (spreadsheet file) that defines the complete data model. For each table in the data model, it includes field names, field descriptions, field types, examples, enumeration values (where applicable), and multi-value delimiter symbols used (where applicable).
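To make the listed contents concrete, here is a minimal sketch that writes a few data dictionary rows to CSV with Python's standard library. The column names, the example fields, and the use of "|" to separate values are assumptions for illustration only; follow the template your AnVIL contact provides for the actual layout.

```python
import csv
import io

# Assumed column names; the real template may label these differently.
COLUMNS = [
    "table", "field_name", "description", "field_type",
    "example", "enumeration_values", "multi_value_delimiter",
]

# Hypothetical entries, one per field of each table in the data model.
entries = [
    {"table": "subject", "field_name": "subject_id",
     "description": "Unique identifier for the subject", "field_type": "string",
     "example": "SUBJ_001", "enumeration_values": "", "multi_value_delimiter": ""},
    {"table": "subject", "field_name": "sex",
     "description": "Reported sex of the subject", "field_type": "enumeration",
     "example": "female", "enumeration_values": "female|male|unknown",
     "multi_value_delimiter": ""},
    {"table": "sample", "field_name": "sample_id",
     "description": "Unique identifier for the sample", "field_type": "string",
     "example": "SAMP_A", "enumeration_values": "", "multi_value_delimiter": ""},
    {"table": "sample", "field_name": "subject_id",
     "description": "Subject the sample was taken from", "field_type": "string",
     "example": "SUBJ_001", "enumeration_values": "", "multi_value_delimiter": ""},
]


def dictionary_csv(entries):
    """Serialize data dictionary entries as CSV text, one row per field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


dictionary_text = dictionary_csv(entries)
```

Keeping one row per field, grouped by table, means the same file can describe every table in the model.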
General Formatting Requirements
To be compatible with indexing once in AnVIL, special characters (e.g., “%” or “*”) cannot be used in any field or file name. If your field or file names contain special characters, they must all be removed or replaced before ingestion.
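A quick way to find offending names before ingestion is sketched below. The set of allowed characters (letters, digits, underscore, hyphen, and period) is an assumption, since the guidance above only names “%” and “*” as examples; confirm the exact rules with the AnVIL team.

```python
import re

# Assumption: anything other than letters, digits, "_", "-", and "." is "special".
UNSAFE = re.compile(r"[^A-Za-z0-9_.-]")


def find_special_characters(name):
    """Return the distinct special characters found in a field or file name."""
    return sorted(set(UNSAFE.findall(name)))


def sanitize(name, replacement="_"):
    """Replace each special character with a safe replacement character."""
    return UNSAFE.sub(replacement, name)


found = find_special_characters("percent_%_change*")
cleaned = sanitize("read 1*.fastq")
```

Note that renaming a file also requires updating every table that references the old name, such as the filename key in the sequencing table.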
Phenotypic Data Expectations
Currently, data stored in a phenotypic (“subject”) table will fall into one of four categories. Requirements for each category are below.
- Case/Control (or Case alone) - Information around a particular disease or phenotype of interest for a selected cohort (Example: CMG, CCDG).
- Electronic Health Record (EHR) - Data derived from EHR information (Example: eMERGE).
- Survey - Data collected from surveying study subjects (Example: CSER).
- Family longitudinal - Data collected for multiple families for multiple generations (Example: AMISH).
If your data does not fall into one of the above categories, please reach out to the AnVIL Team (email@example.com).
Required Phenotypic Data
To ensure cross-study functionality on AnVIL, dataset categories have the following requirements.
- 1 - Required
- 2 - Required if there are trios or other relationship data in the study.
| Data Elements | Case or Case/Control | EHR | Survey | Family longitudinal |
Ensuring Uniform Terminology
AnVIL includes a diverse set of studies and a wide range of collected phenotypic data. To maximize useful information for search and synthetic cohort creation, all phenotypic data:
- Must be clearly linked to a subject, and the subject must be clearly linked to other data (e.g., genome, exome, RNASeq, array, etc.).
- Must be composed (where possible) of structured values. Ideally, these values are concept codes from established ontologies including, but not limited to:
- NCIt - A vocabulary for a diverse set of biological concepts (e.g., disease, phenotype, relationship, anatomy, etc.).
- SNOMED - A vocabulary focused on concepts related to clinical data (license required).
- UMLS Metathesaurus - Links concepts from multiple vocabularies and ontologies (license required, free to individuals in the USA, includes access to SNOMED).
- UBERON - A vocabulary focused on anatomical structure.
- HPO - An ontology focused on phenotypic abnormalities.
- OMIM - An ontology for rare Mendelian diseases.
- Orphanet - An ontology for orphan drugs and rare diseases.
- ICD - An ontology for US billing codes.
- MeSH - An ontology for biomedical and health-related information.
- RxNorm - Normalized names for clinical drugs and links to many of the drug vocabularies.
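One way to see how much of a phenotype column is already structured is to separate values that look like ontology concept codes from free text. The sketch below assumes codes are written as a prefix, a colon, and a local identifier (for example, an HPO term such as HP:0001250), and it only recognizes a small, hypothetical set of prefixes; it does not check that a code actually exists in its ontology.

```python
import re

# Assumed "PREFIX:local_id" shape for concept codes.
CODE_PATTERN = re.compile(r"^(?P<prefix>[A-Za-z]+):(?P<local_id>[A-Za-z0-9]+)$")
# Hypothetical, incomplete set of accepted prefixes.
ASSUMED_PREFIXES = {"HP", "UBERON", "NCIT"}


def split_structured(values):
    """Separate values that look like concept codes from free-text values."""
    coded, free_text = [], []
    for value in values:
        match = CODE_PATTERN.match(value.strip())
        if match and match.group("prefix").upper() in ASSUMED_PREFIXES:
            coded.append(value.strip())
        else:
            free_text.append(value)
    return coded, free_text


coded, free_text = split_structured(["HP:0001250", "seizures", "UBERON:0002107"])
```

Values that end up in `free_text` are candidates for mapping to a concept code before submission.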
Genomic Data Expectations
Access Restrictions
Known Data Use Limitations (DUL) need to be clearly defined by the data depositor. The DUL lists the requirements for gaining access to and using the data. You will need to submit your protocols for gaining access at the time of ingest.
Please contact your program officer and the NHGRI Genomic Program Administrator for assistance and/or questions about dbGaP registration and/or consent groups.
Additional Resources
- Understanding entity types and the default entity types in the standard genomic model (estimated time 10 minutes).
- Formatting requirements for data tables and template upload files.
- Introduction to Data Tables in Terra (5 minutes)
Hands-on tutorial
For hands-on practice with a data model and data tables in Terra, please go through parts 1 and 2 of the Terra Data Tables Quickstart tutorial (estimated time 30-40 minutes).