Step 2 - Set Up a Data Model
As you get started, we recommend you review the AnVIL Data Model Data Dictionary here to learn more about the general structure (which we will describe in more detail below). You’ll coordinate with the AnVIL data ingest team to facilitate submission activities.
If your dataset has been accepted by AnVIL and does not easily fit into an existing template, please reach out to the AnVIL Team at help@lists.anvilproject.org.
You’ll end this step with a better understanding of what your data dictionary will include for your data model. Step 3 will instruct you on preparing your data according to your data model for submission to AnVIL.
2.1 - First Steps
Definitions
Data Model: An implementable definition of a data representation. A data model should describe the entities, attributes, and relationships, and how these are represented, including format and content. Successful models facilitate clear communication of the data and cover the majority of the data in a well-specified domain.
Data Dictionary: An unambiguous description of a specific table of data. A data dictionary should define what the fields are, what values they can have, and any constraints. Data dictionaries must allow for data to be validated for conformance. Data dictionaries:
- Must include descriptions for each column for humans to understand the data
- Must include data types and coded terms for computers to understand the data
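As a minimal sketch of what "validated for conformance" can look like in practice, the Python below represents a data dictionary for one table as a list of field definitions and checks a single row against it. The `FieldSpec` structure, the field names, and the allowed values are hypothetical illustrations, not the AnVIL Data Dictionary format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldSpec:
    """One data dictionary entry (illustrative structure, not an AnVIL format)."""
    name: str
    description: str                            # for humans
    dtype: type                                 # for computers
    required: bool = False
    allowed_values: Optional[frozenset] = None  # coded terms, if any


# Hypothetical dictionary for a small sample table.
SAMPLE_DICTIONARY = [
    FieldSpec("sample_id", "Unique identifier for the sample", str, required=True),
    FieldSpec("tissue", "Tissue the sample was taken from", str,
              allowed_values=frozenset({"blood", "saliva"})),
    FieldSpec("age_at_collection", "Donor age in years at collection", int),
]


def validate_row(row: dict, dictionary: list) -> list:
    """Return a list of human-readable conformance problems for one row."""
    problems = []
    known = {spec.name for spec in dictionary}
    # Columns that the dictionary does not describe cannot be validated.
    for column in row:
        if column not in known:
            problems.append(f"{column}: not defined in the data dictionary")
    for spec in dictionary:
        value = row.get(spec.name)
        if value is None:
            if spec.required:
                problems.append(f"{spec.name}: required value is missing")
            continue
        if not isinstance(value, spec.dtype):
            problems.append(f"{spec.name}: expected {spec.dtype.__name__}")
        elif spec.allowed_values is not None and value not in spec.allowed_values:
            problems.append(f"{spec.name}: {value!r} is not an allowed value")
    return problems
```

The descriptions serve human readers, while the types and allowed values are what make automated checks like this one possible.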
Why a Data Model?
There are often inconsistencies when data is shared, including which ontologies are used, how the data is formatted, and which data types are used. Established data models provide clear and well-specified standards, formatting, terminologies, and ontologies. This allows for easy exchange and use. The purpose of using a data model when submitting to AnVIL is to maximize the usability of the data. Through using a data model, AnVIL aims to have research data that are Findable, Accessible, Interoperable, and Reusable (FAIR).
Coordinate with the AnVIL Data Ingest Team
To collaborate with AnVIL on uploading data, notify AnVIL that your dbGaP registration is complete in one of two ways:
- Through an already open AnVIL Zendesk ticket
- By contacting AnVIL support at anvil-data@broadinstitute.org
It is useful to coordinate with AnVIL before you start to set up your data model, in case you run into questions or problems.
2.2 - Choose Your Data Model
A data model explicitly determines the structure of data. It organizes data elements and standardizes how the data elements relate to one another. Data models should describe data with rich metadata that can be registered or indexed.
Data Submitters can submit their data in one of these supported models, or use their own, as long as the minimal required data elements are included:
- GREGoR Data Model: For use by groups that can provide powerful structured data, have robust omics data, or have performed family or rare disease studies.
- OMOP Common Data Model: For use by groups providing longitudinal Electronic Health Record data.
- AnVIL Findability Subset Data Model: For use by submitters without a data model who are providing the minimum required data for submission.
Please find more information about the AnVIL Data Model and OMOP below!
2.2.1 - The AnVIL Findability Subset Data Model
We encourage you to read the README and AnVIL Table Overview tabs in the AnVIL Findability Subset Data Model data dictionary as you begin creating your Data Model.
These tabs include helpful information such as
- The overall purpose of the data dictionary
- Information on the AnVIL Data Model (findability subset)
- Details on how to expand the data model to fit your needs
- Expectations on how to submit your schema
Data Model Requirements
The linked Data Dictionary provides context for mapping data to the AnVIL Findability Subset Data Model. It includes information on the tables to be included, which concepts should be included in each table, and suggested coding systems for each. The Data Dictionary is a learning resource to assist users with creating their own AnVIL Data Model for submission.
Please read the descriptions for each table outlined in the Data Dictionary carefully, and reach out to your AnVIL team contact with questions not addressed in the data dictionary document. These data model requirements help ensure AnVIL datasets are not only useful to the researchers who created them but also enable others to analyze data collectively across studies in the AnVIL Terra platform.
If you decide to submit data using the AnVIL Findability Subset Data Model, we recommend starting by thinking of what data you have and how it is already organized, and how it may fit or need to be reorganized to fit the requirements.
Core Tables for All Studies
At a high level, there are three core tables (“entities”) in the AnVIL Findability Subset Data Model: BioSample, Donor, and File.
The BioSample table is the only required table, although Donor and File tables are strongly encouraged to improve the usability and findability of the data.
The specifications for each table, including required fields, strongly recommended fields, field names, and field descriptions, are included in the Data Dictionary. Brief descriptions of these core tables, with notable requirements, are below.
Note that data models for submitted datasets are not restricted to what is included in the AnVIL Findability Subset Data Model as long as the minimal required data elements are included.
- BioSample (required): Contains information about the sample(s) included in the study. Example data types include Anatomical Site from which the biosample was taken and the cell type.
  - The biosample_id (first column) is required.
  - If a File table will not be submitted, the BioSample table must contain a column indicating which samples correspond to which files.
- Donor (strongly recommended): Contains demographic and phenotypic information about the donor. Example data types include phenotypic sex, reported ethnicity, and genetic ancestry.
  - The donor_id (first column) is required.
- File (strongly recommended): Contains information for files associated with the study. Example data includes DRS ID, filename, and file type.
  - The file_id (first column) is required.
  - The file_size and file_md5sum columns are required to support file validation.
  - It is strongly recommended that the table includes a biosample_id column to link the biosample id between tables.
  - AnVIL will add DRS URIs for object files as part of the ingest process.
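The File table requirements above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `file_name` column name, the use of bytes for `file_size`, and the TSV output format are guesses made for the example, so check the Data Dictionary for the actual field names and formats.

```python
import csv
import hashlib
import io
from pathlib import Path


def describe_file(path: Path, file_id: str, biosample_id: str) -> dict:
    """Build one File-table row with the columns named above."""
    md5 = hashlib.md5()
    with path.open("rb") as handle:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            md5.update(chunk)
    return {
        "file_id": file_id,                # identifier column comes first
        "file_name": path.name,            # hypothetical column name
        "file_size": path.stat().st_size,  # assumed to be a size in bytes
        "file_md5sum": md5.hexdigest(),
        "biosample_id": biosample_id,      # links back to the BioSample table
    }


def write_file_table(rows: list) -> str:
    """Serialize rows as TSV text, keeping file_id as the first column."""
    columns = ["file_id", "file_name", "file_size", "file_md5sum", "biosample_id"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

No DRS URI column is produced here because, as noted above, AnVIL adds DRS URIs during ingest.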
Optional Tables
The Data Dictionary includes other optional tables you can use to contain data about conditions, activity, and your project. Brief descriptions of these optional tables, with notable requirements if they are used, are below.
- Condition Table: Contains information about condition(s) and phenotype(s) associated with a donor.
  - The condition_id (first column) is required.
  - It is strongly recommended that the table includes a donor_id column to link the donor id between tables.
- Activity Table: Contains details on different types of activities used to generate or process data. Example data includes sequencing method, reference assembly, and assay type.
  - The activity_id (first column) is required.
  - Should include a file_id column that references the associated file.
- Project Table: Contains information about the project the study is a part of. Example data includes funding and principal investigator. It is strongly recommended that the table includes a title field.
  - The project_id (first column) is required.
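Because the optional tables link back to other tables through identifier columns, it can help to check those links before submitting. The sketch below is one possible way to do that in Python. The helper `find_broken_links` and the example rows are hypothetical; only the identifier column names come from the descriptions above.

```python
def find_broken_links(child_rows, link_column, parent_rows, parent_id_column):
    """Return child rows whose link value does not match any parent identifier."""
    parent_ids = {row[parent_id_column] for row in parent_rows}
    return [row for row in child_rows if row.get(link_column) not in parent_ids]


# Hypothetical example rows, reduced to the identifier columns.
donors = [{"donor_id": "D1"}, {"donor_id": "D2"}]
conditions = [
    {"condition_id": "C1", "donor_id": "D1"},
    {"condition_id": "C2", "donor_id": "D9"},  # D9 is not in the Donor table
]
files = [{"file_id": "F1"}]
activities = [{"activity_id": "A1", "file_id": "F1"}]

# Condition rows should point at an existing donor_id.
orphan_conditions = find_broken_links(conditions, "donor_id", donors, "donor_id")
# Activity rows should point at an existing file_id.
orphan_activities = find_broken_links(activities, "file_id", files, "file_id")
```

The same helper could be applied to the File table's biosample_id column against the BioSample table.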
Non-standard Data Tables
If your dataset has been accepted by AnVIL and has needs not described here, please refer to the README on “Expanding the data model” for the AnVIL Data Model. Additional tables must be associated with one of the core tables: BioSample, Donor, or File. If you need further assistance, please reach out to the AnVIL Team at help@lists.anvilproject.org. You can work directly with the AnVIL Phenotype WG and Data Ingest WG to integrate into one of these existing data models.
Suggested Proprietary Coding Systems
| Table | Concept(s) | Coding systems |
|---|---|---|
| Donor | Organism type label, Organism type coding system | NCI Organismal Classification |
| Donor | Donor type | Donor type values |
| Donor | Phenotype sex code | Phenotype sex coding system |
| Donor | Reported ethnicity label, Genetic ancestry label | Suggested Ancestry Values |
| Donor | Reported ethnicity coding system, Genetic ancestry coding system | Genetic ancestry coding system |
| BioSample | Anatomical site label | Anatomical Site Allowed Values |
| BioSample | Anatomical site coding system | Anatomical site coding system |
| BioSample | A priori cell type label | A priori cell type suggested values |
| BioSample | A priori cell type coding system | A priori cell type |
| BioSample | BioSample type label | BioSample type values |
| BioSample | BioSample type coding system | BioCore Terms |
| BioSample | Primary condition label | Suggested primary condition label values |
| BioSample | Primary condition coding system | |
| BioSample | Primary condition affected status | Boolean true/false |
| Activity | Activity type label | Activity type values |
| Activity | Activity type coding system | BioCore Values |
2.2.2 - OMOP Common Data Model
To support FAIR data management on AnVIL, Data Submitters who have longitudinal clinical data are encouraged to submit their data in the OMOP CDM instead of the AnVIL Data Model. Submitters may read more about the OMOP CDM and the wider OHDSI initiative here: https://ohdsi.github.io/CommonDataModel/background.html
Submitters are encouraged to review publicly available OMOP specs and choose a CDM version that best fits their data needs:
- OMOP 5.3: https://ohdsi.github.io/CommonDataModel/cdm53.html
- OMOP 5.4: https://ohdsi.github.io/CommonDataModel/cdm54.html
For differences between OMOP 5.3 and 5.4, submitters may refer to this link: https://ohdsi.github.io/CommonDataModel/cdm54Changes.html
Submitting in OMOP
Please submit your OMOP tables in your elected CDM version following these requirements:
- Copy and complete the [Template] OMOP Submission Form for your data
- Submit your OMOP tables as tsv files
- Provide these minimum required OMOP Domain tables: Person (please do not include PII in your submission), Death, Visit_Occurrence, Visit_Detail, Condition_Occurrence, Drug_Exposure, Procedure_Occurrence, Device_Exposure, Measurement, Observation, Specimen
  - If there is no data available for a specific required domain table, do not include the table in your submission and indicate it is “Not Available” in the manifest
- Name your OMOP table files according to the domain names
  - Example: person.tsv; observation.tsv
- Due to licensing constraints, the OMOP vocabulary cannot be redistributed and must be excluded from your submission
  - However, you may have local codes that you are using to represent your data!
  - Please provide your local codes in the vocabulary, concept, concept_relationship, and concept_ancestor tables, following the OMOP specifications for the CDM version you indicated, and exclude the OMOP vocabulary itself from your submission
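Before submitting, the file-naming and required-table rules above can be checked with a short script. The following Python is a sketch under assumptions: it treats every file name as the lowercased domain name plus `.tsv` (extrapolated from the person.tsv and observation.tsv examples), and the `"Present"` label and the function name are hypothetical. It does not replace the manifest or the submission form.

```python
from pathlib import Path

# Minimum required domain tables listed above, lowercased to match file
# names such as person.tsv and observation.tsv.
REQUIRED_DOMAINS = [
    "person", "death", "visit_occurrence", "visit_detail",
    "condition_occurrence", "drug_exposure", "procedure_occurrence",
    "device_exposure", "measurement", "observation", "specimen",
]


def check_omop_submission(directory: Path) -> dict:
    """Map each required domain to 'Present' or 'Not Available'."""
    present = {p.stem.lower() for p in directory.glob("*.tsv")}
    return {
        domain: "Present" if domain in present else "Not Available"
        for domain in REQUIRED_DOMAINS
    }
```

Domains reported as "Not Available" are the ones to mark that way in the manifest, provided the data truly does not exist rather than the file having been misnamed.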
We will validate your OMOP dataset using your version number at submission. As a part of your submission process, we will return a data quality report to you. This report will include information about adherence to OMOP standard specs and contain basic data characterization. We will also make the final report available to research users.
2.3 - Generate Your Data Dictionary
Data Dictionaries are vital for AnVIL operational and AnVIL scientific consumers to understand and utilize the data being submitted. To support consistent ingest, storage, and downstream usage, AnVIL requires that all data submissions include a Data Dictionary.
For a template Data Dictionary, click here.
Data Dictionary Validation
Data Dictionaries provided to AnVIL will be validated for completeness, for clarity of information, and for alignment with the data included in the data submission. An ideal data dictionary has an entry for every relevant field in the data dictionary template, describes those fields precisely and clearly enough that even readers wholly unaffiliated with the study can understand their purpose, and reflects the actual state of the data included with the submission.
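Submitters may want to run a similar alignment check themselves before sending a dictionary. The Python below is a rough sketch, not AnVIL's validation process: it assumes the data table is TSV text and that each dictionary entry is a dict with hypothetical `field` and `description` keys.

```python
import csv
import io


def compare_dictionary_to_data(dictionary_entries, data_tsv):
    """Compare documented fields with the header row of a TSV data table."""
    reader = csv.reader(io.StringIO(data_tsv), delimiter="\t")
    header = next(reader)
    documented = [entry["field"] for entry in dictionary_entries]
    return {
        # Columns present in the data but not described anywhere.
        "undocumented_columns": [c for c in header if c not in documented],
        # Fields described in the dictionary but absent from the data.
        "documented_but_absent": [f for f in documented if f not in header],
        # Entries whose description is empty or whitespace only.
        "missing_descriptions": [
            e["field"] for e in dictionary_entries
            if not e.get("description", "").strip()
        ],
    }
```

An automated check like this can confirm completeness and alignment, but whether a description is clear to someone outside the study still needs a human reader.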