Step 2 - Set Up a Data Model


As you get started, we recommend you review the AnVIL Data Model Data Dictionary here to learn more about the general structure (which we will describe in more detail below). You’ll coordinate with the AnVIL data ingest team to facilitate submission activities.

If your dataset has been accepted by AnVIL and does not easily fit into an existing template, please reach out to the AnVIL Team at help@lists.anvilproject.org.

You’ll end this step with a better understanding of what your data dictionary will include for your data model. Step 3 will instruct you on preparing your data according to your data model for submission to AnVIL.

2.1 - First Steps

Definitions

Data Model: An implementable definition of a data representation. A data model should describe the entities, attributes, and relationships, and how these are represented, including format and content. Successful models facilitate clear communication of the data and cover the majority of the data in a well-specified domain.

Data Dictionary: An unambiguous description of a specific table of data. A data dictionary should define what the fields are, what values they can have, and any constraints. Data dictionaries must allow for data to be validated for conformance. Data dictionaries:

  • Must include descriptions for each column for humans to understand the data
  • Must include data types and coded terms for computers to understand the data
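
As a rough illustration of what "validated for conformance" can mean in practice, the sketch below represents a data dictionary as a list of field definitions and checks one row against it. The `FieldSpec` structure, the `phenotypic_sex` column name, and the allowed values are assumptions made for this example, not part of any AnVIL specification.

```python
from dataclasses import dataclass


@dataclass
class FieldSpec:
    """One data dictionary entry (hypothetical structure, for illustration)."""
    name: str
    description: str                         # for humans: what the column means
    dtype: type                              # for computers: expected data type
    required: bool = False
    allowed_values: frozenset = frozenset()  # coded terms; empty means unconstrained


def validate_row(row: dict, specs: list[FieldSpec]) -> list[str]:
    """Return a list of conformance problems for one row (empty if it conforms)."""
    problems = []
    for spec in specs:
        value = row.get(spec.name)
        if value is None or value == "":
            if spec.required:
                problems.append(f"{spec.name}: missing required value")
            continue
        if not isinstance(value, spec.dtype):
            problems.append(f"{spec.name}: expected {spec.dtype.__name__}")
        elif spec.allowed_values and value not in spec.allowed_values:
            problems.append(f"{spec.name}: {value!r} is not an allowed value")
    return problems


# Illustrative entries only; real fields come from the data dictionary you submit.
DONOR_SPECS = [
    FieldSpec("donor_id", "Unique identifier for the donor", str, required=True),
    FieldSpec("phenotypic_sex", "Sex as observed", str,
              allowed_values=frozenset({"Female", "Male", "Unknown"})),
]
```

Calling `validate_row({"phenotypic_sex": "F"}, DONOR_SPECS)` would report both the missing `donor_id` and the uncoded sex value, which is the kind of feedback a description plus a data type and coded terms makes possible.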

Why a Data Model?

Shared data is often inconsistent in the ontologies used, the formatting, and the data types. Established data models provide clear and well-specified standards, formatting, terminologies, and ontologies, which allows for easy exchange and use. The purpose of using a data model when submitting to AnVIL is to maximize the usability of the data. By using a data model, AnVIL aims to have research data that is Findable, Accessible, Interoperable, and Reusable (FAIR).

Coordinate with the AnVIL Data Ingest Team

To collaborate with AnVIL on uploading data, you’ll notify AnVIL of completed dbGaP registration in one of two ways.

It is useful to coordinate with AnVIL before you start to set up your data model, in case you run into questions or problems.

2.2 - Choose Your Data Model

A data model explicitly determines the structure of data. It organizes data elements and standardizes how the data elements relate to one another. Data models should describe data with rich metadata that can be registered or indexed.

Data Submitters can submit their data in one of the models listed below, or use their own model, as long as the minimal required data elements are included:

  • GREGoR Data Model: For use by groups that can provide powerful structured data, have robust omics data, or have performed family or rare disease studies.
  • OMOP Common Data Model: For use by groups providing longitudinal Electronic Health Record data.
  • AnVIL Findability Subset Data Model: For use by submitters without a data model who are providing the minimum required data for submission.

Please find more information about the AnVIL Data Model and OMOP below!

2.2.1 - The AnVIL Findability Subset Data Model

We encourage you to read the README and AnVIL Table Overview tabs in the AnVIL Findability Subset Data Model data dictionary as you begin creating your Data Model.

These tabs include helpful information such as

  • The overall purpose of the data dictionary
  • Information on the AnVIL Data Model (findability subset)
  • Details on how to expand the data model to fit your needs
  • Expectations on how to submit your schema

Data Model Requirements

The linked Data Dictionary provides context for mapping data to the AnVIL Findability Subset Data Model. It includes information on the tables to be included, which concepts should be included in each table, and suggested coding systems for each. The Data Dictionary is a learning resource to assist users with creating their own AnVIL Data Model for submission.

Please read the descriptions for each table outlined in the Data Dictionary carefully, and reach out to your AnVIL team contact with questions not addressed in the data dictionary document. These data model requirements help ensure AnVIL datasets are not only useful to the researchers who created them but also enable others to analyze data collectively across studies in the AnVIL Terra platform.

If you decide to submit data using the AnVIL Findability Subset Data Model, we recommend starting by taking stock of what data you have, how it is already organized, and whether it fits the requirements as is or needs to be reorganized.

Core Tables for All Studies

At a high level, there are three core tables (“entities”) in the AnVIL Findability Subset Data Model: BioSample, Donor, and File.

The BioSample table is the only required table, although Donor and File tables are strongly encouraged to improve the usability and findability of the data.

The specifications for each table, including required fields, strongly recommended fields, field names, and field descriptions, are included in the Data Dictionary. Brief descriptions of these core tables, with notable requirements, are below.

Note that data models for submitted datasets are not restricted to what is included in the AnVIL Findability Subset Data Model as long as the minimal required data elements are included.

  • BioSample (required): Contains information about the sample(s) included in the study. Example data types include Anatomical Site from which the biosample was taken and the cell type.
    • The biosample_id (first column) is required.
    • If a File table will not be submitted, the BioSample table must contain a column indicating which samples correspond to which files.
  • Donor (strongly recommended): Contains demographic and phenotypic information about the donor. Example data types include phenotypic sex, reported ethnicity, and genetic ancestry.
    • The donor_id (first column) is required.
  • File (strongly recommended): Contains information for files associated with the study. Example data includes DRS ID, filename, and file type.
    • The file_id (first column) is required.
    • The file_size and file_md5sum columns are required to support file validation.
    • It is strongly recommended that the table includes a biosample_id column to link the biosample id between tables.
    • AnVIL will add DRS URIs for object files as part of the ingest process.
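
The `file_size` and `file_md5sum` values can be computed locally before submission. The following is a minimal sketch, assuming `file_size` is expressed in bytes and using a hypothetical `file_name` column; check the Data Dictionary for the exact column names and units expected.

```python
import hashlib
from pathlib import Path


def file_table_row(path: Path, file_id: str, biosample_id: str) -> dict:
    """Build one File table row as a dict (illustrative helper, not an AnVIL tool)."""
    md5 = hashlib.md5()
    with path.open("rb") as handle:
        # Read in 1 MiB chunks so large files are never loaded fully into memory.
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            md5.update(chunk)
    return {
        "file_id": file_id,                # required, first column
        "file_name": path.name,            # assumed column name
        "file_size": path.stat().st_size,  # assumed to be in bytes
        "file_md5sum": md5.hexdigest(),    # hex digest used for file validation
        "biosample_id": biosample_id,      # links the file to its BioSample row
    }
```

The DRS URI is deliberately left out of the row, since AnVIL adds it during ingest.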

Optional Tables

The Data Dictionary includes other optional tables you can use to contain data about conditions, activity, and your project. Brief descriptions of these optional tables, with notable requirements if they are used, are below.

  • Condition Table: Contains information about condition(s) and phenotype(s) associated with a donor.
    • The condition_id (first column) is required.
    • It is strongly recommended that the table includes a donor_id column to link the donor id between tables.
  • Activity Table: Contains details on different types of activities used to generate or process data. Example data includes sequencing method, reference assembly, and assay type.
    • The activity_id (first column) is required.
    • Should include a file_id column that references the associated file.
  • Project Table: Contains information about the project the study is a part of. Example data includes funding and principal investigator. It is strongly recommended that the table includes a title field.
    • The project_id (first column) is required.
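
Before submitting, it can help to confirm that each table leads with its ID column and that linking columns such as `donor_id` or `file_id` point at rows that exist. The sketch below is one way to do this with tables held as lists of dicts; the helper names are hypothetical, and the uniqueness check on IDs is an assumption of this example rather than a stated AnVIL rule.

```python
def check_table(rows: list[dict], id_column: str) -> list[str]:
    """Check that the ID column comes first and holds non-empty, unique values."""
    problems = []
    seen = set()
    for index, row in enumerate(rows):
        columns = list(row)  # dicts preserve column order
        if not columns or columns[0] != id_column:
            problems.append(f"row {index}: first column must be {id_column}")
            continue
        value = row[id_column]
        if not value:
            problems.append(f"row {index}: empty {id_column}")
        elif value in seen:
            # Uniqueness is assumed here; it is not stated in the guide.
            problems.append(f"row {index}: duplicate {id_column} {value!r}")
        seen.add(value)
    return problems


def unmatched_links(child_rows: list[dict], link_column: str,
                    parent_rows: list[dict], parent_id_column: str) -> list:
    """Return link values in the child table that match no parent ID."""
    parent_ids = {row[parent_id_column] for row in parent_rows}
    return [row.get(link_column) for row in child_rows
            if row.get(link_column) not in parent_ids]
```

For example, `unmatched_links(conditions, "donor_id", donors, "donor_id")` lists the Condition rows whose `donor_id` has no counterpart in the Donor table.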

Non-standard Data Tables

If your dataset has been accepted by AnVIL and has needs not described here, please refer to the README section on “Expanding the data model” for the AnVIL Data Model. Additional tables must be associated with one of the core tables: BioSample, Donor, or File. If you need further assistance, please reach out to the AnVIL Team at help@lists.anvilproject.org. You can work directly with the AnVIL Phenotype WG and Data Ingest WG to integrate your data into one of these existing data models.

Suggested Proprietary Coding Systems

Table | Concept(s) | Coding systems
Donor | Organism type label, Organism type coding system | NCI Organismal Classification
Donor | Donor type | Donor type values
Donor | Phenotype sex code | Phenotype sex coding system
Donor | Reported ethnicity label, Genetic ancestry label | Suggested Ancestry Values
Donor | Reported ethnicity coding system, Genetic ancestry coding system | Genetic ancestry coding system
BioSample | Anatomical site label | Anatomical Site Allowed Values
BioSample | Anatomical site coding system | Anatomical site coding system
BioSample | A priori cell type label | A priori cell type suggested values
BioSample | A priori cell type coding system | A priori cell type
BioSample | BioSample type label | BioSample type values
BioSample | BioSample type coding system | BioCore Terms
BioSample | Primary condition label | Suggested primary condition label values
BioSample | Primary condition coding system |
BioSample | Primary condition affected status | Boolean true/false
Activity | Activity type label | Activity type values
Activity | Activity type coding system | BioCore Values

2.2.2 - OMOP Common Data Model

To support FAIR data management on AnVIL, Data Submitters who have longitudinal clinical data are encouraged to submit their data in the OMOP CDM instead of the AnVIL Data Model. Submitters may read more about the OMOP CDM and the wider OHDSI initiative here: https://ohdsi.github.io/CommonDataModel/background.html

Submitters are encouraged to review publicly available OMOP specs and choose a CDM version that best fits their data needs:

For differences between OMOP 5.3 and 5.4, submitters may refer to this link: https://ohdsi.github.io/CommonDataModel/cdm54Changes.html

Submitting in OMOP

Please submit your OMOP tables in your elected CDM version following these requirements:

  • Copy and complete the [Template] OMOP Submission Form for your data
  • Submit your OMOP tables as tsv files
    • Provide these minimum required OMOP Domain tables: Person (please do not include PII in your submission), Death, Visit_Occurrence, Visit_Detail, Condition_Occurrence, Drug_Exposure, Procedure_Occurrence, Device_Exposure, Measurement, Observation, Specimen
      • If there is no data available for a specific, required domain table, do not include the table in your submission and indicate it is “Not Available” in the manifest
      • Name your OMOP table files according to the domain names
        • Example: person.tsv; observation.tsv
    • Due to licensing constraints, the OMOP vocabulary cannot be redistributed and must be excluded from your submission
      • However, you may have local codes that you are using to represent your data!
      • Please provide your local codes in the vocabulary, concept, concept_relationship, and concept_ancestor tables, following the OMOP specifications for the CDM version you indicated, and exclude the OMOP vocabulary from your submission
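
A quick local check of file naming and domain coverage can catch problems before submission. The sketch below assumes lowercase file names matching the domain names (as in the person.tsv example) and simply reports which required domains would need to be marked “Not Available”; the `domain_status` helper is hypothetical, and the actual manifest layout is defined by the submission form, not by this code.

```python
from pathlib import Path

# Minimum required OMOP domain tables, as listed in the requirements above.
REQUIRED_DOMAINS = [
    "person", "death", "visit_occurrence", "visit_detail",
    "condition_occurrence", "drug_exposure", "procedure_occurrence",
    "device_exposure", "measurement", "observation", "specimen",
]


def domain_status(submission_dir: Path) -> dict[str, str]:
    """Map each required domain to its TSV file name, or "Not Available"."""
    present = {p.name for p in submission_dir.iterdir() if p.is_file()}
    return {
        domain: f"{domain}.tsv" if f"{domain}.tsv" in present else "Not Available"
        for domain in REQUIRED_DOMAINS
    }
```

Running `domain_status(Path("my_submission"))` on a folder that contains only person.tsv and observation.tsv would flag the other nine domains as “Not Available”.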

We will validate your OMOP dataset using your version number at submission. As a part of your submission process, we will return a data quality report to you. This report will include information about adherence to OMOP standard specs and contain basic data characterization. We will also make the final report available to research users.

2.3 - Generate Your Data Dictionary

Data Dictionaries are vital for AnVIL operational and AnVIL scientific consumers to understand and utilize the data being submitted. To support consistent ingest, storage, and downstream usage, AnVIL requires that all data submissions include a Data Dictionary.

For a template Data Dictionary, click here.

Data Dictionary Validation

Data Dictionaries provided to AnVIL will be validated for completeness, for clarity of information, and for alignment with the data included in the submission. The ideal data dictionary meets three criteria: it has an entry for every relevant field in the data dictionary template; the descriptive elements of those fields are precise and clear enough that even those wholly unaffiliated with the study can understand their purpose; and the provided information reflects the actual state of the data included with the submission.
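
You can approximate the alignment part of this validation yourself before submitting. The following is a minimal sketch, not AnVIL's actual validator: it compares the header of a TSV table with a dictionary of field descriptions and reports mismatches. The `compare_columns` name and the example column names are assumptions.

```python
import csv
import io


def compare_columns(data_tsv: str, dictionary_fields: dict[str, str]) -> dict[str, list[str]]:
    """Compare the TSV header with data dictionary entries (name -> description)."""
    header = next(csv.reader(io.StringIO(data_tsv), delimiter="\t"))
    return {
        # Columns present in the data but absent from the dictionary.
        "undocumented": [c for c in header if c not in dictionary_fields],
        # Dictionary entries with no matching column in the data.
        "not_in_data": [f for f in dictionary_fields if f not in header],
        # Entries whose description is empty, so humans cannot interpret them.
        "missing_description": [f for f, d in dictionary_fields.items() if not d.strip()],
    }
```

An empty list under each key suggests the dictionary and the data agree on column names; whether the descriptions are clear enough for an outside reader still needs a human review.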

