Step 2 - Set Up a Data Model
As you get started, we recommend you review the AnVIL Data Model Data Dictionary here to learn more about the general structure (which we will describe in more detail below). You’ll coordinate with the AnVIL data ingest team to facilitate submission activities.
If your dataset has been accepted by AnVIL and does not easily fit into an existing template, please reach out to the AnVIL Team at help@lists.anvilproject.org.
You’ll end this step with a better understanding of what your data dictionary will include for your data model. Step 3 will instruct you on preparing your data according to your data model for submission to AnVIL.
2.1 - First Steps
Coordinate with the AnVIL Data Ingest Team
To collaborate with AnVIL on uploading data, you’ll notify AnVIL of your completed dbGaP registration in one of two ways:
- Through an already open AnVIL Zendesk ticket
- By contacting AnVIL support at anvil-data@broadinstitute.org
It is useful to coordinate with AnVIL before you start to set up your data model, in case you run into questions or problems.
2.2 - Create Your AnVIL Data Model
A data model explicitly determines the structure of data. It organizes data elements and standardizes how the data elements relate to one another.
We encourage you to read the README and AnVIL Table Overview tabs in the AnVIL Data Model data dictionary as you begin creating your AnVIL Data Model.
These tabs include helpful information such as:
- The overall purpose of the data dictionary
- Information on the AnVIL Data Model (findability subset)
- Details on how to expand the data model to fit your needs
- Expectations on how to submit your schema
Data Model Requirements
The linked Data Dictionary provides context for mapping data to the AnVIL core findability model. It includes information on the tables to be included, which concepts should be included in each table, and suggested coding systems for each. The Data Dictionary is a learning resource to assist users with creating their own AnVIL Data Model for submission.
Please read the descriptions for each table outlined in the Data Dictionary carefully, and reach out to your AnVIL team contact with questions not addressed in the data dictionary document. These data model requirements help ensure AnVIL datasets are not only useful to the researchers who created them but also enable others to analyze data collectively across studies in the AnVIL Terra platform.
Start by considering what data you have, how it is currently organized, and whether it fits the requirements as is or needs to be reorganized.
Core Tables for All Studies
At a high level, there are three core tables (“entities”) in the AnVIL Data Model: BioSample, Donor, and File.
The BioSample table is the only required table, although Donor and File tables are strongly encouraged to improve the usability and findability of the data.
The specifications for each table, including required fields, strongly recommended fields, field names, and field descriptions, are included in the Data Dictionary. Brief descriptions of these core tables, with notable requirements, are below.
- BioSample (required): Contains information about the sample(s) included in the study. Example data types include Anatomical Site from which the biosample was taken and the cell type.
  - The biosample_id (first column) is required.
  - If a File table will not be submitted, the BioSample table must contain a column indicating which samples correspond to which files.
- Donor (strongly recommended): Contains demographic and phenotypic information about the donor. Example data types include phenotypic sex, reported ethnicity, and genetic ancestry.
  - The donor_id (first column) is required.
- File (strongly recommended): Contains information for files associated with the study. Example data includes DRS ID, filename, and file type.
  - The file_id (first column) is required.
  - It is strongly recommended that the table includes a BioSample.biosample_id column to link the biosample id between tables.
  - AnVIL will add DRS URIs for object files as part of the ingest process.
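As a rough illustration of the requirements above, the following Python sketch checks that a core table has its ID column first and that every File row points at a known biosample. The example rows, the non-ID columns (anatomical_site, file_name), and the helper names are hypothetical; the Data Dictionary remains the authority on actual field names.

```python
# Required first column for each core table, as listed above.
REQUIRED_FIRST_COLUMN = {
    "BioSample": "biosample_id",
    "Donor": "donor_id",
    "File": "file_id",
}


def check_first_column(table_name, header):
    """Return True if the table's first column is its required ID column."""
    return bool(header) and header[0] == REQUIRED_FIRST_COLUMN[table_name]


def unlinked_files(file_rows, biosample_rows):
    """Return file_ids whose BioSample.biosample_id is missing from BioSample."""
    known = {row["biosample_id"] for row in biosample_rows}
    return [
        row["file_id"]
        for row in file_rows
        if row.get("BioSample.biosample_id") not in known
    ]


# Hypothetical example rows (not real AnVIL data).
biosample_rows = [
    {"biosample_id": "BS1", "anatomical_site": "blood"},
    {"biosample_id": "BS2", "anatomical_site": "liver"},
]
file_rows = [
    {"file_id": "F1", "BioSample.biosample_id": "BS1", "file_name": "s1.cram"},
    {"file_id": "F2", "BioSample.biosample_id": "BS9", "file_name": "s9.cram"},
]

header_ok = check_first_column("File", list(file_rows[0].keys()))
missing = unlinked_files(file_rows, biosample_rows)
print(header_ok, missing)
```

Running a check like this before submission can surface files that would otherwise be orphaned from their biosamples.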
Optional Tables
The AnVIL Data Model Data Dictionary includes other optional tables you can use to contain data about conditions, activity, and your project. Brief descriptions of these optional tables, with notable requirements, are below.
- Condition Table: Contains information about condition(s) and phenotypes associated with a donor.
  - The condition_id (first column) is required.
  - It is strongly recommended that the table includes a donor_id column to link the donor id between tables.
- Activity Table: Contains details on different types of activities used to generate or process data. Example data includes sequencing method, reference assembly, and assay type.
  - The activity_id (first column) is required.
  - The table should include a file_id column that references the associated file.
- Project Table: Contains information about the project the study is a part of. Example data includes funding and principal investigator. It is strongly recommended that the table includes a title field.
  - The project_id (first column) is required.
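The linking columns described above can be checked with a small script. This is a minimal sketch, assuming each table is loaded as a list of row dictionaries; the LINKS mapping, the function name, and the example rows are hypothetical and only mirror the donor_id and file_id links mentioned in this section.

```python
# (optional table, linking column) -> (core table, core table ID column)
LINKS = {
    ("Condition", "donor_id"): ("Donor", "donor_id"),
    ("Activity", "file_id"): ("File", "file_id"),
}


def dangling_references(tables):
    """List (table, column, value) for links that point at no existing row."""
    problems = []
    for (child, column), (parent, parent_id) in LINKS.items():
        if child not in tables:
            continue  # optional table was not submitted
        parent_ids = {row[parent_id] for row in tables.get(parent, [])}
        for row in tables[child]:
            if row.get(column) not in parent_ids:
                problems.append((child, column, row.get(column)))
    return problems


# Hypothetical example tables (not real AnVIL data).
tables = {
    "Donor": [{"donor_id": "D1"}],
    "File": [{"file_id": "F1"}],
    "Condition": [
        {"condition_id": "C1", "donor_id": "D1"},
        {"condition_id": "C2", "donor_id": "D7"},
    ],
    "Activity": [{"activity_id": "A1", "file_id": "F1"}],
}

problems = dangling_references(tables)
print(problems)
```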
Non-standard Data Models
If your dataset has been accepted by AnVIL and has needs not described here, please refer to the README on “Expanding the data model”. Additional tables must be associated with one of the core tables: BioSample, Donor, or File. If you need further assistance, please reach out to the AnVIL Team at help@lists.anvilproject.org. You can work directly with the AnVIL Phenotype WG and Data Ingest WG to integrate into one of these existing data models.
2.3 - Generate Your Data Dictionary
All AnVIL studies must submit a Data Dictionary table (spreadsheet file) that defines the study's complete data model. With a separate tab for each table in the data model, it lists field names, field descriptions, field types, examples, enumeration values (where applicable), and multi-value delimiter symbols used (where applicable).
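To make the shape of one Data Dictionary tab concrete, here is a minimal Python sketch that renders field definitions for a single table as CSV text. The column labels, the example fields, and the "|" delimiter are assumptions for illustration only; the official template defines the actual layout and should be used for submission.

```python
import csv
import io

# Hypothetical column labels mirroring the items listed above.
DICTIONARY_COLUMNS = [
    "field_name",
    "field_description",
    "field_type",
    "example",
    "enumeration_values",
    "multi_value_delimiter",
]


def dictionary_tab(fields):
    """Render one table's field definitions as CSV text (one 'tab')."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=DICTIONARY_COLUMNS,
        restval="",  # leave non-applicable cells empty
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(fields)
    return buffer.getvalue()


# Hypothetical field definitions for a BioSample tab.
biosample_fields = [
    {
        "field_name": "biosample_id",
        "field_description": "Unique identifier for the biosample",
        "field_type": "string",
        "example": "BS1",
    },
    {
        "field_name": "anatomical_site",
        "field_description": "Site the biosample was taken from",
        "field_type": "string",
        "example": "blood",
        "enumeration_values": "blood|liver",
        "multi_value_delimiter": "|",
    },
]

tab_text = dictionary_tab(biosample_fields)
print(tab_text)
```

Keeping the definitions as structured rows makes it straightforward to produce one tab per table and to spot fields that lack a description or type.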
For a template Data Dictionary with all required and suggested tables, click here. To download the AnVILDataSubmissionFindabilitySubsetSchema.template.xlsx file, click on the three-dot icon at the top right and then click Download.
Suggested Proprietary Coding Systems
Table | Concept(s) | Coding systems |
---|---|---|
Donor | Organism type label, Organism type coding system | NCI Organismal Classification |
Donor | Donor type | Donor type values |
Donor | Phenotype sex code | Phenotype sex coding system |
Donor | Reported ethnicity label, Genetic ancestry label | Suggested Ancestry Values |
Donor | Reported ethnicity coding system, Genetic ancestry coding system | Genetic ancestry coding system |
BioSample | Anatomical site label | Anatomical Site Allowed Values |
BioSample | Anatomical site coding system | Anatomical site coding system |
BioSample | A priori cell type label | A priori cell type suggested values |
BioSample | A priori cell type coding system | A priori cell type |
BioSample | BioSample type label | BioSample type values |
BioSample | BioSample type coding system | BioCore Terms |
BioSample | Primary condition label | Suggested primary condition label values |
BioSample | Primary condition coding system | |
BioSample | Primary condition affected status | Boolean true/false |
Activity | Activity type label | Activity type values |
Activity | Activity type coding system | BioCore Values |
Additional Resources
Contact information
AnVIL Data Ingest Team anvil-data@broadinstitute.org
AnVIL Help Team help@lists.anvilproject.org