Step 4: Stage Your Data in AnVIL
You’ll work with a designated point of contact (POC) on the AnVIL team to shepherd the data (omics data, image files, and TSV load files) into the deposit workspace and, ultimately, the AnVIL data storage repository. Because each engagement will most likely be different, we will continue to develop and refine these processes as we engage with submitters.
Until the process is finalized, all transfers should involve the Data Processing WG and the AnVIL team to ensure data integrity during the transfer.
Process overview
- Log in to AnVIL: You can use either a Google or Microsoft ID for SSO to access your assigned data deposit workspace on anvil.terra.bio.
- Set up your workspace cloud storage: To facilitate ingestion into TDR, the workspace cloud storage must have a particular directory structure.
- Upload data: Finally, you'll upload unstructured data files (omics files, images, etc.) to the data_files folder or one of its sub-folders.
- Validation (done by AnVIL ingestion team)
Step-by-Step Instructions
For details, see How to stage data in your AnVIL deposit workspace.
Next steps (done by ingestion team)
After you stage the data in the deposit workspace, the AnVIL team will perform these pre-ingestion operations.
Validation (automated)
These tests will be executed once data has been ported to AnVIL.
- QC check of submission form to genomic object files: Check that the number of files matches the number in the submission Google form.
- QC check of phenotype and metadata: Make sure the phenotype file data fields match the defined data model, and that sample IDs are consistent with the phenotype data and linked to a subject and consent.
- Ingestion Validation (automated): To confirm that the ingested data transferred as expected and that file integrity is maintained, Google automatically checks the MD5 checksum of the destination file against the original after each file transfer.
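If you want to run similar checks on your side before staging, the validation steps above can be approximated locally. The following is a minimal sketch, not the AnVIL team's actual tooling: the function names and the column names (`sample_id`, `subject_id`, `consent_code`) are hypothetical and should be replaced with the fields from your data model. The checksum is base64-encoded because that is the form in which Google Cloud Storage typically reports MD5 hashes in object metadata.

```python
import base64
import csv
import hashlib
import io
from pathlib import Path


def md5_base64(path, chunk_size=1024 * 1024):
    """Return the base64-encoded MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")


def file_count_matches(staged_dir, expected_count):
    """Check that the number of files under staged_dir equals the number
    declared in the submission form (expected_count)."""
    staged = [p for p in Path(staged_dir).rglob("*") if p.is_file()]
    return len(staged) == expected_count


def unlinked_samples(tsv_text, sample_col="sample_id",
                     subject_col="subject_id", consent_col="consent_code"):
    """Return sample IDs whose row lacks a subject or a consent value.

    Column names are assumptions for illustration only.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row[sample_col]
        for row in reader
        if not row.get(subject_col) or not row.get(consent_col)
    ]
```

Recording the output of `md5_base64` for each file before upload gives you a value to compare against the checksum reported for the staged copy, and an empty list from `unlinked_samples` suggests every sample row carries both a subject and a consent value.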
Data Indexing (genomic object files)
Once in the workspace buckets, object files, such as sequencing data, are indexed with a globally unique identifier (GUID). This allows access across AnVIL tools without requiring copies to be created and transferred across environments.
These identifiers make it possible to track data across the components of AnVIL and, because they are extensible, make it easier to interoperate with other data commons. They also enable tracking of live data processed in workflow pipelines and of data backed up to cold storage.