NIH Cloud Platform Interoperability Effort
Helping to create a federated genomic data ecosystem
The NCPI was created as an outcome of the NIH Workshop on Cloud-Based Platforms Interoperability, held at RENCI on October 3-4, 2019, to facilitate interoperability among the genomic analysis platforms established by the NCI, NHGRI, NHLBI, and the NIH Common Fund.
The NCPI is a collaboration between NIH representatives, platform team members, and researchers running cross-platform research efforts to inform and validate the interoperability approaches.
The NCPI's participating platforms are:
AnVIL - The NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space, or AnVIL, is NHGRI's genomic data resource that leverages a cloud-based infrastructure for democratizing genomic data access, sharing, and computing across large genomic and genomic-related data sets.
BioData Catalyst - NHLBI BioData Catalyst is a cloud-based platform providing tools, applications, and workflows in secure workspaces. By increasing access to NHLBI datasets and innovative data analysis capabilities, BioData Catalyst accelerates efficient biomedical research that drives discovery and scientific advancement, leading to novel diagnostic tools, therapeutics, and prevention strategies for heart, lung, blood, and sleep disorders.
Cancer Research Data Commons - The goal of the National Cancer Institute’s Cancer Research Data Commons (CRDC) is to empower researchers to accelerate data-driven scientific discovery by connecting diverse datasets with analytical tools in the cloud. The CRDC is built upon an expandable data science infrastructure that provides secure access to many different data types across scientific domains via the Data Commons Framework.
Kids First Data Resource Center - The NIH Common Fund's Gabriella Miller Kids First Pediatric Research Program’s (“Kids First”) vision is to “alleviate suffering from childhood cancer and structural birth defects by fostering collaborative research to uncover the etiology of these diseases and by supporting data sharing within the pediatric research community.”
National Center for Biotechnology Information - The National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM) hosts and manages the Database of Genotypes and Phenotypes (dbGaP) and NIH’s Sequence Read Archive (SRA). dbGaP provides and manages access to protected data related to human studies that have investigated the interaction of genotype and phenotype. SRA is the largest archive for public and controlled-access next-generation sequencing data.
The NCPI has intentionally constrained the problems we are addressing to those achievable in the near term that can demonstrate value to researchers by enabling specific research projects.
We are currently focused on the deliverables listed below.
Generic Search Results Hand-off
We are working to establish a generic and universal hand-off mechanism so data portal users can further analyze search results on any analysis platform that supports the format.
This will allow data portals to develop and maintain a single “export mechanism” that is available to any analysis platform that invests in supporting the standard format. Importantly, this gives researchers greater freedom in how and where they compute.
By standardizing the hand-off of search results from portals to workspace environments, we will enable researchers to query multiple portals and aggregate their search results in a common cloud workspace of their choosing in order to perform an analysis.
For example, this will let a researcher search for Kids First and TOPMed data on their respective portals and then take the results to the Terra environment on either AnVIL or BioData Catalyst where they can perform a joint analysis on these data.
Currently, this simple scenario has limited or no support across portals and analysis workspaces, making this type of joint analysis impossible for most users.
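As a rough illustration of the aggregation step described above, the following minimal Python sketch merges search results exported by two portals into one list that a workspace could import. It is not the NCPI hand-off format: the manifest layout, the field names (`portal`, `results`, `drs_uri`, `name`), the example identifiers, and the `merge_handoffs` helper are all hypothetical and chosen only for illustration.

```python
import json


def merge_handoffs(*manifests):
    """Combine search-result manifests from several portals.

    Each manifest is assumed (hypothetically) to look like
    {"portal": str, "results": [{"drs_uri": str, "name": str}, ...]}.
    Records are de-duplicated by their "drs_uri"; the first portal
    that reported a record is kept as its "source_portal".
    """
    merged = {}
    for manifest in manifests:
        for record in manifest["results"]:
            if record["drs_uri"] not in merged:
                merged[record["drs_uri"]] = {
                    **record,
                    "source_portal": manifest["portal"],
                }
    return list(merged.values())


# Hypothetical exports from two portals; the identifiers are made up.
kids_first = {
    "portal": "Kids First",
    "results": [{"drs_uri": "drs://example.org/kf-001", "name": "sample1.cram"}],
}
topmed = {
    "portal": "TOPMed",
    "results": [
        {"drs_uri": "drs://example.org/tm-001", "name": "sample2.cram"},
        {"drs_uri": "drs://example.org/kf-001", "name": "sample1.cram"},
    ],
}

workspace_inputs = merge_handoffs(kids_first, topmed)
print(json.dumps(workspace_inputs, indent=2))
```

The point of the sketch is that once every portal exports the same structure, a workspace needs only one import routine, regardless of how many portals the researcher searched.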
Demo of Search Result Hand-off
A demonstration of accessing data from four participating platforms in a single computational workspace.
Single Sign-On Pilot with NIH RAS
In collaboration with the NIH CIT Researcher Auth Service (RAS) Initiative, we will pilot a single sign-on authentication/authorization workflow.
Cross-Platform Data Discovery
To make the most effective use of the data managed by NCPI platforms, users must be able to view, browse, and search datasets available across all resources. This will allow biomedical researchers to understand what data is already available, which in turn will allow for better experimental design of future studies and will prevent duplication of current and past efforts.
To allow such a "bird's eye view" of the data, we are building the NCPI Dataset Catalog. Initially, the catalog will use infrastructure built by the AnVIL project to generate an overview of available datasets.
Cross-Platform Research Efforts
There are currently six cross-platform research efforts:
- Three research efforts integrate data from BioData Catalyst and the Kids First DRC.
- Two research efforts integrate data from CRDC and AnVIL.
- One research effort integrates data across AnVIL, Kids First DRC, and BioData Catalyst.
For more information on the research efforts and their related use cases, please see the Research Use Cases section of the Systems Interoperation Working Group charter.
The NCPI has five working groups:
Community Governance Working Group - Working to establish a set of principles for promoting interoperability across multiple platforms to remove operational barriers to trans-platform data sharing and analysis.
Coordination Working Group - Coordinating discourse, collaboration, and meetings between the working groups of NCPI.
FHIR Working Group - Assessing the potential of FHIR resources to model and share complex clinical and phenotypic data.
Outreach and Training Working Group - Creating a public knowledge base with training materials and a cloud cost guide to educate researchers on the research use cases enabled by interoperable cloud-based data commons.
Systems Interoperation Working Group - Testing and implementing technical standards (e.g. GA4GH APIs) for data exchange and demonstrating their effectiveness in enabling key cross-platform research use cases.
The NCPI holds workshops every six months for working group members to provide progress updates and discuss priorities for the next six-month period. See the Progress Updates section for workshop recordings and meeting notes.