AnVIL Portal

Step 5 - QC Data


Data Submitter QC Guidance

Data submitters are responsible for defining, developing, and maintaining a quality control (QC) process that ensures both the integrity and the quality of the data they submit. Although QC resources are provided, data submitters ultimately decide what is most appropriate for the data they submit to the AnVIL.

QC Process Overview

Data submitters are responsible for defining and running the workflow (WDL) on their genomic data to generate the QC metrics (see the Data Submitter QC Guidance section above). Once the QC status criteria have been determined, a thresholds file can be added to the workspace and used as a workflow input. The criteria are used to assign a QC status of pass or fail. If a sample fails multiple times, it is assigned No QC under QC status. The workflow outputs the result of the threshold evaluation as an additional PASS/FAIL column, overall_evaluation.
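
The threshold evaluation described above can be sketched in Python. This is only an illustration of the idea, not the workflow's implementation: the `THRESHOLDS` layout and the `evaluate` function are hypothetical, the real thresholds file format may differ, and treating a missing metric as a failure is an assumption. The threshold values themselves are taken from the metrics table later on this page.

```python
import operator

# Comparison operators allowed in the hypothetical thresholds mapping.
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}

# Values come from the metrics table on this page; the dict layout is an
# assumption and not the workflow's actual thresholds-file format.
THRESHOLDS = {
    "FREEMIX": ("<", 0.01),
    "MEAN_COVERAGE": (">=", 30),
    "PCT_10X": (">", 0.95),
    "PCT_20X": (">", 0.90),
}


def evaluate(metrics, thresholds=THRESHOLDS):
    """Return (overall_evaluation, failed_metric_names) for one sample."""
    failed = []
    for name, (op, limit) in thresholds.items():
        value = metrics.get(name)
        # Assumption: a missing metric counts as a failure in this sketch.
        if value is None or not OPS[op](float(value), limit):
            failed.append(name)
    return ("PASS" if not failed else "FAIL"), failed


# Made-up metric values for a single sample.
sample = {"FREEMIX": 0.002, "MEAN_COVERAGE": 31.4, "PCT_10X": 0.97, "PCT_20X": 0.93}
print(evaluate(sample))  # ('PASS', [])
```

Returning the list of failed metrics alongside the PASS/FAIL value makes it easier to see why a sample failed.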

Video - Walkthrough of WGS QC Processing

AnVIL QC Recommendations

  • Use a reproducible workflow definition, e.g., Workflow Description Language (WDL) or another definition language supported by the AnVIL platform. Alternatively, data submitters could define and document an equivalent reproducible process to generate the QC metrics to be evaluated.

  • Use a data dictionary to share the expectations for metrics collected, values reported or displayed for evaluation, as well as any evaluation criteria to determine a status, e.g., pass/fail/noqc, for the submitted data.

  • Establish the specific metrics and thresholds for determining the pass/fail criteria on the dataset.

  • The output of the entire QC process should be compatible with the AnVIL data repository components such as Terra workspace data tables and/or Data Repository to enable importing QC metrics and statuses in the AnVIL workspaces.
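
As a rough illustration of the data dictionary recommendation, the sketch below records a few metrics in Python and checks a results row against them. The `MetricEntry` class, its field names, and `check_columns` are hypothetical and not part of any AnVIL tooling; the three entries are copied from the metrics table later on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricEntry:
    """One row of a hypothetical QC data dictionary."""
    name: str
    description: str
    pass_threshold: str  # human-readable; "NA" when no threshold applies
    source_tool: str


DATA_DICTIONARY = [
    MetricEntry("FREEMIX", "FREEMIX", "< 0.01", "VerifyBamID2"),
    MetricEntry("MEAN_COVERAGE", "Haploid Coverage", ">= 30", "Picard CollectWgsMetrics"),
    MetricEntry("PCT_30X", "% coverage at 30X", "NA", "Picard CollectWgsMetrics"),
]


def check_columns(row, dictionary=DATA_DICTIONARY):
    """Return (missing, undocumented) column names for one results row."""
    expected = {entry.name for entry in dictionary}
    present = set(row)
    return sorted(expected - present), sorted(present - expected)


# Made-up row: one documented metric is absent and one column is undocumented.
missing, undocumented = check_columns({"FREEMIX": 0.002, "MEAN_COVERAGE": 31.4, "extra": 1})
print(missing, undocumented)  # ['PCT_30X'] ['extra']
```

A check like this can flag drift between the documented metrics and the columns a workflow run actually produces.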

Example QC Resource

The AnVIL Data Processing Working Group has created a genomic evaluation tool for whole genome and exome data. To collect quality control metrics for genome and exome sequencing data, data submitters run the tool, a workflow written in Workflow Description Language (WDL), in a sandbox workspace.

QC WDL details

  • The WDL includes multiple software packages (Picard, VerifyBamID2, Samtools flagstat, bamUtil stats, Rx estimation) organized in a single, efficient tool that is compatible with AnVIL.

  • The WDL workflow is publicly available via GitHub (https://github.com/genome/qc-analysis-pipeline/tree/master) and Dockstore (https://dockstore.org/organizations/anvil/collections/qcwgs).

  • The AnVIL will accept edits, commits, issues, and pull requests in GitHub.

  • Support and guidance are provided by the AnVIL for data submitters on the use of or adaptation of the existing QC workflow resource.

QC inputs and outputs

Example QC Metrics

QC processing results table

The table below lists the current outputs, files, and discrete values generated by the workflow in a qc_results_sample data table. They are also documented at: https://github.com/genome/qc-analysis-pipeline/blob/master/docs/outputs.md

| Metric Name | Metric Description | Pass Threshold | Purpose | Source Tool |
|---|---|---|---|---|
| qc_results_sample_id | Sample ID | NA | Identify sample | NA |
| cram | CRAM Google path | NA | Locate file | NA |
| FREEMIX | FREEMIX | < 0.01 | Sample contamination | VerifyBamID2 |
| MEAN_COVERAGE | Haploid coverage | ≥ 30 | Coverage depth | Picard CollectWgsMetrics |
| MEDIAN_ABSOLUTE_DEVIATION | Library insert size MAD | NA | Batch characteristics | Picard CollectInsertSizeMetrics |
| MEDIAN_INSERT_SIZE | Library insert size median | NA | Batch characteristics | Picard CollectInsertSizeMetrics |
| PCT_10X | % coverage at 10X | > 0.95 | Coverage breadth | Picard CollectWgsMetrics |
| PCT_20X | % coverage at 20X | > 0.90 | Coverage breadth | Picard CollectWgsMetrics |
| PCT_30X | % coverage at 30X | NA | Additional metadata | Picard CollectWgsMetrics |
| PCT_CHIMERAS (PAIR) | % chimeras | < 0.05 | Variant detection | Picard CollectAlignmentSummaryMetrics |
| Percent_duplication | % duplication | | | |
| Q20_BASES | Total bases with Q20 or higher | ≥ 86 × 10^9 | Sequence quality | Picard CollectQualityYieldMetrics |
| qc_status | Reported status at the sample level | Pass/Fail/No QC | Overall quality assessment | |
| read1_pf_mismatch_rate | Read1 base mismatch rate | < 0.05 | Sequence quality | Picard CollectAlignmentSummaryMetrics |
| read2_pf_mismatch_rate | Read2 base mismatch rate | < 0.05 | Sequence quality | Picard CollectAlignmentSummaryMetrics |

Post QC Processing to AnVIL Workspaces

The output from the QC aggregator is a QC summary results TSV file. Data submitters hand off the QC summary results file to the AnVIL ingestion team. The AnVIL team then pushes the QC summary results to the workspaces. The results include the QC status of every sample, including samples that fail QC or have no QC. The example below is the QC results table in the 1000 Genomes workspace.
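
To give a feel for what a QC summary results TSV might look like, here is a minimal Python sketch that writes per-sample rows as tab-separated text. It is not the QC aggregator itself: the `write_summary` helper, the reduced column set, and the sample IDs and values are all made up for illustration, and the real file contains the columns produced by the workflow.

```python
import csv
import io
from collections import Counter

# Assumption: a small subset of the columns from the metrics table.
COLUMNS = ["qc_results_sample_id", "FREEMIX", "MEAN_COVERAGE", "qc_status"]


def write_summary(rows, handle):
    """Write per-sample QC rows to `handle` as tab-separated text."""
    writer = csv.DictWriter(
        handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical samples with made-up metric values.
rows = [
    {"qc_results_sample_id": "sample_A", "FREEMIX": 0.002,
     "MEAN_COVERAGE": 31.4, "qc_status": "Pass"},
    {"qc_results_sample_id": "sample_B", "FREEMIX": 0.03,
     "MEAN_COVERAGE": 28.1, "qc_status": "Fail"},
]

buffer = io.StringIO()
write_summary(rows, buffer)
tsv_text = buffer.getvalue()

# Quick tally of statuses before handing the file off.
status_counts = Counter(row["qc_status"] for row in rows)
print(tsv_text)
print(dict(status_counts))
```

Writing to an in-memory buffer keeps the example self-contained; in practice the same function could be given an open file handle.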

Sample QC Results Table

[Figure: QC results in a 1000 Genomes workspace]
