NHGRI Analysis Visualization and Informatics Lab-space


Step 5 - QC Data

After submission, you will evaluate genomic data (e.g., BAMs or CRAMs) for basic sequence yield and quality control (QC) metrics. These metrics ensure that the depth and breadth of coverage requirements are met for all data ingested into AnVIL.

The AnVIL Data Processing Working Group has created a genomic evaluation tool for whole genome data (a whole exome QC tool is in development). You will collect quality control metrics for genome and exome sequencing data by running the tool, a workflow written in Workflow Description Language (WDL), in a sandbox workspace.

The WDL combines multiple software packages (Picard, VerifyBamID2, Samtools flagstat, bamUtil stats) into a single, efficient tool that is compatible with AnVIL.

The current QC pass/fail status is based on three metrics: coverage, freemix, and sample contamination. QC metrics can be made available in the AnVIL workspace to aid users in sample selection.

QC processing results table

The workflow writes its current output to a qc_results_sample data table, shown below.

| Metric Name | Metric Description | Pass threshold | Purpose | Source Tool |
|---|---|---|---|---|
| qc_results_sample_id | Sample ID | NA | Identify sample | NA |
| cram | CRAM Google path | NA | Locate file | NA |
| FREEMIX | FREEMIX | < 0.01 | Sample contamination | VerifyBamID2 |
| MEAN_COVERAGE | Haploid coverage | ≥ 30 | Coverage depth | Picard CollectWgsMetrics |
| MEDIAN_ABSOLUTE_DEVIATION | Library insert size MAD | NA | Batch characteristics | Picard CollectInsertSizeMetrics |
| MEDIAN_INSERT_SIZE | Library insert size median | NA | Batch characteristics | Picard CollectInsertSizeMetrics |
| PCT_10X | % coverage at 10X | > 0.95 | Coverage breadth | Picard CollectWgsMetrics |
| PCT_20X | % coverage at 20X | > 0.90 | Coverage breadth | Picard CollectWgsMetrics |
| PCT_30X | % coverage at 30X | NA | Additional metadata | Picard CollectWgsMetrics |
| PCT_CHIMERAS (PAIR) | % chimeras | < 0.05 | Variant detection | Picard CollectAlignmentSummaryMetrics |
| Percent_duplication | % duplication | | | |
| Q20_BASES | Total bases with Q20 or higher | ≥ 86 × 10^9 | Sequence quality | Picard CollectQualityYieldMetrics |
| qc_status | Reported status at the sample level | Pass/Fail/No QC | Overall quality assessment | |
| read1_pf_mismatch_rate | Read1 base mismatch rate | < 0.05 | Sequence quality | Picard CollectAlignmentSummaryMetrics |
| read2_pf_mismatch_rate | Read2 base mismatch rate | < 0.05 | Sequence quality | Picard CollectAlignmentSummaryMetrics |

5.1 Select QC status criteria

Data submitters should establish the specific metrics and thresholds for determining the pass/fail criteria on their dataset.

5.2 Run QC Processing

Data submitters are responsible for running the WDL on their data to generate the QC metrics. The AnVIL Data Processing Working Group has created a QC aggregator Jupyter notebook. Once QC status criteria have been determined, the thresholds can be modified in the notebook. The criteria are used to assign a QC status of pass or fail. If a sample fails multiple times, it is assigned No QC under QC status.
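As a rough illustration of how adjustable thresholds could drive the pass/fail assignment, the sketch below encodes the pass thresholds from the QC processing results table as a Python dictionary and checks one sample against them. It is not the Working Group's notebook code: the names DEFAULT_THRESHOLDS, failed_metrics, and qc_status are hypothetical, the sample values are made up, and treating a missing metric as a failure is an assumption. The No QC status is left out because its rule depends on how repeated failures are tracked.

```python
import operator

# Pass thresholds copied from the QC processing results table.
# The dictionary layout and every name here are illustrative assumptions.
DEFAULT_THRESHOLDS = {
    "FREEMIX": (operator.lt, 0.01),
    "MEAN_COVERAGE": (operator.ge, 30),
    "PCT_10X": (operator.gt, 0.95),
    "PCT_20X": (operator.gt, 0.90),
    "PCT_CHIMERAS": (operator.lt, 0.05),
    "Q20_BASES": (operator.ge, 86e9),  # 86 x 10^9 bases
    "read1_pf_mismatch_rate": (operator.lt, 0.05),
    "read2_pf_mismatch_rate": (operator.lt, 0.05),
}


def failed_metrics(sample, thresholds=DEFAULT_THRESHOLDS):
    """Return the names of metrics that are missing or miss their threshold."""
    failures = []
    for name, (compare, limit) in thresholds.items():
        value = sample.get(name)
        # Assumption: a metric that was not reported counts as a failure.
        if value is None or not compare(value, limit):
            failures.append(name)
    return failures


def qc_status(sample, thresholds=DEFAULT_THRESHOLDS):
    """Assign 'Pass' only when every thresholded metric passes."""
    return "Pass" if not failed_metrics(sample, thresholds) else "Fail"


# A submitter could tighten one threshold without touching the rest.
strict = {**DEFAULT_THRESHOLDS, "MEAN_COVERAGE": (operator.ge, 35)}

# Made-up metrics for a single sample.
sample = {
    "FREEMIX": 0.002,
    "MEAN_COVERAGE": 32.1,
    "PCT_10X": 0.98,
    "PCT_20X": 0.95,
    "PCT_CHIMERAS": 0.01,
    "Q20_BASES": 95e9,
    "read1_pf_mismatch_rate": 0.004,
    "read2_pf_mismatch_rate": 0.006,
}

print(qc_status(sample))               # status under the table thresholds
print(failed_metrics(sample, strict))  # metrics failing the stricter set
```

Keeping the comparison operator next to each limit lets a dataset-specific change stay a one-line override, as with the stricter coverage requirement above.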

Video - Walkthrough of WGS QC Processing

5.3 Post QC Processing to AnVIL Workspaces

The output from the QC aggregator is a QC summary results TSV file. Data submitters hand off this file to the AnVIL ingestion team. The AnVIL team then pushes the QC summary results to the workspaces, which will show the QC status of every sample, including samples that fail QC or have no QC. The example below is the QC results table in a 1000 Genomes workspace.

Sample QC Results Table

[Figure: QC results in a 1000 Genomes workspace]
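
The QC summary results file handed to the ingestion team is a plain tab-separated table. A minimal sketch of writing one with Python's standard csv module follows. The column selection, the file name qc_summary_results.tsv, the function name write_qc_summary, and the sample rows are all made up for illustration and do not reflect the aggregator's actual output layout.

```python
import csv

# Hypothetical rows; real values come from the QC workflow outputs.
rows = [
    {"qc_results_sample_id": "sample_A", "MEAN_COVERAGE": 32.1,
     "FREEMIX": 0.002, "qc_status": "Pass"},
    {"qc_results_sample_id": "sample_B", "MEAN_COVERAGE": 24.7,
     "FREEMIX": 0.003, "qc_status": "Fail"},
]


def write_qc_summary(rows, path):
    """Write one header line plus one tab-separated line per sample."""
    fieldnames = list(rows[0].keys())
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)


write_qc_summary(rows, "qc_summary_results.tsv")
```

Samples that fail QC are written alongside passing ones, matching the expectation that the workspace table shows every status.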

Additional Resources - Upcoming AnVIL Tools

The AnVIL Data Processing Working Group is evaluating two tools to add to the submission process that estimate genetic sex and compare it to reported sex. The goal is to identify, at the cohort level, any major discrepancies between the genomic data and the reported phenotype data. Variation in sex chromosome copy number (e.g., XXY, XO, somatic mosaicism) means that genetic sex prediction is not 100% accurate, but it remains an excellent way to detect major cohort-level issues.
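
The tools under evaluation are not named here, so the sketch below covers only the comparison step: given a reported sex and a predicted genetic sex per sample, it lists the discordant samples and the cohort-level discordance rate. The function name sex_discordance, the "F"/"M" coding, and the sample IDs are assumptions made for illustration.

```python
def sex_discordance(reported, predicted):
    """Compare reported and predicted sex for samples present in both maps.

    Returns the sorted list of discordant sample IDs and the fraction of
    shared samples that are discordant.
    """
    shared = sorted(set(reported) & set(predicted))
    discordant = [s for s in shared if reported[s] != predicted[s]]
    rate = len(discordant) / len(shared) if shared else 0.0
    return discordant, rate


# Made-up cohort: one of four samples disagrees.
reported = {"s1": "F", "s2": "M", "s3": "F", "s4": "M"}
predicted = {"s1": "F", "s2": "M", "s3": "M", "s4": "M"}

discordant, rate = sex_discordance(reported, predicted)
print(discordant, rate)
```

A few isolated mismatches may reflect sex chromosome copy number variation, whereas a high rate across the cohort points to a problem such as swapped sample identifiers.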

Exome QC Processing

Coming soon

Sex Check

Coming soon
