One role of the Kids First DRC is to produce harmonized datasets - files that can be compared across the different Kids First studies in the fields of pediatric cancer and structural birth defects. These harmonized datasets are produced by a set of Kids First bioinformatic workflows. Running genomic files from different studies through the same workflows produces variant calls that can be directly compared with one another, in support of our mandate for cross-disease research.
The Kids First DRC has recently completed a large-scale update to our Somatic Variant Workflow, a pipeline for identifying variants within tumors in our cancer studies. All somatic files released on the Kids First Portal in the future will be produced using this latest version of the pipeline. We have also replaced some existing files on the Kids First Portal with new versions generated by this workflow. The improvements made to the workflow will better support the use cases of Kids First researchers in their discovery and analysis efforts.
Three studies on the Kids First Portal will be affected by the changes to the Somatic Variant Workflow.
- Kids First: Neuroblastoma (KF-NBL)
- Pediatric Brain Tumor Atlas: CBTTC (PBTA-CBTN)
- Pediatric Brain Tumor Atlas: PNOC (PBTA-PNOC2)
Each of these studies was run on the previous version of the Somatic Variant Workflow. We have rerun these through the new workflow, producing new files that have been released on the Kids First Portal.
Files produced from the previous version of the workflow have been retired and removed from the Portal. We do not recommend using these old files for future research. The improvements made to the workflow include better filtering of germline variants and highlighting variant hotspots, details of which are provided below.
The Kids First DRC’s Somatic Variant Workflow is a comprehensive pipeline that includes multiple callers for single nucleotide variants (SNVs), copy number variants (CNVs), and structural variants (SVs):
- Somatic SNVs: Strelka2, Mutect2, Lancet, VarDict Java
- Somatic CNVs: ControlFreeC, CNVkit
- Somatic SVs: Manta
The two significant changes made to the somatic workflow are the inclusion of germline masking and an annotation subworkflow.
Germline masking is a filtering step which better annotates variants to distinguish germline variants (variants within an individual’s germline or “healthy, normal” cells) from somatic variants (novel mutations which arise within a tumor sample). In the new workflow, we have added a field to VCFs to tag variants that have an overall population allele frequency of >= 0.001 and/or a normal read depth of <= 7 reads. The inclusion of this read depth filter is explained by Dr. Nathan Johnson, an engineer who helped build the workflow:
“The scenario is, you’ve got a seeming somatic variant, with the normal-sample reads all supporting the reference allele. But if there’s only maybe 5 of those reads, a very small number, then there’s a decent chance that it’s a germline heterozygous variant and you just didn’t get lucky enough to see any alt reads in the normal sample. A prior understanding that heterozygous germline variants are much more common than somatic variants informs this.”
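As a rough illustration of the reasoning in this quote, the sketch below estimates how often a heterozygous germline variant would show no alternate reads in the normal sample purely by chance. It assumes that each read independently carries the alternate allele with probability 0.5, and it ignores sequencing error and allelic bias. The function name is hypothetical and this calculation is not part of the workflow itself.

```python
def prob_no_alt_reads(depth: int, alt_fraction: float = 0.5) -> float:
    """Chance that none of `depth` reads carries the alternate allele.

    Assumes independent reads, each showing the alternate allele with
    probability `alt_fraction` (0.5 for an idealized heterozygous site).
    """
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return (1.0 - alt_fraction) ** depth


if __name__ == "__main__":
    # Compare a few normal-sample depths around the 7-read cutoff.
    for depth in (5, 7, 8, 20):
        print(f"depth={depth:>2}  P(no alt reads) = {prob_no_alt_reads(depth):.7f}")
```

Under this simplified model, a heterozygous site covered by 5 normal reads shows no alternate reads about 3% of the time. That is small in isolation, but it matters once you account for heterozygous germline variants being far more common than somatic ones.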
In a test of the new workflow covering more than 1,000 samples and 7.76 million variant calls, the germline masking filter labeled 14.3% of calls as high allele frequency (14.2%), low depth (0.05%), or both (0.05%). This filtering improves both the quality of the output, in that the remaining variants are more likely to be somatic, and its safety, by reducing the risk that a germline variant could be revealed in publicly accessible data.
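The two masking criteria can be sketched as a small tagging function. This is a minimal illustration using the thresholds stated above, not the workflow's actual implementation: the tag names, the function name, and the handling of variants with no known population frequency are all assumptions.

```python
from typing import List, Optional

AF_THRESHOLD = 0.001        # population allele frequency cutoff from the text
NORMAL_DEPTH_THRESHOLD = 7  # normal-sample read depth cutoff from the text


def germline_mask_tags(pop_af: Optional[float], normal_depth: int) -> List[str]:
    """Return hypothetical mask labels for one variant call.

    pop_af is None when no population frequency is known for the variant;
    this sketch assumes such variants are not tagged on frequency.
    """
    tags = []
    if pop_af is not None and pop_af >= AF_THRESHOLD:
        tags.append("HIGH_POP_AF")       # hypothetical label
    if normal_depth <= NORMAL_DEPTH_THRESHOLD:
        tags.append("LOW_NORMAL_DEPTH")  # hypothetical label
    return tags


if __name__ == "__main__":
    print(germline_mask_tags(0.01, 30))    # common in the population
    print(germline_mask_tags(None, 5))     # too few normal reads
    print(germline_mask_tags(0.0001, 40))  # passes both checks
```

A call can carry one tag, both, or neither, which mirrors the three categories reported in the test above.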
The annotation subworkflow is a series of labeling steps that improves the utility of the output VCFs by giving context to the possible biological consequences of each variant. The VCFs from each of the SNV callers are first normalized, allowing for more accurate variant-by-variant comparisons across these multiple tools. They are then annotated using several resources: Variant Effect Predictor (VEP), gnomAD for allele frequencies, and cancer hotspots, a set of statistically significant mutations identified in other cancer genomics datasets.
Hotspot annotation provides researchers a subset of already studied variants to investigate immediately before a deeper exploration. Hotspot annotation also factors into how variants are filtered. Variants that are only identified by one of the four SNV callers are not included in the consensus VCF unless they are also annotated as a hotspot. In the same test described above, 761 of the 7.76 million variants were labeled as hotspots (about 1 in 10,000), 71 of which were only identified by a single caller.
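The consensus rule described above can be sketched as a simple predicate. This assumes that support from two or more SNV callers is enough for inclusion, which is our reading of the rule rather than a confirmed detail of the implementation. The function name and the lowercase caller labels are hypothetical.

```python
from typing import Set

# Hypothetical lowercase labels for the four SNV callers named in the text.
SNV_CALLERS = {"strelka2", "mutect2", "lancet", "vardict"}


def in_consensus(callers: Set[str], is_hotspot: bool) -> bool:
    """Decide whether a variant would appear in the consensus VCF.

    Assumed rule: keep variants seen by at least two callers, and keep
    single-caller variants only when they are annotated as a hotspot.
    """
    supporting = callers & SNV_CALLERS
    if len(supporting) >= 2:
        return True
    return len(supporting) == 1 and is_hotspot


if __name__ == "__main__":
    print(in_consensus({"strelka2", "mutect2"}, False))  # multi-caller support
    print(in_consensus({"lancet"}, True))                # single caller, hotspot
    print(in_consensus({"lancet"}, False))               # single caller, dropped
```

Under this rule, the 71 single-caller hotspot variants from the test above would be retained in the consensus VCF.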
Output from the somatic workflow will include multiple VCFs per sample. We recommend using the consensus VCF: each caller has its own approach to variant calling, and combining the results from multiple callers provides results that are well balanced for sensitivity and specificity. VCFs from each caller are also provided if you prefer these for your analysis.
VCFs are also categorized as either public or protected. Public VCFs will be open-access files that have the variants identified through germline masking completely removed. Protected VCFs will be controlled-access files available through dbGaP that have germline masked variants labeled, but still included in the output files.
As open-access variant files are replaced on the Kids First Portal, some variants may not appear in the latest public version. These were likely germline variants, not somatic variants. These variants are still available in the protected VCFs, accessible to those who apply for access through dbGaP.
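The relationship between the two file types can be sketched as follows. This is a minimal illustration that represents each variant as a plain dictionary with a hypothetical `mask_tags` list; it does not reflect the actual VCF fields or tooling used by the workflow.

```python
from typing import Dict, List, Tuple

Variant = Dict[str, object]


def split_public_protected(variants: List[Variant]) -> Tuple[List[Variant], List[Variant]]:
    """Return (public, protected) variant lists.

    Protected output keeps every variant, with mask labels intact.
    Public output drops any variant that carries a germline mask label.
    """
    protected = list(variants)
    public = [v for v in variants if not v.get("mask_tags")]
    return public, protected


if __name__ == "__main__":
    calls = [
        {"id": "var1", "mask_tags": []},
        {"id": "var2", "mask_tags": ["HIGH_POP_AF"]},  # hypothetical label
        {"id": "var3", "mask_tags": []},
    ]
    public, protected = split_public_protected(calls)
    print([v["id"] for v in public])
    print([v["id"] for v in protected])
```

A variant that is missing from a public file but present in the protected file is therefore one that was tagged by germline masking.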
If you have questions about changes to our Kids First Somatic Workflow, or any of our standard deliverables, please respond below. We look forward to continuing to support your research!