FASTQ Retirement

Where did the FASTQs go?

The Kids First DRC is retiring the FASTQ file format for all whole genome sequencing (WGS) data in Kids First studies on the portal. This is part of our ongoing effort to provide cross-analyzable datasets across all studies and to run more efficient operations in support of ongoing Kids First data generation.

These unaligned reads are the original data provided by our sequencing center partners. They were first delivered as a pilot to understand which formats would best support the Kids First community. Since then, the bioinformatics community has widely adopted BAM and CRAM, and retiring FASTQ support for whole genome data lets us focus our efforts on supporting harmonized alignment and downstream variant calling.

Our decision to no longer support the FASTQ format is consistent with the whole genome data files that the Broad Institute provides as deliverables to the Kids First program, which are CRAMs aligned to hg38. This in turn is consistent with the data deliverables for the Broad Institute’s Clinical Whole Genome Sequencing. Note that this will only affect FASTQs for WGS; FASTQs for RNA-Sequencing will still be supported.

Furthermore, the Kids First DRC BAM and CRAM alignments contain all unmapped reads, so all users will still have access to the full set of reads available for a particular sample. If you require the FASTQ format for your research, our BAMs and CRAMs can be converted back to FASTQ using either samtools or Picard, both of which are available on the CAVATICA platform.
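As a rough illustration of the samtools route, the sketch below builds and runs a `samtools collate | samtools fastq` pipeline from Python. It assumes paired-end data, a samtools installation on the PATH, and (for CRAM input) access to the reference FASTA the file was aligned to. The function names, file names, and output naming scheme are hypothetical, and this is not necessarily the workflow used on CAVATICA.

```python
import subprocess
from typing import List, Optional, Tuple


def build_cram_to_fastq_commands(
    alignment_path: str,
    out_prefix: str,
    reference: Optional[str] = None,
) -> Tuple[List[str], List[str]]:
    """Return (collate_cmd, fastq_cmd) argv lists for a paired-end conversion."""
    # Group reads by name so mates are adjacent.
    # -u: uncompressed output, -O: write to stdout.
    collate_cmd = ["samtools", "collate", "-u", "-O"]
    if reference is not None:
        # Assumption: the CRAM needs its reference FASTA to be decoded.
        collate_cmd += ["--reference", reference]
    collate_cmd.append(alignment_path)

    # Read the collated stream from stdin and split it into R1/R2 files.
    fastq_cmd = [
        "samtools", "fastq",
        "-1", f"{out_prefix}_R1.fastq.gz",  # first mates (hypothetical naming)
        "-2", f"{out_prefix}_R2.fastq.gz",  # second mates
        "-0", "/dev/null",                  # discard reads flagged as neither/both
        "-s", "/dev/null",                  # discard singletons
        "-n",                               # do not append /1 and /2 to read names
    ]
    return collate_cmd, fastq_cmd


def run_conversion(
    alignment_path: str, out_prefix: str, reference: Optional[str] = None
) -> None:
    """Pipe `samtools collate` into `samtools fastq` and check both exit codes."""
    collate_cmd, fastq_cmd = build_cram_to_fastq_commands(
        alignment_path, out_prefix, reference
    )
    collate = subprocess.Popen(collate_cmd, stdout=subprocess.PIPE)
    fastq = subprocess.Popen(fastq_cmd, stdin=collate.stdout)
    collate.stdout.close()  # let collate see a broken pipe if fastq exits early
    fastq_rc = fastq.wait()
    collate_rc = collate.wait()
    if collate_rc != 0 or fastq_rc != 0:
        raise RuntimeError(
            f"conversion failed (collate={collate_rc}, fastq={fastq_rc})"
        )
```

Calling `run_conversion("sample.cram", "sample", reference="ref.fa")` would then write `sample_R1.fastq.gz` and `sample_R2.fastq.gz`. Building the argument lists separately from running them makes it easy to inspect the commands before launching a large batch.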

Based on our analysis of current usage, cost, and time, we anticipate that the resources needed to convert back to FASTQ are minor compared to the resources needed to continue storing the FASTQs. An internal benchmarking experiment converting 150 Kids First CRAMs to FASTQ on CAVATICA had a median runtime of about 2 hours and a cost of about $0.50 per sample. If you anticipate doing a mass conversion of Kids First datasets, we would be interested in understanding your use case and in exploring support for the resulting downstream data files, so that the conversion does not have to be recomputed in the future.
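For planning purposes, the benchmark figures above can be turned into a back-of-the-envelope estimate. The sketch below assumes that cost scales linearly with the number of samples and that every sample takes the median runtime. Both are simplifications, and the `parallel_tasks` parameter is a hypothetical stand-in for however many conversions you run at once.

```python
import math
from typing import Tuple

# Approximate figures from the internal benchmark described above.
MEDIAN_HOURS_PER_SAMPLE = 2.0
COST_PER_SAMPLE_USD = 0.50


def estimate_conversion(n_samples: int, parallel_tasks: int = 1) -> Tuple[float, float]:
    """Return (total_cost_usd, wall_clock_hours) for converting n_samples files.

    Assumes linear cost scaling and that every sample takes the median runtime.
    """
    if n_samples < 0 or parallel_tasks < 1:
        raise ValueError("n_samples must be >= 0 and parallel_tasks >= 1")
    total_cost = n_samples * COST_PER_SAMPLE_USD
    # Samples are processed in batches of `parallel_tasks` at a time.
    batches = math.ceil(n_samples / parallel_tasks)
    wall_hours = batches * MEDIAN_HOURS_PER_SAMPLE
    return total_cost, wall_hours
```

Under these assumptions, `estimate_conversion(1000, parallel_tasks=50)` gives a total of about $500 and roughly 40 hours of wall-clock time.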

Note that standard workflows for BAMs/CRAMs delivered by both the DRC and our current sequencing partners include GATK Base Quality Score Recalibration (BQSR), so any FASTQ generated from these files will carry the recalibrated scores, not the original scores produced by the sequencer. As a result, some low-quality base calls (typically at the ends of reads) are binned and assigned a quality score of 0. We don’t anticipate this to have a major impact on analysis, as the general recommendation is not to use these lowest quality calls for downstream applications such as variant prediction. For further reading on the binning of quality scores, see this recent whitepaper by Illumina.
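If you want to see which quality values actually appear in a converted file, a small tally like the one below can help. It is a minimal sketch that assumes the standard Phred+33 encoding and four-line FASTQ records with no line wrapping. The function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def phred_scores(quality_line: str) -> List[int]:
    """Decode a Phred+33 quality string into integer quality scores."""
    return [ord(ch) - 33 for ch in quality_line]


def tally_quality_scores(fastq_lines: Iterable[str]) -> Counter:
    """Count how often each quality score occurs across all records.

    Assumes each record is exactly four lines: header, sequence, '+', quality.
    """
    counts: Counter = Counter()
    for index, line in enumerate(fastq_lines):
        if index % 4 == 3:  # the fourth line of each record holds the qualities
            counts.update(phred_scores(line.rstrip("\n")))
    return counts
```

For example, `phred_scores("!5I")` returns `[0, 20, 40]`. Passing an open FASTQ file handle to `tally_quality_scores` yields a count per score, which makes any binning visible as a small set of distinct values.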

Feel free to post additional questions or comments below!