Large cohort genome projects

One of the greatest concerns facing large cohort genome projects is that, once the raw read data have been analyzed, their sheer volume makes it computationally expensive to reanalyze them in search of particular variants. Keeping all of the raw read data in BioGraph Analysis Format (BAF) makes it possible to go back and run further analyses rapidly, even when the data set is very large. For example, SV-typing approximately 30,000 putative SV sites across 10,000 HiSeq X WGS samples stored in BAF would require less than 100 TB of storage, 166 CPU hours, and 2.5 hours in total.
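To put the quoted figures in per-sample terms, the short calculation below divides them out. It is only a back-of-the-envelope sketch: it assumes the 2.5-hour total is elapsed (wall-clock) time, treats the 100 TB figure as an upper bound in decimal units, and uses variable names invented for this illustration.

```python
# Figures quoted in the text above.
n_sites = 30_000        # putative SV sites to query
n_samples = 10_000      # HiSeq X WGS samples
storage_tb_max = 100    # quoted upper bound on storage, in TB
cpu_hours = 166         # quoted CPU time
elapsed_hours = 2.5     # ASSUMPTION: the 2.5-hour total is wall-clock time

# Every site is typed in every sample.
site_sample_queries = n_sites * n_samples

# Average CPU cost per sample and per (site, sample) query.
cpu_seconds_per_sample = cpu_hours * 3600 / n_samples
cpu_ms_per_query = cpu_hours * 3600 * 1000 / site_sample_queries

# Upper bound on storage per sample (decimal units: 1 TB = 1000 GB).
storage_gb_per_sample_max = storage_tb_max * 1000 / n_samples

# Average number of CPUs busy at once, if the 2.5 hours is elapsed time.
implied_parallelism = cpu_hours / elapsed_hours

print(f"site-sample queries:        {site_sample_queries:,}")
print(f"CPU seconds per sample:     {cpu_seconds_per_sample:.2f}")
print(f"CPU ms per site-sample:     {cpu_ms_per_query:.2f}")
print(f"storage per sample (max):   {storage_gb_per_sample_max:.1f} GB")
print(f"implied average CPUs used:  {implied_parallelism:.1f}")
```

Under these assumptions the quoted numbers correspond to roughly one CPU-minute and under 10 GB per sample, about 2 ms of CPU time per site per sample, and an average of about 66 CPUs working in parallel.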

Because the breakpoints of structural variants are often fuzzy, it is very difficult to compare those variants across samples from different individuals. BioGraph Format uses whole-read overlap, so its breakpoints are exactly the same as, or very close to, the actual breakpoints (see results of Anchored Assembly below). In addition, constructing a multi-individual graph yields a natural coordinate system that allows rapid comparison of all paths, including structural variants. Comparing groups and individuals directly leads to fewer errors in identifying inter-individual differences that could be associated with disease.

Additionally, a human reference can be represented as a path through the graph, so existing and future references can be added as further paths.
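The idea of references and individuals as paths through a shared graph can be illustrated with a toy example. The sketch below is not the BioGraph data structure or API; the node names, sequences, and helper functions (spell, private_nodes, add_path) are all invented for illustration, and it assumes each path is simply an ordered list of node identifiers.

```python
from typing import Dict, List, Set

# Toy sequence graph: node id -> sequence segment (all values invented).
nodes: Dict[str, str] = {
    "n1": "ACGT",
    "n2": "TTGA",          # segment carried by the reference
    "n3": "TTGACCCCCCGA",  # hypothetical insertion allele replacing n2
    "n4": "GGCA",
}

# Each individual, and each reference, is a named path through the graph.
paths: Dict[str, List[str]] = {
    "reference": ["n1", "n2", "n4"],
    "sample_A": ["n1", "n2", "n4"],
    "sample_B": ["n1", "n3", "n4"],
}


def spell(path: List[str]) -> str:
    """Concatenate node sequences to recover the sequence of a path."""
    return "".join(nodes[node_id] for node_id in path)


def private_nodes(path_name: str, other_name: str) -> Set[str]:
    """Nodes visited by one path but not by the other."""
    return set(paths[path_name]) - set(paths[other_name])


def add_path(name: str, node_ids: List[str]) -> None:
    """Register a new path, e.g. a future reference, over existing nodes."""
    missing = [n for n in node_ids if n not in nodes]
    if missing:
        raise KeyError(f"unknown nodes: {missing}")
    paths[name] = list(node_ids)


# Shared node ids act as the common coordinate system for comparison.
print(private_nodes("sample_B", "reference"))
print(private_nodes("sample_A", "reference"))

# A later reference release is just one more path over the same graph.
add_path("reference_v2", ["n1", "n3", "n4"])
print(spell(paths["reference_v2"]))
```

Because every path is expressed in terms of the same node identifiers, a difference between two individuals, or between an individual and any reference, reduces to a set comparison in this toy model.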

Clinical applications

In the clinic, it is important to have accurate results. Using BioGraph Format, the genome of a patient can be compared directly and rapidly to other samples, rather than comparing only the results of variant calling. This is of particular interest in the analysis of trios, where de novo variants can be sought by comparing the patient to the parents, or by querying whether putative de novo variants are present in the population. Further, if particular variants could explain an individual’s condition, the entire read set can be searched very rapidly for the sequence that would indicate the presence of such a variant. Finally, the format can be used to double-check the read evidence for variants detected by a traditional caller. Overall, these options give the bioinformaticians who support clinicians the ability to verify results that will often be relied upon.
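The trio search described above can be sketched in a few lines. This is a deliberately naive illustration, not the BioGraph API: it assumes each sample's reads are available as a plain list of strings, scans them linearly rather than through an index, and uses invented function names, toy reads, and an arbitrary threshold of two supporting reads.

```python
from typing import List

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq))


def count_supporting_reads(reads: List[str], query: str) -> int:
    """Count reads containing the query on either strand."""
    rc = reverse_complement(query)
    return sum(1 for read in reads if query in read or rc in read)


def is_candidate_de_novo(query: str, child: List[str], mother: List[str],
                         father: List[str], min_child_reads: int = 2) -> bool:
    """True if the variant sequence is supported in the child's reads
    and absent from the reads of both parents."""
    return (count_supporting_reads(child, query) >= min_child_reads
            and count_supporting_reads(mother, query) == 0
            and count_supporting_reads(father, query) == 0)


# Toy data: a sequence spanning the putative variant, and invented reads.
variant_seq = "ACGTTAGC"
child_reads = ["TTACGTTAGCAA", "GGGCTAACGTCC", "AAAAAAAA"]
mother_reads = ["TTACGTAAGCAA"]
father_reads = ["CCCCGGGG"]

print(count_supporting_reads(child_reads, variant_seq))
print(is_candidate_de_novo(variant_seq, child_reads,
                           mother_reads, father_reads))
```

The same count_supporting_reads helper could be pointed at the read set of a single sample to double-check the read evidence for a variant reported by a traditional caller.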

The nature of the BioGraph Format lends itself to a large number of applications, many of which are not covered here. Please contact the Spiral Genetics team to discuss whether the BioGraph Format could improve the results of your study.