Highlights

The Spiral Genetics BioGraph™ Toolkit is an advanced, indexed, graph-based suite of products for genomic data management, analysis, and discovery. The BioGraph Analysis Format is a method of storing NGS read data as a whole-read overlap assembly that has been indexed for rapid querying.

  • Incorporates all read data across hundreds of individuals using a read overlap graph.
  • The read overlap graph can be queried at a rate of more than 200,000 queries per second on just 8 cores.
  • With the API, tools can be scripted to identify anything from gene fusions, de novo variants, and structural variants to the allele frequency of SNPs across 1,000 individuals.
  • Read data for multiple individuals can be stored in less than 3GB per individual.

This format allows very rapid querying of all the reads, not just those aligned to the reference. Because it is a graph representation, samples can be compared directly, even in regions that have not been mapped to the reference.

Essentially, the approach preprocesses input files into a graph format that both captures the sequence reads in their entirety and makes them much easier to search. The standard approach for detecting variants aligns input sequences to a reference, looks for mismatches, and then generates a list of variants, leaving out information about the reads from which the variant calls were made. In BioGraph, by contrast, reads are indexed in stacked graphs, which makes it easy for users to search for variations by following paths through the graphs.
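To make the idea of a read overlap graph concrete, here is a toy sketch in Python. It is not the BioGraph implementation or its API: the function names, the minimum overlap length, and the brute-force all-against-all comparison are assumptions chosen for illustration only, and a real index would be far more compact and much faster. Each read is a node, an edge links a read whose suffix matches the prefix of another read, and following edges spells out a longer sequence without consulting a reference.

```python
from collections import defaultdict


def suffix_prefix_overlap(a, b, min_len):
    """Return the length of the longest suffix of `a` that equals a prefix
    of `b`, or 0 if no such overlap of at least `min_len` bases exists."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def build_overlap_graph(reads, min_len=3):
    """Hypothetical helper: map each read index to a list of
    (next_read_index, overlap_length) edges. Brute force, O(n^2) pairs."""
    graph = defaultdict(list)
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                n = suffix_prefix_overlap(a, b, min_len)
                if n:
                    graph[i].append((j, n))
    return dict(graph)


def spell_path(reads, graph, start):
    """Follow the first outgoing edge from each read and return the
    sequence spelled by the path, stopping at a dead end or a cycle."""
    sequence = reads[start]
    seen = {start}
    node = start
    while node in graph:
        nxt, overlap = graph[node][0]
        if nxt in seen:
            break
        sequence += reads[nxt][overlap:]  # append only the non-overlapping tail
        seen.add(nxt)
        node = nxt
    return sequence


# Three made-up reads that tile a short stretch of sequence.
reads = ["ACGTAC", "GTACGG", "ACGGTT"]
graph = build_overlap_graph(reads, min_len=3)
path_sequence = spell_path(reads, graph, start=0)
print(graph)
print(path_sequence)
```

Because the graph is built from the reads themselves, a path exists for any sequence the reads support, whether or not that sequence aligns to a reference.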

How is the BioGraph Format different?

A number of other graph-based methods seek to take variant calls and arrange them in a graph-based structure to capture variation. But because the variation originates from variant calls, these graphs are susceptible to the same false discoveries and lower sensitivity observed in the variant callers that created them. Further, they may not naturally contain phasing information, which sometimes leads to an explosion of the graph structure at locations with many variants close together. The BioGraph Format, by contrast, is built from the reads themselves rather than from variant calls.

Encoding into the BioGraph Format

The BioGraph Analysis Format (BAF) can be generated from a regular BAM file. For a whole human genome at 30x coverage, a BAF file can be produced in 20 to 30 hours on a 24-core machine with 64 gigabytes of memory. The resulting file is approximately 10GB and stores all the reads from the original data. Graphs from multiple whole human genomes can be combined at a rate of approximately one genome per hour. Once some 50 individuals are merged together, the read data can be stored using approximately 3GB per individual.
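The figures above can be turned into a rough planning estimate. The Python sketch below is back-of-the-envelope arithmetic only, using the approximate numbers quoted in this section. The function name, the assumption that genomes are encoded one after another on a single machine, and the restriction to cohorts of at least 50 individuals (the point at which the 3GB figure is quoted) are all illustrative assumptions, not measured behavior.

```python
# Approximate figures quoted in the text above.
SINGLE_SAMPLE_BAF_GB = 10        # one 30x whole human genome, unmerged
MERGED_GB_PER_INDIVIDUAL = 3     # quoted once some 50 individuals are merged
ENCODE_HOURS_RANGE = (20, 30)    # per genome, 24 cores and 64 GB of memory
MERGE_HOURS_PER_GENOME = 1       # approximately one genome added per hour
MIN_COHORT_FOR_MERGED_FIGURE = 50


def estimate_cohort(n_individuals):
    """Hypothetical helper: rough storage and time estimate for a cohort.

    Assumes sequential encoding on one machine; encoding several genomes
    in parallel on separate machines would shorten the wall-clock time.
    """
    if n_individuals < MIN_COHORT_FOR_MERGED_FIGURE:
        raise ValueError(
            "the 3GB-per-individual figure is only quoted for about 50 or more individuals"
        )
    low, high = ENCODE_HOURS_RANGE
    return {
        "separate_storage_gb": n_individuals * SINGLE_SAMPLE_BAF_GB,
        "merged_storage_gb": n_individuals * MERGED_GB_PER_INDIVIDUAL,
        "encode_hours": (n_individuals * low, n_individuals * high),
        "merge_hours": n_individuals * MERGE_HOURS_PER_GENOME,
    }


estimate = estimate_cohort(100)
for key, value in estimate.items():
    print(f"{key}: {value}")
```

Under these assumptions, 100 individuals would need roughly 1,000GB as separate BAF files but about 300GB once merged, with around 100 hours spent on merging.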