Directly compare tens of thousands of whole genomes with the BioGraph™ Format.
BioGraph retains all of the sequence data for every sample and uses the redundancy and relationships between samples to reduce the overall size of the joint dataset. Unlike reference-based compression methods, this technology combines data compression with rapid comparison at scale and structural variation detection.
Harmonize Disparate Data
No matter which pipeline was originally used to analyze the data, BioGraph will harmonize it for uniform analysis. Force calling lets you take variant calls generated by multiple variant callers and check for evidence of those calls in any of your samples, against whichever reference is most convenient for your study.
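To make the idea of force calling concrete, here is a minimal sketch in Python. It is not the BioGraph API: the function names (`count_support`, `force_call`) and the plain substring search are assumptions chosen for illustration. Given a candidate variant and its flanking sequence, it counts the reads in each sample that support the reference allele and the alternate allele, on either strand.

```python
from typing import Dict, List

# Translation table for complementing DNA bases (hypothetical helper, not BioGraph code).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_support(reads: List[str], allele_seq: str) -> int:
    """Count reads that contain allele_seq on either strand."""
    rc = reverse_complement(allele_seq)
    return sum(1 for read in reads if allele_seq in read or rc in read)


def force_call(
    samples: Dict[str, List[str]],
    left_flank: str,
    ref: str,
    alt: str,
    right_flank: str,
) -> Dict[str, Dict[str, int]]:
    """For each sample, count reads supporting the ref and alt alleles.

    The allele is searched together with its flanks so that a match is
    specific to the variant site rather than to the allele bases alone.
    """
    ref_seq = left_flank + ref + right_flank
    alt_seq = left_flank + alt + right_flank
    return {
        name: {
            "ref": count_support(reads, ref_seq),
            "alt": count_support(reads, alt_seq),
        }
        for name, reads in samples.items()
    }


# Toy data: the same candidate call (T > A) checked in two samples.
samples = {
    "sample_a": ["ACGTTGCA", "CGTTGCAT"],
    "sample_b": ["ACGTAGCA", "TGCTACGG"],  # second read is on the reverse strand
}
calls = force_call(samples, left_flank="CGT", ref="T", alt="A", right_flank="GCA")
print(calls)
```

A linear scan like this would be far too slow for whole genomes. It only illustrates the question being asked of the reads; an indexed format is what makes answering it fast.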
Rapid Iterative Reanalysis
Add new samples with ease.
When conducting large studies, samples come in over time, and adding these new samples to the set can require significant reanalysis. With BioGraph, the raw read data can be searched at a rate of over 100,000 queries per second. Using the API, you can design queries to answer nearly any specific question.
Evidence for All Variants
Be confident in your variant calls.
The BioGraph Format retains all read information, including phasing information. This includes all read data indicating structural variants. If there is evidence of a particular variant in the read data, you can identify it in a fraction of a second.
The Underlying Technology
The Spiral Genetics BioGraph™ Toolkit is an advanced, indexed, graph-based suite of products for genomic data management, analysis, and discovery. The BioGraph Analysis Format is a method of storing NGS read data using a whole read overlap assembly that has been indexed with advanced technology.
- Incorporates all read data across hundreds of individuals using a read overlap graph.
- The read overlap graph can be queried at a rate of over 100,000 queries per second on just 8 cores.
- With the API, tools can be scripted to identify anything from gene fusions, de novo variants, and structural variants to the allele frequency of SNPs across 1,000 individuals.
- The read data for multiple individuals can be stored in less than 3 GB per individual.
This format allows very rapid querying of all the reads, not just those aligned to the reference. Because it is a graph representation, samples can be compared directly, even in regions that have not been mapped to the reference.
Essentially, the approach preprocesses input files and puts them into a graph format, which both captures the sequence reads in their entirety and makes them much easier to search. The standard approach for detecting variants aligns input sequences to a reference, looks for mismatches, and then generates a list of variants, leaving out information about the reads from which the variant calls were made. In BioGraph, by contrast, reads are indexed in stacked graphs, which makes it easy for users to search for variations by following paths through the graphs.
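The idea of finding variation by following paths through a read graph can be sketched with a toy read overlap graph. The Python below is an illustration under simplifying assumptions, not BioGraph's implementation: it uses a naive all-pairs suffix/prefix comparison, ignores reverse strands and sequencing errors, and the function names are hypothetical. Two reads that extend the same upstream read in different ways show up as a branch, and each branch spells a different allele.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Graph = Dict[str, List[Tuple[str, int]]]


def overlap_length(a: str, b: str, min_overlap: int) -> int:
    """Longest suffix of a that equals a prefix of b, or 0 if shorter than min_overlap."""
    for length in range(min(len(a), len(b)) - 1, min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def build_overlap_graph(reads: List[str], min_overlap: int) -> Graph:
    """Connect read a to read b when a's suffix overlaps b's prefix (naive O(n^2) scan)."""
    graph: Graph = defaultdict(list)
    for a in reads:
        for b in reads:
            if a != b:
                n = overlap_length(a, b, min_overlap)
                if n:
                    graph[a].append((b, n))
    return dict(graph)


def walk_paths(graph: Graph, start: str) -> List[List[str]]:
    """Enumerate simple paths from start until no unvisited successor remains."""
    paths = []
    stack = [[start]]
    while stack:
        path = stack.pop()
        successors = [b for b, _ in graph.get(path[-1], []) if b not in path]
        if not successors:
            paths.append(path)
        for b in successors:
            stack.append(path + [b])
    return paths


def spell_path(path: List[str], graph: Graph) -> str:
    """Merge the reads along a path into the sequence the path spells."""
    seq = path[0]
    for prev, nxt in zip(path, path[1:]):
        n = dict(graph.get(prev, []))[nxt]  # overlap length on the edge prev -> nxt
        seq += nxt[n:]
    return seq


# Toy reads: one shared upstream read and two downstream reads that differ by one base.
reads = ["AAGGCT", "GGCTCA", "GGCTTA"]
graph = build_overlap_graph(reads, min_overlap=4)
for path in walk_paths(graph, "AAGGCT"):
    print(spell_path(path, graph))
```

Each printed sequence corresponds to one path out of the shared read. Because the graph is built from the reads themselves, no reference alignment is involved in finding the branch.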
How is the BioGraph Format different?
A number of other graph-based methods take variant calls and arrange them in a graph structure to capture variation. But because the variation originates from variant calls, these graphs are susceptible to the same false discoveries and lower sensitivity observed in the variant callers that created them. Further, they may not naturally contain phasing information, which sometimes leads to an explosion of the graph structure at locations with many variants close together.
Encoding into the BioGraph Format
The BioGraph Format can be generated from a regular BAM file. For a whole human genome at 30x coverage, a BioGraph file can be produced in 20 to 30 hours on a 24-core machine with 64 GB of memory. The resulting file is approximately 10 GB and stores all the reads from the original data. Graphs from multiple whole human genomes can be combined at a rate of approximately one genome per hour. Once some 50 individuals are merged together, the read data can be stored using approximately 3 GB per individual.
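The figures above lend themselves to a quick back-of-the-envelope estimate. The Python sketch below only restates that arithmetic. The function name is hypothetical, the estimate assumes the approximate per-sample figures hold across the cohort, and it reports encoding as total machine-hours because the text does not say how many encodings run in parallel.

```python
from typing import Dict, Tuple, Union

# Approximate figures quoted in the text for 30x whole human genomes.
SINGLE_SAMPLE_GB = 10          # size of one stand-alone BioGraph file
MERGED_GB_PER_SAMPLE = 3       # per-individual size once about 50 are merged
ENCODE_HOURS_RANGE = (20, 30)  # per genome, on a 24-core machine with 64 GB of memory
MERGE_HOURS_PER_SAMPLE = 1     # approximate merge rate: one genome per hour

Estimate = Dict[str, Union[int, Tuple[int, int]]]


def estimate_cohort(n_samples: int) -> Estimate:
    """Rough storage and time estimate for a cohort (hypothetical helper)."""
    separate_gb = n_samples * SINGLE_SAMPLE_GB
    merged_gb = n_samples * MERGED_GB_PER_SAMPLE
    low, high = ENCODE_HOURS_RANGE
    return {
        "separate_gb": separate_gb,
        "merged_gb": merged_gb,
        "saved_gb": separate_gb - merged_gb,
        "encode_machine_hours": (n_samples * low, n_samples * high),
        "merge_hours": n_samples * MERGE_HOURS_PER_SAMPLE,
    }


estimate = estimate_cohort(50)
print(estimate)
# 50 genomes: about 500 GB as separate files versus about 150 GB merged.
```

For 50 individuals, this gives roughly 500 GB stored separately against roughly 150 GB merged, with 1,000 to 1,500 machine-hours of encoding and about 50 hours of merging.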