Directly compare tens of thousands of whole genomes with the BioGraph™ Format.

BioGraph retains all of the sequence data for every sample and uses the redundancy and relationships between samples to reduce the overall size of the joint dataset.  Unlike reference-based compression methods, this technology provides rapid comparison at scale and structural variation detection in addition to data compression.


Harmonize Disparate Data

Collaboration simplified.

No matter which analysis pipeline was originally used to analyze the data, BioGraph will harmonize it for uniform analysis.  Force calling lets you take variant calls generated by multiple variant callers and check for evidence of those calls in any of your samples, against whichever reference is most convenient for your study.

Rapid Iterative Reanalysis

Add new samples with ease.

When conducting large studies, samples come in over time.  Adding these new samples to the set can require significant reanalysis.  With BioGraph, the raw read data can be searched at the rate of over 100,000 queries a second. Using the API, you can design queries to answer nearly any specific question.

Evidence for All Variants

Be confident in your variant calls.

The BioGraph Format retains all read information, including phasing information. This includes all read data indicating structural variants. If there is evidence of a particular variant in the read data, you can identify it in a fraction of a second.


The Underlying Technology

BioGraph Overview

The Spiral Genetics BioGraph™ Toolkit is an advanced, indexed, graph-based suite of products for genomic data management, analysis, and discovery.  The BioGraph Analysis Format is a method of storing NGS read data using a whole read overlap assembly that has been indexed with advanced technology.

  • Incorporates all read data across hundreds of individuals in a single read overlap graph.
  • The read overlap graph can be queried at a rate of over 100,000 queries a second on just 8 cores.
  • With the API, tools can be scripted to identify anything from gene fusions, de novo variants, and structural variants to the allele frequency of SNPs across 1,000 individuals.
  • The read data for multiple individuals can be stored in less than 3GB per individual.

This format allows for very rapid query of all the reads, not just those aligned to the reference. It is a graph representation, making it possible to directly compare between samples, even for areas that have not been mapped to the reference.

Essentially, the approach preprocesses input files into a graph format that both captures the sequence reads in their entirety and makes them much easier to search. The standard approach for detecting variants aligns input sequences to a reference, looks for mismatches, and then generates a list of variants, leaving out information about the reads from which the variant calls were made. In BioGraph, by contrast, reads are indexed in stacked graphs, which makes it easy for users to search for variations by following paths through the graphs.
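
To make the idea of a read overlap graph concrete, here is a toy sketch in Python. It is not the BioGraph implementation or its API: the function names, the quadratic all-against-all comparison, and the three example reads are assumptions chosen only for illustration. Reads become nodes, a suffix/prefix overlap becomes an edge, and following a path through the graph spells out a longer sequence.

```python
from collections import defaultdict

def build_overlap_graph(reads, min_overlap=3):
    """Toy overlap graph: {read: [(next_read, overlap_length), ...]}.

    Hypothetical helper for illustration only; real tools use indexed
    data structures rather than comparing every pair of reads.
    """
    graph = defaultdict(list)
    for a in reads:
        for b in reads:
            if a == b:
                continue
            # Longest suffix of `a` that is also a prefix of `b`.
            for k in range(min(len(a), len(b)) - 1, min_overlap - 1, -1):
                if a.endswith(b[:k]):
                    graph[a].append((b, k))
                    break
    return dict(graph)

def spell_path(path, graph):
    """Merge the reads along a path into the sequence that path spells."""
    seq = path[0]
    for prev, nxt in zip(path, path[1:]):
        overlap = dict(graph[prev])[nxt]
        seq += nxt[overlap:]
    return seq

reads = ["ACGTAC", "GTACGG", "ACGGTT"]
graph = build_overlap_graph(reads)
print(graph)                     # edges between overlapping reads
print(spell_path(reads, graph))  # sequence spelled by the path r1 -> r2 -> r3
```

In this sketch a variant would show up as a branch: two different reads that both overlap the same predecessor but spell different continuations.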


How is the BioGraph Format different?

A number of other graph-based methods take variant calls and arrange them in a graph-based structure to capture variation. But because the variation originates from variant calls, these graphs are susceptible to the same false discoveries and lower sensitivity observed in the variant callers that created them. Further, they may not naturally contain phasing information, which sometimes leads to an explosion of the graph structure at locations with many variants close together. The BioGraph Format, by contrast, is built directly from the reads and retains all read information, including phasing.


Encoding into the BioGraph Format

The BioGraph Format can be generated from a regular BAM file. For a 30x coverage whole human genome, a BioGraph file can be produced in 20 to 30 hours on a 24-core machine with 64 gigabytes of memory. The resulting file is approximately 10GB and stores all the reads from the original data. Graphs from multiple whole human genomes can be combined at a rate of approximately one genome per hour. Once some 50 individuals are merged together, the read data can be stored using approximately 3GB per individual.
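
As a rough back-of-the-envelope check on these figures, the short Python sketch below estimates storage and merge time for a cohort. It assumes the approximate per-sample numbers quoted above scale linearly, which the text does not state, and the function name is hypothetical.

```python
# Figures quoted in the text (all approximate).
SINGLE_SAMPLE_GB = 10        # one 30x genome stored on its own
MERGED_PER_SAMPLE_GB = 3     # per individual once ~50 individuals are merged
MERGE_HOURS_PER_SAMPLE = 1   # about one genome merged per hour

def cohort_estimate(n_samples):
    """Return (separate_gb, merged_gb, merge_hours) under a linear-scaling assumption."""
    separate_gb = n_samples * SINGLE_SAMPLE_GB
    merged_gb = n_samples * MERGED_PER_SAMPLE_GB
    merge_hours = n_samples * MERGE_HOURS_PER_SAMPLE
    return separate_gb, merged_gb, merge_hours

separate_gb, merged_gb, merge_hours = cohort_estimate(50)
print(f"50 genomes stored separately: ~{separate_gb} GB")
print(f"50 genomes merged:            ~{merged_gb} GB")
print(f"Time spent merging:           ~{merge_hours} hours")
```

Under that assumption, merging 50 genomes would bring the total from roughly 500GB down to roughly 150GB.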

Overview of the query capabilities

The BioGraph Format stores NGS read data using a whole read overlap assembly specifically indexed for rapid search.  All reads are assembled into a read overlap graph, and all read information is retained, including reads that do not align exactly to the reference. This allows for very rapid query of all reads, not just those aligned to the reference. All the phasing information available in the data is also retained.

Graphs for multiple individuals can be combined together, so it is possible to directly compare between samples, even for areas that have not been mapped to the reference. Further, multiple references can be added to this combined graph, allowing for sequences to be compared to the most convenient reference.

Once the graph representation is in memory, searching by sequence or location can be done at the rate of over 100,000 queries a second on an 8-core machine, even across hundreds of genomes. Using the API, Python scripts can be created to answer specific scientific questions. Genomes and groups of genomes can be compared directly with each other in a matter of minutes and hours rather than days and weeks.


Queries within a Single Genome

At location X for a particular reference, what sequence is present? What are all the reads that map to that location?

Does Sequence Y appear anywhere in the sample?  If so, in which reads, and if those reads aligned to any reference, where did they align?
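
The two single-genome questions can be illustrated with a small, self-contained Python sketch. It does not use the BioGraph API, which is not documented here; the `Read` record, the function names, and the linear scan over reads are all assumptions made for illustration, and the sketch ignores gapped alignments and reverse complements.

```python
from typing import NamedTuple, Optional

class Read(NamedTuple):
    name: str
    seq: str
    ref_start: Optional[int] = None  # 0-based alignment start; None if unaligned

def reads_at(reads, pos):
    """Reads whose (ungapped) alignment spans reference position `pos`."""
    return [r for r in reads
            if r.ref_start is not None
            and r.ref_start <= pos < r.ref_start + len(r.seq)]

def reads_with(reads, query):
    """(read name, alignment start or None) for every read containing `query`."""
    return [(r.name, r.ref_start) for r in reads if query in r.seq]

sample = [
    Read("r1", "ACGTACGT", 100),
    Read("r2", "TACGTTTG", 103),
    Read("r3", "GGGTTTGA"),  # unaligned, but still searchable by sequence
]

print([r.name for r in reads_at(sample, 104)])  # reads covering location X = 104
print(reads_with(sample, "TTTG"))               # reads containing Sequence Y
```

The unaligned read `r3` is found by the sequence query even though it has no reference position, which is the point of keeping all reads rather than only aligned ones.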


Queries across a Set of Genomes

At location X for a particular reference, what are the frequencies of all the sequences that occur?

Does Sequence Y appear in the set of samples?  If so, how many reads contain it for each individual, and if those reads aligned to any reference, where did they align?
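
A cohort-level version of these queries might look like the following toy Python sketch. The data layout (a dictionary of sample name to read sequences, and a dictionary of sample name to the sequence observed at one location) and both function names are assumptions for illustration, not the BioGraph API.

```python
from collections import Counter

def reads_per_sample(cohort, query):
    """Number of reads containing `query`, for each sample in the cohort."""
    return {sample: sum(query in seq for seq in seqs)
            for sample, seqs in cohort.items()}

def sequence_frequencies(observed_by_sample):
    """Relative frequency of each sequence observed at one location."""
    counts = Counter(observed_by_sample.values())
    total = sum(counts.values())
    return {seq: n / total for seq, n in counts.items()}

cohort = {
    "s1": ["ACGTTTGA", "TTTGCC"],
    "s2": ["ACGTACGT"],
    "s3": ["GGTTTG"],
}
print(reads_per_sample(cohort, "TTTG"))

# Sequence seen at location X in each sample (hypothetical values).
at_location_x = {"s1": "A", "s2": "A", "s3": "G", "s4": "A"}
print(sequence_frequencies(at_location_x))
```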


Comparisons between Multiple Sets of Genomes

    What sequences are unique in one set of genomes? If reads with this sequence align to any reference, where do they align?

    At what locations are there significant differences in the frequency of a particular allele? A specific query might be: at what locations are there over 90% of a particular allele in group A but less than 10% in group B.
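
The 90% versus 10% example can be written as a simple filter. The Python sketch below is a toy under stated assumptions: each group is a dictionary mapping a site to the list of alleles observed across that group's samples, and the function names and data are hypothetical rather than part of the BioGraph API.

```python
def allele_fraction(observed, allele):
    """Fraction of observations at a site that match `allele`."""
    return observed.count(allele) / len(observed) if observed else 0.0

def differential_sites(group_a, group_b, high=0.9, low=0.1):
    """Sites where an allele exceeds `high` in group A and is below `low` in group B."""
    hits = []
    for site in sorted(group_a.keys() & group_b.keys()):
        for allele in sorted(set(group_a[site])):
            fa = allele_fraction(group_a[site], allele)
            fb = allele_fraction(group_b[site], allele)
            if fa > high and fb < low:
                hits.append((site, allele, fa, fb))
    return hits

group_a = {
    ("chr1", 100): ["T"] * 19 + ["C"],
    ("chr1", 200): ["G"] * 10 + ["A"] * 10,
}
group_b = {
    ("chr1", 100): ["C"] * 20,
    ("chr1", 200): ["G"] * 12 + ["A"] * 8,
}

hits = differential_sites(group_a, group_b)
for site, allele, fa, fb in hits:
    print(site, allele, f"A={fa:.0%}", f"B={fb:.0%}")
```

Only the first site is reported here: allele T is present in 19 of 20 samples in group A and in none of group B, while the second site has similar frequencies in both groups.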

    Comparing SVs called by different callers

    Rare Disease Cases

    Make Custom Reference Genomes

    Large Cohort Genome Projects

    One of the greatest concerns facing large genome projects is that, once the raw read data are analyzed, their volume makes it very computationally expensive to reanalyze them to search for particular variants. By keeping all of the raw read data in the BioGraph Format, it is always possible to go back and perform further analyses rapidly, even when the data are very large. For example, if there were approximately 30,000 putative SV sites to be queried, SV-typing these sites across 10,000 HiSeq X WGS samples in BioGraph Analysis Format would require less than 100 TB of storage, 166 CPU hours, and about 2.5 hours of elapsed time.
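
The figures in this example can be unpacked with a few lines of arithmetic. The Python sketch below uses only the numbers quoted above; it assumes decimal units (1 TB = 1,000 GB) and reads the 2.5-hour figure as elapsed time, which is an interpretation rather than something the text states.

```python
sites = 30_000        # putative SV sites
samples = 10_000      # WGS samples
storage_tb = 100      # upper bound on storage
cpu_hours = 166
elapsed_hours = 2.5   # assumed to mean elapsed (wall-clock) time

site_sample_checks = sites * samples
checks_per_cpu_second = site_sample_checks / (cpu_hours * 3600)
avg_parallel_cores = cpu_hours / elapsed_hours
storage_gb_per_sample = storage_tb * 1000 / samples

print(f"Site-by-sample checks:    {site_sample_checks:,}")
print(f"Checks per CPU-second:    ~{checks_per_cpu_second:.0f}")
print(f"Average cores in use:     ~{avg_parallel_cores:.0f}")
print(f"Storage bound per sample: {storage_gb_per_sample:.0f} GB")
```

That works out to 300 million site-by-sample checks at roughly 500 per CPU-second, spread over an average of about 66 cores, with a storage bound of 10 GB per sample.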

    Because the breakpoints reported for structural variants are often fuzzy, it is also very difficult to compare those variants across individuals. Because the BioGraph Format uses whole read overlap, its breakpoints are exactly the same as, or very close to, the actual breakpoints (see BioGraph Assembly Results in the Structural Variants section of our website).  In addition, when constructing a multi-individual graph, a natural coordinate system emerges, allowing for rapid comparison of all paths, including structural variants. By directly comparing groups and individuals, there are fewer errors in identifying inter-individual differences that could be associated with disease.

    Additionally, a human reference can be a path through the graph. This means that multiple existing and future references can be added as paths through the graph. Paths can also be output as new reference genomes.


    Interested in trying BioGraph on your data?