The BioGraph Format stores next-generation sequencing (NGS) read data as a whole-read overlap assembly that is specially indexed for rapid search.

The format places all reads in a single read overlap graph and retains all read information, including reads that do not align exactly to the reference. This allows very rapid queries over all reads, not just those aligned to the reference. All phasing information available in the data is also retained.
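
To make the idea of a read overlap graph concrete, here is a minimal Python sketch. It is not the BioGraph implementation or its indexing scheme: the functions `overlap_length` and `build_overlap_graph`, the `min_overlap` parameter, and the toy reads are all assumptions made for illustration, and the all-pairs comparison is deliberately naive.

```python
from collections import defaultdict


def overlap_length(a, b, min_overlap):
    """Length of the longest suffix of `a` that equals a prefix of `b`, or 0."""
    max_len = min(len(a), len(b)) - 1  # skip identical / fully contained reads
    for k in range(max_len, min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def build_overlap_graph(reads, min_overlap=3):
    """Return {read_index: {next_read_index: overlap_length}} (hypothetical layout)."""
    graph = defaultdict(dict)
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = overlap_length(a, b, min_overlap)
                if k:
                    graph[i][j] = k
    return dict(graph)


# Toy data: the last read overlaps nothing but is still kept in `reads`.
reads = ["ACGTAC", "GTACGG", "ACGGTT", "TTTTTT"]
graph = build_overlap_graph(reads)
```

In this sketch every read stays in the `reads` list whether or not it overlaps anything or aligns to a reference, which mirrors the point above that no read information is discarded.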

Graphs for multiple individuals can be combined, making it possible to compare samples directly, even in regions that have not been mapped to the reference. Multiple references can also be added to the combined graph, so that sequences can be compared against whichever reference is most convenient.

The BioGraph Format supports a wide range of queries, including:

Searches within a Single Genome

  • At location X on a particular reference, what sequence is present? Which reads map to that location?
  • Does sequence Y appear anywhere in the sample? If so, in which reads, and where did those reads align, if they aligned to any reference?
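
The two single-genome questions above can be sketched on a toy in-memory read set. This is an illustration only and does not use the BioGraph API: the `Read` class, `reads_covering`, and `reads_containing` are hypothetical names, and the sketch assumes gapless alignments with 0-based start positions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Read:
    name: str
    sequence: str
    ref: Optional[str] = None  # reference name, None if the read is unaligned
    pos: Optional[int] = None  # 0-based start on the reference (assumed gapless)


def reads_covering(reads, ref, location):
    """Reads aligned to `ref` whose span includes `location`."""
    return [
        r for r in reads
        if r.ref == ref
        and r.pos is not None
        and r.pos <= location < r.pos + len(r.sequence)
    ]


def reads_containing(reads, query):
    """Reads whose sequence contains `query`, aligned or not."""
    return [r for r in reads if query in r.sequence]


reads = [
    Read("r1", "ACGTACGT", "chr1", 100),
    Read("r2", "TACGTTGA", "chr1", 103),
    Read("r3", "GGGTTGAC"),  # unaligned read, still searchable by sequence
]

covering = [r.name for r in reads_covering(reads, "chr1", 105)]
hits = [(r.name, r.ref, r.pos) for r in reads_containing(reads, "TTGA")]
```

Here `hits` includes the unaligned read `r3` with no reference position, which is the case the second question is meant to cover.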

Searches across a Set of Genomes

  • At location X on a particular reference, what is the frequency of each sequence that occurs?
  • Does sequence Y appear in the set of samples? If so, how many reads contain it in each individual, and where did those reads align, if they aligned to any reference?
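
As a rough illustration of the cross-sample questions above, the following sketch computes sequence frequencies at one location and per-sample read counts for a query sequence. The input layouts (`observations`, `sample_reads`) and both function names are assumptions for this example, not part of the BioGraph API.

```python
from collections import Counter


def allele_frequencies(observations):
    """Frequency of each sequence seen at one location, pooled over samples.

    `observations` maps a sample name to the list of sequences observed
    for that sample at the location (hypothetical layout).
    """
    counts = Counter(a for alleles in observations.values() for a in alleles)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def reads_per_sample(sample_reads, query):
    """Number of reads containing `query` in each sample."""
    return {
        sample: sum(query in read for read in reads)
        for sample, reads in sample_reads.items()
    }


observations = {
    "s1": ["A", "A"],
    "s2": ["A", "G"],
    "s3": ["G", "G"],
    "s4": ["A", "A"],
}
frequencies = allele_frequencies(observations)

sample_reads = {
    "s1": ["ACGTT", "TTGAC"],
    "s2": ["GGGGG"],
}
counts = reads_per_sample(sample_reads, "GT")
```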

Comparisons between Multiple Sets of Genomes

  • Which sequences are unique to one set of genomes? If reads containing such a sequence align to any reference, where do they align?
  • At which locations does the frequency of a particular allele differ significantly between groups? For example: at which locations does an allele have a frequency above 90% in group A but below 10% in group B?
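
The group comparisons above can be sketched as two small filters. This is a hedged illustration on toy data: `kmers`, `differential_sites`, the k-mer view of "unique sequences", and the frequency-table layout are assumptions for the example, while the 90% and 10% thresholds come from the query described above.

```python
def kmers(reads, k):
    """All length-k substrings present in a collection of reads."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}


def differential_sites(freq_a, freq_b, high=0.9, low=0.1):
    """(site, allele) pairs with frequency > high in group A and < low in group B.

    Each argument maps a site, here a (reference, position) tuple, to a
    dict of allele -> frequency (hypothetical layout).
    """
    hits = []
    for site in sorted(freq_a.keys() & freq_b.keys()):
        for allele, fa in freq_a[site].items():
            fb = freq_b[site].get(allele, 0.0)  # absent allele counts as 0%
            if fa > high and fb < low:
                hits.append((site, allele))
    return hits


# Sequences (as 4-mers) present in group A reads but absent from group B.
unique_to_a = kmers(["ACGTA"], 4) - kmers(["ACGTT"], 4)

freq_a = {
    ("chr1", 1000): {"A": 0.95, "G": 0.05},
    ("chr1", 2000): {"C": 0.5, "T": 0.5},
    ("chr2", 500): {"T": 1.0},
}
freq_b = {
    ("chr1", 1000): {"A": 0.02, "G": 0.98},
    ("chr1", 2000): {"C": 0.5, "T": 0.5},
    ("chr2", 500): {"T": 0.95, "C": 0.05},
}
sites = differential_sites(freq_a, freq_b)
```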

Query Speed and Flexibility

Once the graph representation is loaded into memory, searches by sequence or location run at over 200,000 queries per second on an 8-core machine, even across hundreds of genomes. Python scripts written against the API can answer specific scientific questions. Genomes and groups of genomes can be compared directly with each other in minutes or hours rather than days or weeks.
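
The load-once, query-many pattern behind such throughput figures can be sketched with a simple in-memory k-mer index. This is not the BioGraph index or API, and it says nothing about BioGraph's actual performance: `build_kmer_index` and `measure_throughput` are hypothetical helpers, and the measured rate depends entirely on the machine and data.

```python
import time
from collections import defaultdict


def build_kmer_index(reads, k):
    """Map each k-mer to the set of read indices that contain it."""
    index = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(read_id)
    return index


def measure_throughput(index, queries):
    """Run all queries against the prebuilt index; return results and queries/second."""
    start = time.perf_counter()
    results = [sorted(index.get(q, ())) for q in queries]
    elapsed = time.perf_counter() - start
    rate = len(queries) / elapsed if elapsed > 0 else float("inf")
    return results, rate


reads = ["ACGTACGT", "TACGTTGA", "GGGTTGAC"]
index = build_kmer_index(reads, k=4)  # built once, then reused for every query
results, queries_per_second = measure_throughput(index, ["ACGT", "TTGA", "CCCC"] * 1000)
```

The index is built a single time and each query is then a dictionary lookup, which is why the cost of loading is paid once while the per-query cost stays small.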