Introducing BioGraph™ Assembly
BioGraph Assembly is a structural variant caller built on top of Spiral's BioGraph Format.
One of the biggest challenges in large-scale sequencing is the ability to detect complex changes in the genome. These complex structural changes are important for understanding neurological disorders, cardiological conditions, rare childhood disorders, and investigating changes across individuals. Today's tools often miss many structural variations or are limited to one type, such as deletions. In addition, most tools have high false discovery rates, requiring considerable time sifting through structural variant calls to differentiate between real variants and false positives.
BioGraph Assembly offers a strong combination of high sensitivity and low false discovery rate with accurate breakpoints.
Comparison to the Genome in a Bottle Tier 1 SV Call Set
BioGraph Assembly achieves 47% recall against the Genome in a Bottle Tier 1 SV call set (v0.6), as measured with Truvari (https://github.com/spiralgenetics/truvari). Note that this call set includes calls made using long-read sequencing. Even so, this recall rate is higher than that of many inference-based callers, as the diagram shows.
BioGraph Assembly leverages high-throughput short-read NGS data to detect comprehensive variation, including large indels (20 bp), larger structural variation events such as segmental duplications (1 kb - 20 kb+), and full insertion sequence recovery (100 bp - 5 kb+). By assembling reads first and mapping back to the reference afterward, we dramatically reduce reference bias and can accurately detect changes that other methods often miss. Specifically, the method searches for reads that have not aligned to the reference and then follows the read overlap assembly in both directions until over 70% of the reads match the reference. The resulting assembly is reported as a variant. Utilizing the BioGraph Format, this search for structural variants can be completed in approximately 4 hours.
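The seed-and-extend idea described above can be illustrated with a toy sketch. This is not Spiral's implementation: the function names (`suffix_prefix_overlap`, `extend_right`, `assemble_from_seed`), the greedy overlap walk, and the toy sequences are all assumptions made for illustration, and the "over 70% of the reads match the reference" stopping rule is replaced by a much simpler one (stop at the first read that is an exact substring of the reference).

```python
def suffix_prefix_overlap(a, b, min_overlap):
    """Longest n >= min_overlap such that a ends with b[:n]; 0 if none."""
    for n in range(min(len(a), len(b)) - 1, min_overlap - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def extend_right(seed, reads, reference, min_overlap):
    """Greedily follow read overlaps to the right of the seed.

    Toy anchoring rule (an assumption): stop as soon as the newly added
    read is an exact substring of the reference.
    """
    contig, current, used = seed, seed, {seed}
    while True:
        best, best_n = None, 0
        for r in reads:
            if r in used:
                continue
            n = suffix_prefix_overlap(current, r, min_overlap)
            if n > best_n:
                best, best_n = r, n
        if best is None:
            return contig, False  # ran out of overlaps before re-anchoring
        contig += best[best_n:]
        used.add(best)
        current = best
        if best in reference:
            return contig, True


def assemble_from_seed(seed, reads, reference, min_overlap=5):
    """Extend a seed read in both directions until both ends re-anchor."""
    right, right_ok = extend_right(seed, reads, reference, min_overlap)
    # Extending left is the same walk on reversed sequences.
    left_rev, left_ok = extend_right(
        seed[::-1], [r[::-1] for r in reads], reference[::-1], min_overlap
    )
    contig = left_rev[::-1] + right[len(seed):]
    return contig, left_ok and right_ok


# Toy data: a 10 bp insertion placed in the middle of a 24 bp reference.
reference = "GATTACAGGCTTCAAGTCCGATAG"
insertion = "TGCATGGCCA"
sample = reference[:12] + insertion + reference[12:]
reads = [sample[i:i + 8] for i in range(len(sample) - 7)]  # error-free 8-mers

# Reads that do not match the reference exactly stand in for "unaligned" reads.
unaligned = [r for r in reads if r not in reference]
contig, anchored = assemble_from_seed(unaligned[0], reads, reference)
print(anchored, contig)
```

In this toy, the assembled contig spans the whole insertion plus one anchoring read on each side, so mapping the contig ends back to the reference would locate both breakpoints. Real data would also need handling for sequencing errors, repeats, and branching overlaps, none of which is modeled here.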
NGS is moving toward comparing samples, whether a family trio or a large case-control study. Accuracy in calling breakpoints is important for comparing samples. A lead bioinformatics programmer at the Baylor College of Medicine said that “BioGraph Assembly breakpoints have been shown to match very closely if not exactly with PCR-validated breakpoints.”
Often, structural variant callers are good at detecting only one particular type or size range of structural variant. The Baylor College of Medicine compared multiple structural variant callers. William Salerno, Bioinformatic Scientist at Baylor, says that “when looking at our high-confidence SV events with long-read PacBio support, we often found that Spiral was the only Illumina-based method contributing to that discovery. We even assemble long insertions with sufficient resolution to identify differences from parents.”
Low false discovery rate
Even when variant callers are able to detect variants, they often have a high rate of false discoveries. BioGraph Assembly has a false discovery rate of around 4%, up to roughly 10 times lower than that of current standard structural variant callers.
Validation with Baylor HGSC - SV Caller Comparison
Percent Sensitivity Comparison of SV Tools
The Human Genome Sequencing Center at Baylor College of Medicine compared a number of short-read SV tools on their ability to detect SVs larger than 100 bp using Illumina data. To compare these tools, a set of “true” variants was first created. Any variant that was discovered by two or more variant callers was considered validated. After creating the true variant set, each of the bioinformatics tools was compared against this set.
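As a rough illustration of this kind of comparison, the sketch below builds a consensus "true" set from calls supported by two or more tools and then scores each tool against it. The tool names, the toy calls, and the exact-match rule on `(chromosome, position, type)` tuples are assumptions for illustration only; a real comparison would match breakpoints within a tolerance rather than exactly.

```python
from collections import Counter


def consensus_truth(calls_by_tool, min_support=2):
    """Variants reported by at least `min_support` tools."""
    counts = Counter(v for calls in calls_by_tool.values() for v in set(calls))
    return {v for v, n in counts.items() if n >= min_support}


def sensitivity_and_fdr(calls, truth):
    """Sensitivity = TP / |truth|; FDR = FP / |calls|."""
    calls = set(calls)
    true_positives = len(calls & truth)
    sensitivity = true_positives / len(truth) if truth else 0.0
    fdr = (len(calls) - true_positives) / len(calls) if calls else 0.0
    return sensitivity, fdr


# Hypothetical toy calls, keyed by hypothetical tool names.
calls_by_tool = {
    "tool_a": {("chr1", 1000, "DEL"), ("chr2", 5000, "INS"), ("chr3", 700, "DEL")},
    "tool_b": {("chr1", 1000, "DEL"), ("chr2", 5000, "INS"), ("chr4", 42, "INV")},
    "tool_c": {("chr1", 1000, "DEL"), ("chr5", 9, "DUP")},
}

truth = consensus_truth(calls_by_tool)
scores = {tool: sensitivity_and_fdr(calls, truth) for tool, calls in calls_by_tool.items()}
for tool, (sens, fdr) in sorted(scores.items()):
    print(f"{tool}: sensitivity={sens:.2f} FDR={fdr:.2f}")
```

One consequence of this design is that a variant found by only a single tool always counts as a false discovery, even if it is real, so a caller that finds events the others miss is penalized by construction.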
We can see that the BioGraph Assembly (formerly Anchored Assembly) method combines high sensitivity with a low false discovery rate. The low false discovery rate may be attributed to the additional context gained by using the whole read to identify where in the reference or in an assembly that read actually belongs.
Percent FDR Comparison for SV Tools
When comparing this data with the other callers, it is clear that a majority of the calls that BioGraph Assembly made were validated by other methods. Further, all of the other major variant callers generate a tremendous number of erroneous variants. With Pindel, over 30% of all calls were false discoveries.
The researchers at Baylor noted that they “were pleased with its breakpoint resolution and accuracy. Spiral [BioGraph] breakpoints matched very closely if not exactly with PCR-validated breakpoints. Also, when looking at our high-confidence SV events (i.e., those with long-read PacBio support), we often found that Spiral [BioGraph] Assembly was the only Illumina-based method contributing to that discovery.”
Validation with Stanford Sidow Lab - Base-Pair Resolution Insertion
BioGraph Assembly was used to analyze the Ashkenazi Jewish trio data set sequenced by the Personal Genome Project (PGP) with Illumina 2x150bp libraries at 50x coverage per individual. Using bedtools on the output VCF files, we found 3,993 structural variants present in the offspring, including 3,083 deletions, 562 de novo insertions, and 348 other breakpoints. Of 10 SVs in the offspring selected at random, all showed logical consistency of calls and segregation within the trio.
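A per-type tally like the one above can be sketched by reading the SVTYPE key from each record's INFO column. This is a minimal sketch rather than the bedtools workflow used for the numbers above: it assumes a standard tab-delimited VCF with INFO in the eighth column, and the toy records and the `BND` label for other breakpoints are illustrative assumptions.

```python
from collections import Counter


def count_sv_types(vcf_lines):
    """Tally records by the SVTYPE key in the INFO column (column 8)."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        info = line.rstrip("\n").split("\t")[7]
        # Keep key=value entries; flag-style entries without "=" are ignored.
        fields = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
        counts[fields.get("SVTYPE", "UNKNOWN")] += 1
    return counts


# Hypothetical toy records, not taken from the trio VCFs.
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "\t".join(["chr1", "1000", "sv_1", "A", "<DEL>", "100", "PASS", "NS=1;SVTYPE=DEL;SVLEN=-250"]),
    "\t".join(["chr2", "5000", "sv_2", "T", "<INS>", "100", "PASS", "NS=1;SVTYPE=INS;SVLEN=3405"]),
    "\t".join(["chr3", "7000", "sv_3", "G", "<DEL>", "100", "PASS", "NS=1;SVTYPE=DEL;SVLEN=-90"]),
    "\t".join(["chr4", "9000", "sv_4", "C", "C[chr9:100[", "100", "PASS", "NS=1;SVTYPE=BND"]),
]

counts = count_sv_types(toy_vcf)
print(dict(counts), sum(counts.values()))
```

The same kind of check applies to the reported figures: 3,083 + 562 + 348 = 3,993, so the three categories account for every variant in the total.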
Example: 3.4 kb Insertion on Chr8
The insertion is homozygous in the offspring (HG002) and heterozygous in both the father (HG003) and mother (HG004). Noah Spies of the Sidow lab at Stanford University produced the following figures using SVViz. Below is a stylized representation of those results.
The inserted sequence was resolved down to the base pair. The difference in length of the insertion between family members is accounted for by a 1bp indel in the father passed down to the offspring. Further, there are 5 SNVs in the father and 1 SNV in the mother that have also been passed down.
The VCFs for this insertion are nearly identical for the parents and offspring:
HG003 chr8 129739066 sv_6699955 AATAAA 100 masking_present NS=1; DP=47; SVTYPE=INS; END=129739071; SVLEN=3405; DP:AD 47:32,15
HG004 chr8 129739066 sv_8325876 AATAAA 100 masking_present NS=1; DP=41; SVTYPE=INS; END=129739071; SVLEN=3404; AID=8325876 DP:AD 41:28,13
HG002 chr8 129739066 sv_8095366 AATAAA 100 masking_present NS=1; DP=18; SVTYPE=INS; END=129739071; SVLEN=3405; DP:AD 18:0,18
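The DP:AD fields in these records give the read depth and the per-allele read support, which can be turned into a rough zygosity estimate. The sketch below is an illustration only: the function names and the 0.2/0.8 cutoffs are assumptions, not part of BioGraph Assembly, and a real genotyper would use a statistical model rather than fixed thresholds.

```python
def alt_allele_fraction(format_keys, sample_values):
    """Fraction of allele-supporting reads that support the alternate allele."""
    fields = dict(zip(format_keys.split(":"), sample_values.split(":")))
    ref_depth, alt_depth = (int(x) for x in fields["AD"].split(","))
    total = ref_depth + alt_depth
    return alt_depth / total if total else 0.0


def naive_zygosity(fraction, low=0.2, high=0.8):
    """Crude label from the alt fraction; the cutoffs are arbitrary assumptions."""
    if fraction < low:
        return "reference-like"
    if fraction > high:
        return "homozygous-alt-like"
    return "heterozygous-like"


# DP:AD values copied from the three records above.
trio = {"HG003": "47:32,15", "HG004": "41:28,13", "HG002": "18:0,18"}

fractions = {s: alt_allele_fraction("DP:AD", v) for s, v in trio.items()}
labels = {s: naive_zygosity(f) for s, f in fractions.items()}
for sample in trio:
    print(sample, round(fractions[sample], 2), labels[sample])
```

With these values, roughly a third of the reads in each parent support the insertion, while all 18 reads in HG002 do.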
Below is an overlay of the complete insertion output sequence, with the variants within the insertion colored. The SNPs and single-base insertion from the father are indicated in blue, and the SNP from the mother is indicated in green.
Validation with University of Washington Eichler Lab - Fosmid/PacBio Comparison
In collaboration with the Eichler lab at the University of Washington, we validated a selection of large structural variations from NA12878 not found with other short read bioinformatics methods. We find variants in the new human reference, GRCh38.
Using BioGraph Assembly, we detected 1,889 structural variants using Illumina short-read data on the individual NA12878 that were called exactly the same in two samples. We selected the longest variations from the set of previously unreported variants. We confirmed the existence of a subset using long-read PacBio sequencing of selected fosmids from an NA12878 fosmid library. These included nine deletions ranging from 1 kb to 26.8 kb in size, six de novo sequence insertions from 1.5 kb to 2.5 kb, and three inversions. Of the three inversions selected, one was confirmed as a 252 bp simple inversion. The second location was a sequence that had been copied and placed back in the genome, which was consistent with the breakpoint reported but was not a simple inversion. The third location was an event in a repetitive region that could not be validated with this methodology. These results confirm the ability of read overlap assembly to detect variants not otherwise detected by other bioinformatics methods using Illumina short-read data.