Baylor HGSC Comparison of SV Callers

 

The Human Genome Sequencing Center at Baylor College of Medicine compared several short-read structural variant (SV) callers on their ability to detect SVs larger than 100 bp from Illumina data. To make the comparison, a set of “true” variants was first created: any variant discovered by two or more variant callers was considered validated. Each of the bioinformatics tools was then scored against this set.
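The consensus rule and the two metrics reported for each caller can be illustrated with a minimal Python sketch. It is not the Baylor pipeline: the caller names and variant keys are made-up toy data, and variants are matched by exact (chromosome, position, type) keys, which is a simplifying assumption (the matching criteria used in the study are not described here).

```python
from collections import Counter


def consensus_truth_set(calls_by_caller, min_callers=2):
    """Return the variants reported by at least `min_callers` callers."""
    support = Counter()
    for calls in calls_by_caller.values():
        support.update(set(calls))  # count each caller at most once per variant
    return {variant for variant, n in support.items() if n >= min_callers}


def sensitivity_and_fdr(calls, truth):
    """Sensitivity = TP / |truth|; false discovery rate = FP / |calls|."""
    calls = set(calls)
    true_pos = len(calls & truth)
    sensitivity = true_pos / len(truth) if truth else 0.0
    fdr = (len(calls) - true_pos) / len(calls) if calls else 0.0
    return sensitivity, fdr


# Hypothetical toy calls, keyed by (chromosome, position, SV type).
calls_by_caller = {
    "caller_a": {("chr1", 1000, "DEL"), ("chr2", 5000, "INS")},
    "caller_b": {("chr1", 1000, "DEL"), ("chr3", 700, "DEL")},
    "caller_c": {("chr1", 1000, "DEL"), ("chr2", 5000, "INS"), ("chr4", 42, "DEL")},
}

truth = consensus_truth_set(calls_by_caller)
for name, calls in sorted(calls_by_caller.items()):
    sens, fdr = sensitivity_and_fdr(calls, truth)
    print(f"{name}: sensitivity={sens:.2f}, FDR={fdr:.2f}")
```

Note that under a consensus definition, a call made by only one tool always counts as a false discovery, even if the variant is real.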

 

 

Program                     Sensitivity    False Discovery Rate
Spiral Anchored Assembly    49%            6%
BreakDancer                 42%            59%
Delly                       31%            55%
Pindel                      57%            32%

Source: English et al. (2015), updated

The Anchored Assembly method combines relatively high sensitivity (49%) with by far the lowest false discovery rate (6%). The low false discovery rate may be attributable to the additional context gained by using the whole read to determine where in the reference, or in an assembly, that read actually belongs.

Compared with the other callers, the majority of calls Anchored Assembly made were validated by other methods. By contrast, all of the other major variant callers generated a large number of erroneous calls. Even with Pindel, over 30% of all calls were false discoveries.

The researchers at Baylor noted that they “were pleased with its breakpoint resolution and accuracy. Spiral Anchored breakpoints matched very closely if not exactly with PCR-validated breakpoints. Also, when looking at our high-confidence SV events (i.e., those with long-read PacBio support), we often found that Spiral Anchored Assembly was the only Illumina-based method contributing to that discovery.”

Ashkenazi Jewish Trio

Anchored Assembly was used to analyze the Ashkenazi Jewish trio data set sequenced by the Personal Genome Project (PGP) with Illumina 2x150 bp libraries at 50x coverage per individual. Using bedtools on the output VCF files, we found 3,993 structural variants present in the offspring, including 3,083 deletions, 562 de novo insertions, and 348 other breakpoints. Of 10 SVs in the offspring selected at random, all showed logical consistency of calls and segregation within the trio.
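One way to think about "logical consistency and segregation within the trio" is a Mendelian check: the offspring's genotype should be obtainable by taking one allele from each parent. The sketch below is only an illustration of that idea, not the procedure used for the 10 SVs above; the function name and the genotype encoding (0 = reference allele, 1 = variant allele) are assumptions.

```python
from itertools import product


def is_mendelian_consistent(child, father, mother):
    """True if the child's genotype can be formed from one paternal and one maternal allele.

    Genotypes are unordered pairs of allele codes, e.g. (0, 1) for heterozygous.
    """
    child_sorted = tuple(sorted(child))
    return any(
        tuple(sorted(pair)) == child_sorted
        for pair in product(father, mother)
    )


# Two heterozygous parents can have a child with any genotype.
print(is_mendelian_consistent((1, 1), (0, 1), (0, 1)))
# A heterozygous child of two homozygous-reference parents is inconsistent
# (or a candidate de novo event).
print(is_mendelian_consistent((0, 1), (0, 0), (0, 0)))
```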

Example: 3.4 kb Insertion on Chr8

The insertion is homozygous in the offspring (HG002) and heterozygous in both the father (HG003) and mother (HG004). Noah Spies of the Sidow lab at Stanford University produced the following figures using SVViz.

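Father (HG003)

[Figure: chr8 insertion, father, alt allele]

[Figure: chr8 insertion, father, ref allele]

Mother (HG004)

[Figure: chr8 insertion, mother, alt allele]

[Figure: chr8 insertion, mother, ref allele]

Offspring (HG002)

[Figure: chr8 insertion, offspring, alt allele]

[Figure: chr8 insertion, offspring, ref allele]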

The inserted sequence was resolved down to the base pair. The 1 bp difference in insertion length between family members (3,405 bp in the father and offspring, 3,404 bp in the mother) is accounted for by a 1 bp indel in the father that was passed down to the offspring. Further, there are 5 SNVs in the father and 1 SNV in the mother that have also been passed down.

The VCF records for this insertion are nearly identical for the parents and offspring, differing only in record ID, read depth, and the 1 bp difference in SVLEN:

HG003 chr8 129739066 sv_6699955 AATAAA 100 masking_present NS=1; DP=47; SVTYPE=INS; END=129739071; SVLEN=3405; DP:AD 47:32,15

HG004 chr8 129739066 sv_8325876 AATAAA 100 masking_present NS=1; DP=41; SVTYPE=INS; END=129739071; SVLEN=3404; AID=8325876 DP:AD 41:28,13

HG002 chr8 129739066 sv_8095366 AATAAA 100 masking_present NS=1; DP=18; SVTYPE=INS; END=129739071; SVLEN=3405; DP:AD 18:0,18
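The DP:AD values at the end of each record give the total read depth and the per-allele read counts. As a rough sketch, assuming AD is ordered reference,alternate (the usual VCF convention), the fraction of reads supporting the insertion can be computed as below. A fraction near 0.5 is suggestive of a heterozygous call and a fraction near 1.0 of a homozygous one. The cutoffs are illustrative assumptions, not values used by Anchored Assembly.

```python
def alt_allele_fraction(ad_field):
    """Fraction of reads supporting the alternate allele, from an 'ref,alt' AD string."""
    ref_reads, alt_reads = (int(x) for x in ad_field.split(","))
    total = ref_reads + alt_reads
    return alt_reads / total if total else None


def rough_zygosity(fraction, het_range=(0.2, 0.8)):
    """Crude zygosity guess from the alt fraction; het_range is a hypothetical cutoff."""
    if fraction is None:
        return "no-call"
    low, high = het_range
    if fraction < low:
        return "likely hom-ref"
    if fraction > high:
        return "likely hom-alt"
    return "likely het"


# AD values copied from the three records above.
ad_by_sample = {"HG003": "32,15", "HG004": "28,13", "HG002": "0,18"}

for sample, ad in ad_by_sample.items():
    fraction = alt_allele_fraction(ad)
    print(sample, f"{fraction:.2f}", rough_zygosity(fraction))
```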

You can find the complete sequence in the original VCF files, found on the NCBI website.

You can also download the gzipped VCFs directly: offspring (HG002), father (HG003), mother (HG004).

Validation with Fosmid/PacBio Sequencing

In collaboration with the Eichler lab at the University of Washington, we validated a selection of large structural variations from NA12878 that were not found with other short-read bioinformatics methods. We also find variants in the new human reference, GRCh38. Contact us for more information.