De novo assembly is one of the best methods for reconstructing a DNA sequence from short reads. However, as the length of the sequenced DNA increases, so does the computational complexity of the problem. To construct variants accurately for large whole genomes, we dramatically reduce the number of unique reads in two steps: first, we correct reads for sequencing errors; second, we remove reads that match the reference exactly. We then identify anchors, which are reads in which a contiguous 80% of the read maps uniquely to the reference. Starting from a remaining read, we perform whole-read overlap assembly until we reach an anchor at both ends. Because the overlaps use the entire read, there is more certainty in the variants generated. This leads to higher sensitivity and a low false positive rate, even on large whole genomes.
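The filtering, anchoring and extension steps described above can be illustrated with a toy sketch. This is not the actual implementation: the function names, the exact-substring matching, the greedy choice of the longest overlap, the `min_overlap` threshold and the 10 bp example reads are all assumptions made for illustration. Error correction is omitted, anchors are tested only with windows of exactly 80% of the read length, and extension runs in one direction from an anchored read to the next anchored read.

```python
def count_occurrences(text, pattern):
    """Count (possibly overlapping) exact occurrences of pattern in text."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def drop_exact_matches(reads, reference):
    """Discard reads that occur verbatim in the reference."""
    return [r for r in reads if r not in reference]


def find_anchor(read, reference, min_percent=80):
    """Return (read_offset, ref_position) of a contiguous window covering
    min_percent of the read that occurs exactly once in the reference,
    or None if no such window exists."""
    window = (len(read) * min_percent + 99) // 100  # integer ceiling
    for offset in range(len(read) - window + 1):
        piece = read[offset:offset + window]
        if count_occurrences(reference, piece) == 1:
            return offset, reference.find(piece)
    return None


def overlap(left, right, min_overlap):
    """Length of the longest suffix of left that equals a prefix of right,
    or 0 if it is shorter than min_overlap."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def assemble_between_anchors(seed, reads, reference, min_overlap=4):
    """Greedily extend an anchored seed read to the right, one whole read
    at a time, until the read just added is itself anchored."""
    contig, current = seed, seed
    unused = [r for r in reads if r != seed]
    while unused:
        best = max(unused, key=lambda r: overlap(current, r, min_overlap))
        size = overlap(current, best, min_overlap)
        if size == 0:
            return None  # no read overlaps well enough; give up
        contig += best[size:]
        unused.remove(best)
        current = best
        if find_anchor(best, reference) is not None:
            return contig
    return None


# Hypothetical 30 bp reference; the sample carries an 8 bp insertion
# ("GGGTCCCA") after reference position 15.
reference = "ACGTTGCATGCCGATAGCTTAGGCTAACGT"
reads = [
    "ACGTTGCATG",  # matches the reference exactly, so it is removed
    "ATGCCGATGG",  # left anchor: 8 of 10 bases map uniquely
    "GATGGGTCCC",  # spans the start of the insertion, not anchored
    "GTCCCAAGCT",  # spans the end of the insertion, not anchored
    "CAAGCTTAGG",  # right anchor: 8 of 10 bases map uniquely
    "TAGGCTAACG",  # matches the reference exactly, so it is removed
]

candidates = drop_exact_matches(reads, reference)
anchored = [r for r in candidates if find_anchor(r, reference) is not None]
contig = assemble_between_anchors(anchored[0], candidates, reference)
print(contig)
```

In this toy case the contig starts and ends in uniquely mapped reference sequence, so the 8 bp between the two anchored flanks can be read off as an insertion relative to the reference.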
To test Anchored Assembly, we simulated SNPs, indels and structural variants on human chromosome 22. For insertions, deletions, tandem repeats, inversions and SNPs, Anchored Assembly identified over 95% of simulated variants at almost all sizes. For the indels and structural variants, we observed about one false positive per 900 simulated variants. In real data, we have observed 32 kb tandem repeats, a 57.6 kb insertion, a 47.6 kb deletion and even transpositions across chromosomes, all with 100 bp HiSeq® reads.