SpEC vs. CRAM

To compare the two methods, we compressed a standard 50x NA12878 data set. The original data, in BAM format, is 114 GB. CRAM was run in lossless mode. Both compressions were run on the same machine, with 32 GB of RAM and 16 cores.

Compression Comparison

[Figure: SpEC vs. CRAM compression comparison]

When the SpEC file is decompressed, the original BAM file is recreated bit for bit. When the CRAM file is decompressed, all the data are restored, but the fields are in a different order than in the original BAM, which could affect downstream scripts that need to be rerun. Note that SpEC can compress any BAM file, regardless of the aligner used.
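A bit-for-bit round trip can be checked independently of either tool by comparing checksums of the original and the restored file. The following is a minimal sketch using only the Python standard library; the function names and the file names in the usage note are hypothetical, and it assumes the decompressed BAM has already been written to disk.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so a 100+ GB BAM never has to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_identical(original, restored):
    """True if both files have the same size and the same SHA-256 digest."""
    # Cheap size check first; hashing is only needed when the sizes match.
    if Path(original).stat().st_size != Path(restored).stat().st_size:
        return False
    return sha256_of(original) == sha256_of(restored)
```

For example, `is_identical("NA12878.bam", "NA12878.restored.bam")` (hypothetical file names) should return `True` for a byte-identical round trip. A restored file whose fields are reordered would be expected to fail this check even though no data were lost, so a checksum mismatch alone does not indicate lossy compression.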

Run Time Comparison

[Figure: SpEC vs. CRAM run time comparison]

The wall clock time to compress the file to CRAM on this machine was close to 7 hours. In comparison, the same file was converted to SpEC format in around 2 hours on the same machine. This is a much smaller time commitment for compressing the data, especially when generating very large numbers of sequences.

This analysis did not consider other benefits, such as still being able to access the data like a regular BAM file once it is compressed.
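To put the run times in perspective, they can be turned into an approximate throughput and speedup. The sketch below uses only the figures quoted in this section (114 GB input, roughly 7 hours for CRAM, roughly 2 hours for SpEC); because both times are rounded, the results are rough estimates, not measurements.

```python
# Figures quoted in the text; both run times are approximate.
BAM_SIZE_GB = 114
CRAM_HOURS = 7  # "close to 7 hours"
SPEC_HOURS = 2  # "around 2 hours"


def throughput_gb_per_hour(size_gb, hours):
    """Input data processed per hour of wall clock time."""
    return size_gb / hours


def speedup(slow_hours, fast_hours):
    """How many times faster the second run is than the first."""
    return slow_hours / fast_hours


cram_rate = throughput_gb_per_hour(BAM_SIZE_GB, CRAM_HOURS)
spec_rate = throughput_gb_per_hour(BAM_SIZE_GB, SPEC_HOURS)

print(f"CRAM: {cram_rate:.1f} GB/h")                        # CRAM: 16.3 GB/h
print(f"SpEC: {spec_rate:.1f} GB/h")                        # SpEC: 57.0 GB/h
print(f"Speedup: {speedup(CRAM_HOURS, SPEC_HOURS):.1f}x")   # Speedup: 3.5x
```

On these rounded figures, SpEC processes the input about 3.5 times faster than lossless CRAM on the same hardware.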

Download the Files

We took the Platinum Genomes NA12878 50x data set and converted it using SpEC and lossless CRAM. SpEC compresses both the reads and the other fields, optimizing the compression of the data that are stored. SpEC was created from the ground up to ensure that the original file can be recreated, byte for byte. Click the links below to download the original BAM file, as well as the compressed SpEC and CRAM files.

The decoder requires the human reference that was used to create the SpEC file. Use the reference found at the following link.