GDC—What is it?

Genome Differential Compressor (GDC) is a utility designed for compressing collections of genomes from the same species. Such collections can be huge, e.g., a few (or tens of) gigabytes, so the need for a robust data compression tool is clear. Universal compression programs like gzip or bzip2 can be used for this purpose, but a specialized tool can work much better, since a universal compressor does not exploit the properties of such data sets, e.g., long approximate repetitions at long distances.
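As a toy illustration of why similarity between genomes helps, the sketch below encodes one sequence as copies from a second, highly similar sequence plus occasional literal symbols. It is not the GDC algorithm or its API: the names `Phrase`, `encode_relative`, and `decode_relative` and the `min_match` threshold are hypothetical, and the match search is deliberately naive.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One phrase of the encoded target (hypothetical layout, for illustration only):
// either a copy of `length` symbols from the reference, or one literal symbol.
struct Phrase {
    bool is_copy;
    std::size_t ref_pos;  // start in the reference (used when is_copy)
    std::size_t length;   // number of symbols covered (1 for a literal)
    char literal;         // the symbol itself (used when !is_copy)
};

// Greedy encoder: at each target position, take the longest match found
// anywhere in the reference. Matches shorter than min_match become literals.
// The search is quadratic, which is fine for a toy but not for real genomes.
std::vector<Phrase> encode_relative(const std::string& ref,
                                    const std::string& target,
                                    std::size_t min_match = 4) {
    std::vector<Phrase> out;
    std::size_t i = 0;
    while (i < target.size()) {
        std::size_t best_len = 0, best_pos = 0;
        for (std::size_t p = 0; p < ref.size(); ++p) {
            std::size_t len = 0;
            while (p + len < ref.size() && i + len < target.size() &&
                   ref[p + len] == target[i + len]) {
                ++len;
            }
            if (len > best_len) { best_len = len; best_pos = p; }
        }
        if (best_len >= min_match) {
            out.push_back({true, best_pos, best_len, '\0'});
            i += best_len;
        } else {
            out.push_back({false, 0, 1, target[i]});
            ++i;
        }
    }
    return out;
}

// Decoder: replay the phrases against the same reference.
std::string decode_relative(const std::string& ref,
                            const std::vector<Phrase>& phrases) {
    std::string out;
    for (const Phrase& ph : phrases) {
        if (ph.is_copy) {
            assert(ph.ref_pos + ph.length <= ref.size());
            out.append(ref, ph.ref_pos, ph.length);
        } else {
            out.push_back(ph.literal);
        }
    }
    return out;
}
```

For the reference `ACGTACGTTTGACCA` and the target `ACGTACGTTAGACCA` (one substituted symbol), this encoder produces three phrases: a copy of 9 symbols, one literal, and a copy of 5 symbols. A universal compressor with a limited window cannot find such matches when the two genomes lie far apart in the input.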

The architecture of GDC

GDC is designed as a C++ library that can be used by various applications. The key features of the library are:

  • compression of collections of genomes in FASTA format,
  • decompression of the whole collection,
  • decompression of only a single genome without decompressing the complete collection,
  • decompression of any part of any chromosome from any genome of the collection without decompressing the complete collection and the complete genome (i.e., random access functionality).
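The random access feature in the last item can be pictured with a generic sketch: if a sequence is stored as phrases that each record where they start in the decoded output, a binary search finds the first phrase needed and only the overlapping phrases are expanded. This is an assumption-laden illustration, not the GDC library interface; `IndexedPhrase` and `extract` are hypothetical names, and the real archive format may differ entirely.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical phrase record that also stores its start in the decoded output.
struct IndexedPhrase {
    std::size_t target_start;  // position of the first symbol in the decoded sequence
    std::size_t length;        // number of decoded symbols this phrase covers
    bool is_copy;              // true: copy from reference, false: repeated literal
    std::size_t ref_pos;       // start in the reference (used when is_copy)
    char literal;              // symbol to emit (used when !is_copy)
};

// Return decoded[start, start + count) without decoding the whole sequence.
// Assumes phrases are sorted by target_start and cover the output contiguously
// from position 0. Requests past the end are truncated.
std::string extract(const std::string& ref,
                    const std::vector<IndexedPhrase>& phrases,
                    std::size_t start, std::size_t count) {
    std::string out;
    if (count == 0 || phrases.empty()) return out;

    // Find the last phrase whose target_start is <= start.
    auto it = std::upper_bound(
        phrases.begin(), phrases.end(), start,
        [](std::size_t pos, const IndexedPhrase& ph) { return pos < ph.target_start; });
    if (it == phrases.begin()) return out;
    --it;

    std::size_t pos = start;
    const std::size_t end = start + count;
    for (; it != phrases.end() && pos < end; ++it) {
        const std::size_t phrase_end = it->target_start + it->length;
        if (pos >= phrase_end) break;  // start lies beyond the decoded sequence
        const std::size_t offset = pos - it->target_start;
        const std::size_t take = std::min(phrase_end, end) - pos;
        if (it->is_copy) {
            assert(it->ref_pos + offset + take <= ref.size());
            out.append(ref, it->ref_pos + offset, take);
        } else {
            out.append(take, it->literal);
        }
        pos += take;
    }
    return out;
}
```

With the reference `ACGTACGTTTGACCA` and the phrases (copy 9 symbols from 0), (literal `A`), (copy 5 symbols from 10), the call `extract(ref, phrases, 7, 5)` touches all three phrases and returns `TTAGA`. The cost depends on the number of phrases overlapping the requested range, not on the total sequence length.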

The package also contains two sample applications.

GDC compressor

GDC is able to:

  • compress collections of genomes in FASTA format,
  • decompress the whole collection,
  • decompress only a single genome without decompressing the complete collection.

TEST_RA program

TEST_RA is an application that tests random access queries on the compressed archive.

How good is GDC?

Compression factor

In terms of compression factor (the ability to reduce the file size), GDC is usually much better than universal compressors. Its compression factors for some test data sets are:

  • 180–240 — on a collection of 70 genomes of H. sapiens from Complete Genomics Inc. (218,961.98 MB),
  • 70–100 — on a collection of 39 genomes of S. cerevisiae (493.98 MB),
  • 40–90 — on a collection of 36 genomes of S. paradoxus (436.43 MB),
  • 12–17 — on a collection of 4 genomes of H. sapiens (12,253.14 MB).

Since the idea of GDC is to compress genomes relative to a reference genome, a more telling measure is how compactly each subsequent genome (one that follows the reference sequence) is represented. Measured this way, the compression factors for H. sapiens range from 420 to 1000. At the upper end of this range, a whole human genome can be stored in less than 3.12 MB.
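To make these figures concrete, the small sketch below converts a compression factor into an archive size. It assumes that "compression factor" means original size divided by compressed size, which the text implies but does not state; the helper names `compressed_size_mb` and `per_genome_mb` are hypothetical.

```cpp
#include <cassert>

// Compressed size implied by a compression factor, assuming
// factor = original size / compressed size.
double compressed_size_mb(double original_mb, double factor) {
    assert(factor > 0.0);
    return original_mb / factor;
}

// Average uncompressed size of one genome in a collection.
double per_genome_mb(double collection_mb, int genomes) {
    assert(genomes > 0);
    return collection_mb / genomes;
}
```

For example, `compressed_size_mb(493.98, 100.0)` is about 4.94 MB and `compressed_size_mb(493.98, 70.0)` is about 7.06 MB, so under this definition the S. cerevisiae collection would occupy roughly 5 to 7 MB. Likewise, `per_genome_mb(12253.14, 4)` is about 3,063 MB, and dividing that by a factor of 1000 gives about 3.06 MB per additional human genome.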

Compression and decompression speed

The compression speed of GDC varies with the data; for the data sets mentioned above it ranges from 17 to 40 MB/s on a computer equipped with a 2.4 GHz AMD Opteron processor. The decompression speed is about 150 MB/s, which is at least on a par with the speed of the I/O system.

Random access time

The time needed to decompress a snippet from the archive depends on the compression mode and the snippet size. For snippets of 100 symbols, however, the time is about 10 µs.

Publication

The GDC algorithm is described in the paper:
S. Deorowicz, Sz. Grabowski, Robust relative compression of genomes with random access. Bioinformatics 2011, 27(21):2979–2986.

Terms of use of GDC

GDC is, in general, a free compression program available in both binary and source code releases. More details can be found on our download page.