GDC—What is it?
Genome Differential Compressor (GDC) is a utility designed for compressing collections of genomes from the same species. Such collections can be huge, e.g., several (or tens of) gigabytes, so the need for a robust data compression tool is clear. Universal compression programs like gzip or bzip2 can be used for this purpose, but a specialized tool can work much better, since a universal compressor does not exploit the properties of such data sets, e.g., long approximate repetitions at long distances.
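To make the idea of exploiting long repetitions concrete, here is a minimal sketch of relative compression: a target sequence is encoded as copies from a reference sequence plus literal symbols. It is an illustration only and not GDC's actual algorithm or API. The names Phrase, encode_relative, decode_relative and the min_match parameter are hypothetical, and the brute-force search is far too slow for real genomes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical phrase type: either a copy from the reference or one literal.
struct Phrase {
    bool is_match;        // true: copy `length` symbols starting at `ref_pos`
    std::size_t ref_pos;  // start position in the reference (matches only)
    std::size_t length;   // number of symbols covered by this phrase
    char literal;         // the symbol itself (literals only)
};

// Greedy encoding: at each target position take the longest exact match
// found anywhere in the reference; fall back to a literal if it is too short.
// Brute-force search, for illustration only.
std::vector<Phrase> encode_relative(const std::string& reference,
                                    const std::string& target,
                                    std::size_t min_match = 4) {
    std::vector<Phrase> phrases;
    std::size_t i = 0;
    while (i < target.size()) {
        std::size_t best_len = 0, best_pos = 0;
        for (std::size_t r = 0; r < reference.size(); ++r) {
            std::size_t len = 0;
            while (r + len < reference.size() && i + len < target.size() &&
                   reference[r + len] == target[i + len]) {
                ++len;
            }
            if (len > best_len) { best_len = len; best_pos = r; }
        }
        if (best_len >= min_match) {
            phrases.push_back({true, best_pos, best_len, '\0'});
            i += best_len;
        } else {
            phrases.push_back({false, 0, 1, target[i]});
            ++i;
        }
    }
    return phrases;
}

// Rebuild the target from the reference and the phrase list.
std::string decode_relative(const std::string& reference,
                            const std::vector<Phrase>& phrases) {
    std::string out;
    for (const Phrase& p : phrases) {
        if (p.is_match) {
            assert(p.ref_pos + p.length <= reference.size());
            out.append(reference, p.ref_pos, p.length);
        } else {
            out.push_back(p.literal);
        }
    }
    return out;
}
```

In this sketch, a target that differs from the reference by a single substitution is encoded as a copy, one literal, and another copy, which is how an approximate repetition turns into a handful of phrases instead of a full sequence.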
The architecture of GDC
GDC is designed as a C++ library that can be used by various applications. The key features of the library are:
- compression of collections of genomes in FASTA format,
- decompression of the whole collection,
- decompression of only a single genome without decompressing the complete collection,
- decompression of any part of any chromosome from any genome of the collection without decompressing the complete collection and the complete genome (i.e., random access functionality).
The package also contains two sample applications.
GDC is able to:
- compress collections of genomes in FASTA format,
- decompress the whole collection,
- decompress only a single genome without decompressing the complete collection.
TEST_RA is an application that tests random access queries on the compressed archive.
How good is GDC?
In terms of compression factor (the ability to reduce the file size), GDC is usually much better than universal compressors. Its compression factors for some test data sets are:
- 180–240 — on a collection of 70 genomes of H. sapiens from Complete Genomics Inc. (218,961.98 MB),
- 70–100 — on a collection of 39 genomes of S. cerevisiae (493.98 MB),
- 40–90 — on a collection of 36 genomes of S. paradoxus (436.43 MB),
- 12–17 — on a collection of 4 genomes of H. sapiens (12,253.14 MB).
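The factors above can be turned into approximate archive sizes. The sketch below assumes the usual definition of compression factor (original size divided by compressed size); the names compressed_size_mb, SizeRange and archive_size_range are hypothetical helpers and not part of the GDC library.

```cpp
#include <cassert>

// Assumed definition: factor = original size / compressed size,
// so compressed size = original size / factor.
double compressed_size_mb(double original_mb, double factor) {
    assert(factor > 0.0);
    return original_mb / factor;
}

// Smallest and largest expected archive size for a range of factors.
struct SizeRange {
    double best_mb;   // size at the highest factor
    double worst_mb;  // size at the lowest factor
};

SizeRange archive_size_range(double original_mb,
                             double factor_low, double factor_high) {
    assert(factor_low <= factor_high);
    return { compressed_size_mb(original_mb, factor_high),
             compressed_size_mb(original_mb, factor_low) };
}
```

Under that assumption, archive_size_range(218961.98, 180.0, 240.0) gives roughly 912 MB to 1,216 MB for the collection of 70 H. sapiens genomes.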
Compression and decompression speed
The compression speed of GDC varies with the data, but for the data sets mentioned above it ranges from 17 to 40 MB/s on a computer equipped with a 2.4 GHz AMD Opteron processor. The decompression speed is about 150 MB/s, which is at least on a par with the speed of the I/O system.
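As a rough guide, these speeds translate into processing times as follows. The sketch assumes that the speeds are measured against the uncompressed data size and ignores I/O overhead; processing_time_s is a hypothetical helper, not a GDC function.

```cpp
#include <cassert>

// Time in seconds needed to process `size_mb` megabytes of uncompressed
// data at a sustained speed of `speed_mb_per_s` megabytes per second.
double processing_time_s(double size_mb, double speed_mb_per_s) {
    assert(speed_mb_per_s > 0.0);
    return size_mb / speed_mb_per_s;
}
```

For the S. cerevisiae collection (493.98 MB), this estimate gives about 12 to 29 seconds for compression at 40 and 17 MB/s, respectively, and a little over 3 seconds for decompression at 150 MB/s.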
Random access time
The time needed to decompress a snippet from the archive depends on the compression mode and the snippet size. For snippets of 100 symbols, however, the time is about 10 µs.
Publication
The GDC algorithm was described in the paper:
S. Deorowicz, Sz. Grabowski, Robust relative compression of genomes with random access. Bioinformatics 2011, 27(21):2979–2986.
GDC is, in general, a free compression program available in both binary and source code releases. More details can be found on our download page.