REFRESH Bioinformatics Group

GDC

GDC—What is it?

Genome Differential Compressor (GDC) is a utility designed for compressing collections of genomes from the same species. Such collections can be huge, e.g., a few (or tens of) gigabytes, so the need for a robust data compression tool is clear. Universal compression programs like gzip or bzip2 can be used for this purpose, but a specialized tool can work much better, because a universal compressor does not exploit the properties of such data sets, e.g., long approximate repetitions at long distances.
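The effect of long-distance repetitions can be illustrated with a small experiment. The sketch below is not part of GDC; it uses made-up toy data (two random 100,000-base sequences that differ in 100 positions) and compares zlib, whose DEFLATE format can only look back 32 KiB, with LZMA, whose default dictionary is much larger. All variable names are illustrative assumptions.

```python
import lzma
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is reproducible

# Hypothetical toy "genome": 100,000 random bases (not real data).
genome_a = "".join(rng.choice("ACGT") for _ in range(100_000))

# A second toy genome of the "same species": a copy with 100 substitutions.
bases_b = list(genome_a)
for pos in rng.sample(range(len(bases_b)), 100):
    bases_b[pos] = rng.choice("ACGT".replace(bases_b[pos], ""))
genome_b = "".join(bases_b)

single = genome_a.encode("ascii")
pair = (genome_a + genome_b).encode("ascii")

# DEFLATE cannot reference data more than 32 KiB back, so the second
# genome (100,000 bytes away from its near-copy) gains nothing.
zlib_single = len(zlib.compress(single, 9))
zlib_pair = len(zlib.compress(pair, 9))

# LZMA's default dictionary spans both genomes, so the second one is
# encoded mostly as long matches into the first.
lzma_single = len(lzma.compress(single))
lzma_pair = len(lzma.compress(pair))

print(f"zlib: one genome {zlib_single} B, two genomes {zlib_pair} B")
print(f"lzma: one genome {lzma_single} B, two genomes {lzma_pair} B")
```

With zlib the second genome costs about as much as the first, while with LZMA it adds only a small fraction. Real genome collections are far larger than any general-purpose dictionary, which is where a specialized tool has room to do better.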

GDC 2

The architecture of GDC 2

GDC 2 is designed as a C++ application. The key features of the software are:

  • compression of collections of genomes in FASTA format,
  • decompression of the whole collection,
  • decompression of only a single genome without decompressing the complete collection.

How good is GDC?

Compression factor

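In terms of compression factor (the ability to reduce the file size), GDC is usually much better than universal compressors and other specialized genome compressors. For example, the GDC 2 paper listed under Publications reports that a collection of 1092 human diploid genomes is compressed about 9,500 times, roughly 4 times better than the other existing compressors.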

Compression and decompression speed

The compression speed of GDC varies depending on the data, but for the mentioned data sets it ranges from 95 to 200 MB/s on a computer equipped with a 6-core Intel i7 4930K 3.4 GHz processor. The decompression speed depends on the disk speed and reaches up to 1000 MB/s on the same system.

GDC 0.x and 1.x

The architecture of GDC 0.x and 1.x

GDC is designed as a C++ library that can be used by various applications. The key features of the library are:

  • compression of collections of genomes in FASTA format,
  • decompression of the whole collection,
  • decompression of only a single genome without decompressing the complete collection,
  • decompression of any part of any chromosome from any genome of the collection without decompressing the complete collection and the complete genome (i.e., random access functionality).

The package also contains two sample applications.

GDC compressor

GDC is able to:

  • compress collections of genomes in FASTA format,
  • decompress the whole collection,
  • decompress only a single genome without decompressing the complete collection.

TEST_RA program

TEST_RA is an application that tests random access queries against the compressed archive.

How good is GDC?

Compression factor

In terms of compression factor (the ability to reduce the file size), GDC is usually much better than universal compressors. Its compression factors for some test data sets are:

  • 180–240 — on a collection of 70 genomes of H. sapiens from Complete Genomics Inc. (218,961.98 MB),
  • 70–100 — on a collection of 39 genomes of S. cerevisiae (493.98 MB),
  • 40–90 — on a collection of 36 genomes of S. paradoxus (436.43 MB),
  • 12–17 — on a collection of 4 genomes of H. sapiens (12,253.14 MB).

Since the idea of GDC is to compress the genomes relative to some reference genome, a more informative compression factor is one that measures how compactly the subsequent genomes (the ones that follow the reference sequence) are represented. In that case, the compression factors for H. sapiens range from 420 to 1000. This means that, at the upper end of this range, a whole human genome can be stored in as little as about 3.12 MB.
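The idea of compressing a genome relative to a reference can be sketched as follows. This is a deliberately simplified greedy scheme, not GDC's actual algorithm: the function names, the k-mer index, and the operation format are all assumptions made for illustration. The target is described as a list of matches into the reference plus short literals for the places where it differs.

```python
def build_index(reference: str, k: int) -> dict[str, int]:
    """Map each k-mer to the first position where it occurs in the reference."""
    index: dict[str, int] = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], i)
    return index


def encode(reference: str, target: str, k: int = 12) -> list[tuple]:
    """Greedy relative encoding: ('M', pos, length) matches and ('L', text) literals."""
    index = build_index(reference, k)
    ops: list[tuple] = []
    literal: list[str] = []
    i = 0
    while i < len(target):
        pos = index.get(target[i:i + k])
        if pos is None:
            # No k-mer seed in the reference: emit this base as a literal.
            literal.append(target[i])
            i += 1
            continue
        # Extend the seed match for as long as both sequences agree.
        length = k
        while (i + length < len(target) and pos + length < len(reference)
               and target[i + length] == reference[pos + length]):
            length += 1
        if literal:
            ops.append(("L", "".join(literal)))
            literal = []
        ops.append(("M", pos, length))
        i += length
    if literal:
        ops.append(("L", "".join(literal)))
    return ops


def decode(reference: str, ops: list[tuple]) -> str:
    """Rebuild the target from the reference and the list of operations."""
    parts = []
    for op in ops:
        if op[0] == "M":
            _, pos, length = op
            parts.append(reference[pos:pos + length])
        else:
            parts.append(op[1])
    return "".join(parts)


# Made-up toy sequences: the target differs from the reference in one base.
reference = "ACGTTGCATGCCGATAGCTAGGCTTAACGGATCCGATTACAGGCATCGA"
target = reference[:20] + "T" + reference[21:]
ops = encode(reference, target, k=8)
restored = decode(reference, ops)
print(ops)
print(restored == target)
```

In this toy case the target collapses to two matches and a one-base literal. The same effect, on a much larger scale, explains why the genomes that follow the reference compress far better than the reference itself.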

Compression and decompression speed

The compression speed of GDC varies depending on the data, but for the mentioned data sets it ranges from 17 to 40 MB/s on a computer equipped with an AMD Opteron 2.4 GHz processor. The decompression speed is about 150 MB/s, which is at least on a par with the I/O system speed.

Random access time

The time needed to decompress a snippet from the archive depends on the compression mode and the snippet size. For snippets of 100 symbols, the time is about 10 µs.
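One common way to support this kind of snippet extraction is to compress fixed-size blocks independently, so that a query decompresses only the blocks it overlaps. The sketch below shows that general technique with zlib; it is not how GDC implements random access, and the block size and function names are assumptions chosen for illustration.

```python
import random
import zlib

BLOCK_SIZE = 4096  # symbols per independently compressed block (assumed value)


def build_archive(sequence: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Compress the sequence as a list of independent fixed-size blocks."""
    return [zlib.compress(sequence[i:i + block_size])
            for i in range(0, len(sequence), block_size)]


def read_snippet(blocks: list[bytes], start: int, length: int,
                 block_size: int = BLOCK_SIZE) -> bytes:
    """Return sequence[start:start + length], decompressing only the needed blocks."""
    if length <= 0:
        return b""
    first = start // block_size
    last = (start + length - 1) // block_size
    data = b"".join(zlib.decompress(block) for block in blocks[first:last + 1])
    offset = start - first * block_size
    return data[offset:offset + length]


# Made-up toy sequence of 20,000 random bases.
rng = random.Random(1)
sequence = bytes(rng.choice(b"ACGT") for _ in range(20_000))
archive = build_archive(sequence)

# A 100-symbol snippet that straddles the boundary between blocks 0 and 1.
snippet = read_snippet(archive, 4090, 100)
print(snippet == sequence[4090:4190])
```

Smaller blocks make each query cheaper but give the compressor less context, which is one reason a random access mode typically trades away some compression ratio.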

Publications

Deorowicz, S., Danek, A., Niemiec, M., GDC 2: Compression of large collections of genomes, arXiv.org draft, 2015. Abstract:

The fall of prices of the high-throughput genome sequencing changes the landscape of modern genomics. A number of large scale projects aimed at sequencing many human genomes are in progress. Genome sequencing also becomes an important aid in the personalized medicine. One of the significant side effects of this change is a necessity of storage and transfer of huge amounts of genomic data. In this paper we deal with the problem of compression of large collections of complete genomic sequences. We propose an algorithm that is able to compress the collection of 1092 human diploid genomes about 9,500 times. This result is about 4 times better than what is offered by the other existing compressors. Moreover, our algorithm is very fast as it processes the data with speed 200 MB/s on a modern workstation. In a consequence the proposed algorithm allows storing the complete genomic collections at low cost, e.g., the examined collection of 1092 human genomes needs only about 700MB when compressed, what can be compared to about 6.7 TB of uncompressed FASTA files. The source code is available at http://sun.aei.polsl.pl/REFRESH/index.php?page=projects&project=gdc&subpage=about.

Deorowicz, S., Grabowski, Sz., Robust relative compression of genomes with random access, Bioinformatics, 2011; 27(21):2979–2986. Abstract:

Motivation: Storing, transferring and maintaining genomic databases becomes a major challenge because of the rapid technology progress in DNA sequencing and correspondingly growing pace at which the sequencing data are being produced. Efficient compression, with support for extraction of arbitrary snippets of any sequence, is the key to maintaining those huge amounts of data.
Results: We present an LZ77-style compression scheme for relative compression of multiple genomes of the same species. While the solution bears similarity to known algorithms, it offers significantly higher compression ratios at compression speed over an order of magnitude greater. In particular, 69 differentially encoded human genomes are compressed over 400 times at fast compression, or even 1000 times at slower compression (the reference genome itself needs much more space). Adding fast random access to text snippets decreases the ratio to ~300.
Availability: GDC is available at http://sun.aei.polsl.pl/gdc.