REFRESH Bioinformatics Group

TGC

TGC—What is it?

Thousands Genome Compressor (TGC) is a tool for estimating the limits of the compression ratio achievable for human genomes. It can also be used as a very effective tool for compressing Variant Call Format (VCF) files.

The architecture of TGC

TGC is composed of several programs that were used in our experiments on the genomes from the 1000 Genomes Project. A description of the tool can be found in the TGC description.

How good is TGC?

We were able to compress the genomes from the 1000GP by a factor of about 15,500. More details can be found in our paper listed below.

Terms of use of TGC

TGC is, in general, a free compression program available as a source code release. More details can be found on the download page.

Publications

Deorowicz, S., Danek, A., Grabowski, Sz., Genome compression: a novel approach for large collections, Bioinformatics, 2013; 29(20):2572–2578. Abstract:

Motivation: Genomic repositories are rapidly growing, as witnessed by the 1000 Genomes or the UK10K projects. Hence, compression of multiple genomes of the same species has become an active research area in the past years. The well-known large redundancy in human sequences is not easy to exploit because of huge memory requirements from traditional compression algorithms.
Results: We show how to obtain several times higher compression ratio than of the best reported results, on two large genome collections (1092 human and 775 plant genomes). Our inputs are variant call format files restricted to their essential fields. More precisely, our novel Ziv-Lempel-style compression algorithm squeezes a single human genome to ~400 KB. The key to high compression is to look for similarities across the whole collection, not just against one reference sequence, what is typical for existing solutions.
Availability: http://sun.aei.polsl.pl/tgc (also as Supplementary Material) under a free license.
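
The key idea in the abstract, looking for similarities across the whole collection rather than against a single reference, can be illustrated with a small sketch. This is not the TGC implementation: it is a toy Ziv-Lempel-style parser written under our own assumptions. Each genome is modelled as a list of integer variant identifiers (hypothetical values), matches are found by brute force, and the names `longest_match`, `encode`, `decode`, and `MIN_MATCH` are invented for illustration only.

```python
# Toy illustration only: NOT the algorithm used by TGC.
# Assumption: a genome is a list of integer variant identifiers.

MIN_MATCH = 2  # hypothetical threshold; shorter matches become literals


def longest_match(target, pos, earlier):
    """Brute-force search for the longest run in any earlier genome
    that equals target[pos:]. Returns (genome_index, start, length)."""
    best = (-1, 0, 0)
    for g, seq in enumerate(earlier):
        for start in range(len(seq)):
            length = 0
            while (pos + length < len(target)
                   and start + length < len(seq)
                   and seq[start + length] == target[pos + length]):
                length += 1
            if length > best[2]:
                best = (g, start, length)
    return best


def encode(collection):
    """Parse each genome into literals and matches that point into
    ANY previously encoded genome of the collection."""
    encoded = []
    for i, genome in enumerate(collection):
        tokens, pos = [], 0
        while pos < len(genome):
            g, start, length = longest_match(genome, pos, collection[:i])
            if length >= MIN_MATCH:
                tokens.append(("match", g, start, length))
                pos += length
            else:
                tokens.append(("literal", genome[pos]))
                pos += 1
        encoded.append(tokens)
    return encoded


def decode(encoded):
    """Rebuild the collection by replaying the tokens in order."""
    collection = []
    for tokens in encoded:
        genome = []
        for tok in tokens:
            if tok[0] == "match":
                _, g, start, length = tok
                genome.extend(collection[g][start:start + length])
            else:
                genome.append(tok[1])
        collection.append(genome)
    return collection


# Hypothetical variant lists for three individuals.
genomes = [
    [3, 7, 12, 20, 31],
    [3, 7, 15, 20, 31],
    [3, 7, 12, 20, 31, 40],
]
encoded = encode(genomes)
restored = decode(encoded)
```

In this toy input the third genome is described by one match into the first genome plus one literal. A real tool would need an indexed match search and an entropy coder for the tokens, both of which are omitted here.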