REFRESH Bioinformatics Group

GTC

GTC—What is it?

GenoType Compressor is a tool to represent a collection of genotypes in a highly compact form. As an input it takes the VCF file. The compressed structure supports fast queries of various types.

How good is GTC?

We were able to compress the genomes from the HRC (27,165 genotypes and about 40 million variants) from 4.3TB (uncompressed VCF file) to less than 4GB. More details can be found in our paper pointed below.

Terms of use of GTC

GTC is in general a free compression program available in source code release. More details can be found out on the download page.

Publications

+ Danek, A., Deorowicz, S., GTC: how to maintain huge genotype collections in a compressed form, Bioinformatics, 2018; 34(11):1834–1840, Abstract.

Motivation: Nowadays, genome sequencing is frequently used in many research centers. In projects, such as the Haplotype Reference Consortium or the Exome Aggregation Consortium, huge databases of genotypes in large populations are determined. Together with the increasing size of these collections, the need for fast and memory frugal ways of representation and searching in them becomes crucial.
Results: We present GTC (GenoType Compressor), a novel compressed data structure for representation of huge collections of genetic variation data. It significantly outperforms existing solutions in terms of compression ratio and time of answering various types of queries. We show that the largest of publicly available database of about 60 000 haplotypes at about 40 million SNPs can be stored in <4 GB, while the queries related to variants are answered in a fraction of a second.
Availability and implementation: GTC can be downloaded from https://github.com/refresh-bio/GTC or http://sun.aei.polsl.pl/REFRESH/gtc.