REFRESH Bioinformatics Group

GTShark

GTShark—What is it?

GTShark is a tool for compressing large databases of genotype data. It can also use a compressed genotype database as a knowledge base for compressing new samples. It takes a VCF file as input.

How good is GTShark?

We were able to compress the genomes from the HRC (27,165 genotypes and about 40 million variants) from 4.3TB (uncompressed VCF file) to less than 1.7GB. More details can be found in our paper listed below.
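To put these figures in perspective, the short Python sketch below works out the implied compression ratio and the average number of bits spent per genotype. It is only a back-of-the-envelope estimate under stated assumptions: decimal units (1TB = 10^12 bytes, 1GB = 10^9 bytes), exactly 40 million variants, and "27,165 genotypes" read as 27,165 samples with one genotype per sample per variant. The function and constant names are illustrative and are not part of GTShark.

```python
# Back-of-the-envelope figures derived from the numbers quoted above.
# Assumptions (not from the GTShark authors): decimal units, exactly
# 40 million variants, one genotype per sample per variant.

SAMPLES = 27_165            # "27,165 genotypes", read here as samples
VARIANTS = 40_000_000       # "about 40 million variants"
RAW_BYTES = 4.3e12          # 4.3TB uncompressed VCF
COMPRESSED_BYTES = 1.7e9    # "less than 1.7GB", so treated as an upper bound


def compression_ratio(raw_bytes, compressed_bytes):
    """Return how many times smaller the compressed file is."""
    return raw_bytes / compressed_bytes


def bits_per_genotype(compressed_bytes, samples, variants):
    """Return the average number of compressed bits per stored genotype."""
    return compressed_bytes * 8 / (samples * variants)


ratio = compression_ratio(RAW_BYTES, COMPRESSED_BYTES)
bpg = bits_per_genotype(COMPRESSED_BYTES, SAMPLES, VARIANTS)

print(f"compression ratio: at least about {ratio:.0f}x")
print(f"bits per genotype: at most about {bpg:.4f}")
```

Because the compressed size is quoted as "less than 1.7GB", the ratio computed this way is a lower bound and the bits-per-genotype figure is an upper bound.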

Terms of use of GTShark

GTShark is, in general, free compression software released as source code. More details can be found on the download page.

Publications

+ Deorowicz, S., Danek, A., GTShark: Genotype compression in large projects, bioRxiv.org, 2018.

Summary: Nowadays, large sequencing projects handle tens of thousands of individuals. The huge files summarizing the findings definitely require compression. We propose a tool able to compress large collections of genotypes, as well as single samples in such projects, to sizes not achievable to date.
Availability and implementation: https://github.com/refresh-bio/GTShark.