REFRESH Bioinformatics Group

ORCOM

ORCOM—What is it?

Overlapping Reads COmpression with Minimizers (ORCOM) is a compressor of sequencing reads. It takes FASTQ files (possibly gzipped) as input and stores the DNA symbols of each read in a highly compressed form. The ID and quality fields are not stored, so ORCOM cannot be treated as a full-fledged FASTQ compressor.
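To make "DNA symbols only" concrete, here is a minimal Python sketch of the information ORCOM keeps from its input. It is not ORCOM's code: the helper names `open_fastq` and `dna_only` are hypothetical, and the sketch assumes the common four-line FASTQ record layout (ID, sequence, separator, qualities).

```python
import gzip
from typing import Iterable, Iterator


def open_fastq(path: str):
    """Open a FASTQ file as text; a .gz suffix is assumed to mean gzip."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def dna_only(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the sequence line of each four-line FASTQ record.

    The ID line (index 0), the '+' separator (index 2) and the quality
    line (index 3) are dropped, mirroring the fields ORCOM does not store.
    """
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


# Two toy records given as a list of lines instead of a file.
record_lines = ["@read1", "ACGTN", "+", "IIIII",
                "@read2", "GGCAT", "+", "IIIII"]
sequences = list(dna_only(record_lines))
```

With a real file, the same generator can be fed from `open_fastq(path)`, since a text file object iterates over its lines.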

The architecture of ORCOM

ORCOM is implemented in C++. It can be downloaded as source code or as precompiled binaries for the Linux operating system. A description of the tool can be found in our paper (see below).

How good is ORCOM?

The H. sapiens reads (ERA015743), comprising 1.34G reads of length 100–102 bp, can be stored in 5.5 GB. This corresponds to a compression ratio of 0.327 bits per base. More results can be found in the paper.
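The bits-per-base figure can be sanity-checked with a few lines of Python. This is a rough sketch under two assumptions that the page does not state: 1 GB is taken as 10^9 bytes, and the total number of bases is taken as the 134.0 Gb (10^9 bases) quoted in the abstract of the paper cited on this page. The function name `bits_per_base` is hypothetical.

```python
def bits_per_base(compressed_bytes: float, total_bases: float) -> float:
    """Compressed size in bits divided by the number of DNA symbols."""
    return compressed_bytes * 8 / total_bases


# Assumption: decimal units, i.e. 1 GB = 1e9 bytes and 1 Gb = 1e9 bases.
TOTAL_BASES = 134.0e9

ratio_page = bits_per_base(5.5e9, TOTAL_BASES)    # size quoted above
ratio_paper = bits_per_base(5.31e9, TOTAL_BASES)  # size quoted in the abstract

print(f"{ratio_page:.3f} bits per base for 5.5 GB")
print(f"{ratio_paper:.3f} bits per base for 5.31 GB")
```

Under these assumptions the 5.5 GB figure works out to about 0.328 bits per base, close to the quoted 0.327. The small gap is presumably due to rounding of the 5.5 GB size. The 5.31 GB figure from the abstract gives 0.317.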

Terms of use of ORCOM

ORCOM is in general a free compression program, available in both binary and source code releases. More details can be found on the download page.

Publications

+ Grabowski, Sz., Deorowicz, S., Roguski, L., Disk-based compression of data from genome sequencing, Bioinformatics, 2015; 31(9):1389–1395, Abstract.

Motivation: High-coverage sequencing data have significant, yet hard to exploit, redundancy. Most FASTQ compressors cannot efficiently compress the DNA stream of large datasets, since the redundancy between overlapping reads cannot be easily captured in the (relatively small) main memory. More interesting solutions for this problem are disk-based (Yanovsky, 2011; Cox et al., 2012), where the better of these two, from Cox et al. (2012), is based on the Burrows–Wheeler transform (BWT) and achieves 0.518 bits per base for a 134.0 Gb human genome sequencing collection with almost 45-fold coverage.
Results: We propose ORCOM (Overlapping Reads COmpression with Minimizers), a compression algorithm dedicated to sequencing reads (DNA only). Our method makes use of a conceptually simple and easily parallelizable idea of minimizers, to obtain 0.317 bits per base as the compression ratio, allowing to fit the 134.0 Gb dataset into only 5.31GB of space.
Availability: http://sun.aei.polsl.pl/orcom under a free license.
Supplementary data: available at Bioinformatics online.
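The minimizer idea mentioned in the abstract can be illustrated with a short Python sketch. It uses the plain textbook definition (the lexicographically smallest k-mer of a read) and groups reads by that k-mer, so that overlapping reads tend to land in the same bucket. This is an illustration under that assumption only; the exact minimizer definition and the bucket processing used by ORCOM are described in the paper, and the names `minimizer` and `bucket_by_minimizer` and the value k = 4 are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def minimizer(read: str, k: int) -> str:
    """Return the lexicographically smallest k-mer of `read`."""
    if len(read) < k:
        raise ValueError("read is shorter than k")
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def bucket_by_minimizer(reads: Iterable[str], k: int) -> Dict[str, List[str]]:
    """Group reads by their minimizer.

    Each read is handled independently of the others, which is what
    makes this step easy to parallelize.
    """
    buckets: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        buckets[minimizer(read, k)].append(read)
    return dict(buckets)


# The first two toy reads overlap in "TTACAGG" and share the minimizer "ACAG".
reads = ["GATTACAGG", "TTACAGGCA", "CCGTTGGAT"]
buckets = bucket_by_minimizer(reads, k=4)
```

In this toy input the two overlapping reads fall into the bucket keyed by `ACAG`, while the unrelated third read goes to the bucket keyed by `CCGT`. A compressor can then exploit the similarity of reads within each bucket.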