REFRESH Bioinformatics Group

DSRC

DSRC—What is it?

DSRC (DNA Sequence Reads Compression) is an application for compressing data files that contain reads from DNA sequencing in FASTQ format. Such files can be huge, e.g., a few (or tens of) gigabytes, so the need for a robust data compression tool is clear. Universal compression programs such as gzip or bzip2 are usually used for this purpose, but a tool specialized for FASTQ data can achieve better results.

What can DSRC do?

The main features of DSRC are:

  • Effective multithreaded compression of DNA sequencing data stored in FASTQ format.
  • Full support for Illumina, ABI SOLiD, and 454/Ion Torrent dataset formats, including those with non-standard IUPAC base values (symbols other than AGCTN).
  • Support for lossy compression of quality values using the Illumina binning scheme.
  • Support for lossy compression of read IDs, keeping only the key fields selected by the user.
  • Support for pipes, for easy integration with existing pipelines.
  • Python and C++ libraries for working with DSRC archives in your own applications.
  • Availability for Linux and Windows 64-bit operating systems.
  • Open-source C++ code under the GNU GPL 2 license.
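
The two lossy modes in the list above can be illustrated with a short Python sketch. This is not DSRC code and does not use the DSRC libraries. The bin table follows the 8-level Illumina scheme as it is commonly cited, and DSRC's internal table may differ. Leaving scores 0 and 1 unchanged, the Phred+33 offset, the space separator, and the function names `bin_quality`, `bin_quality_string`, and `keep_id_fields` are all assumptions made for illustration.

```python
from typing import Sequence

# Illustrative only, not DSRC code. (low, high, representative) ranges of the
# commonly cited Illumina 8-level binning; DSRC's own table may differ.
ILLUMINA_BINS = [
    (2, 9, 6),
    (10, 19, 15),
    (20, 24, 22),
    (25, 29, 27),
    (30, 34, 33),
    (35, 39, 37),
]


def bin_quality(q: int) -> int:
    """Map one Phred quality score to the representative value of its bin."""
    if q < 2:
        return q  # assumption: scores 0 and 1 are left unchanged
    if q >= 40:
        return 40
    for low, high, representative in ILLUMINA_BINS:
        if low <= q <= high:
            return representative
    raise ValueError(f"unexpected quality score: {q}")


def bin_quality_string(qual: str, offset: int = 33) -> str:
    """Apply the binning to a FASTQ quality line (Phred+33 assumed)."""
    return "".join(chr(bin_quality(ord(c) - offset) + offset) for c in qual)


def keep_id_fields(read_id: str, fields: Sequence[int], sep: str = " ") -> str:
    """Keep only the selected 1-based fields of a read ID split on `sep`."""
    parts = read_id.split(sep)
    return sep.join(parts[i - 1] for i in fields if 1 <= i <= len(parts))


example_id = "@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36"
print(keep_id_fields(example_id, [1]))
print(bin_quality_string("5I"))
```

Binning reduces the number of distinct quality symbols, which makes the quality stream easier to compress, at the cost of not being able to restore the original scores. Dropping ID fields is lossy in the same sense.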

How good is DSRC?

In terms of compression factor (the ability to reduce the file size), DSRC is usually 35–60% better than gzip and 15–30% better than bzip2.
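
To see what a compression factor looks like for the two general-purpose baselines, the following Python sketch measures gzip and bzip2 on a small synthetic FASTQ sample using only the standard library. It does not run DSRC. The helper names `compression_factor` and `synthetic_fastq` are hypothetical, and the random reads are an assumption made to keep the example self-contained: real sequencing data has different statistics, so the numbers printed here say nothing about the figures quoted above.

```python
import bz2
import gzip
import random


def compression_factor(data: bytes, compress) -> float:
    """Original size divided by compressed size (higher is better)."""
    return len(data) / len(compress(data))


def synthetic_fastq(n_reads: int = 2000, read_len: int = 100, seed: int = 0) -> bytes:
    """Build a toy FASTQ file with random bases and random Phred+33 qualities."""
    rng = random.Random(seed)
    records = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        qual = "".join(chr(33 + rng.randint(2, 40)) for _ in range(read_len))
        # A FASTQ record has four lines: ID, bases, separator, qualities.
        records.append(f"@read.{i}\n{seq}\n+\n{qual}\n")
    return "".join(records).encode("ascii")


data = synthetic_fastq()
gzip_factor = compression_factor(data, gzip.compress)
bzip2_factor = compression_factor(data, bz2.compress)
print(f"gzip:  {gzip_factor:.2f}")
print(f"bzip2: {bzip2_factor:.2f}")
```

To measure a real dataset, replace `synthetic_fastq()` with the bytes of an actual FASTQ file.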

In terms of compression and decompression speed, DSRC is much faster than any other FASTQ compressor, and faster than gzip as well. With 8 threads it can compress and decompress at over 500 MB/s.

Terms of use of DSRC

DSRC is, in general, a free compression program available in both binary and source code releases. More details can be found on the download page.

Publications

+ Roguski, L., Deorowicz, S., DSRC 2: Industry-oriented compression of FASTQ files, Bioinformatics, 2014; 30(15):2213–2215, Abstract.

Summary: Modern sequencing platforms produce huge amounts of data. Archiving them raises major problems but is crucial for reproducibility of results, one of the most fundamental principles of science. The widely used gzip compressor, used for reduction of storage and transfer costs, is not a perfect solution, so a few specialized FASTQ compressors were proposed recently. Unfortunately, they are often impractical due to slow processing, lack of support for some variants of FASTQ files, or instability. We propose DSRC 2 that offers compression ratios comparable to the best existing solutions, while being a few times faster and more flexible.
Availability and Implementation: DSRC 2 is freely available at http://sun.aei.polsl.pl/dsrc. The package contains: command-line compressor, C++ and Python libraries for easy integration with existing software, technical documentation with examples of usage.

+ Deorowicz, S., Grabowski, Sz., Compression of DNA sequences in FASTQ format, Bioinformatics, 2011; 27(6):860–862, Abstract.

Motivation: Modern sequencing instruments are able to generate at least hundreds of millions short reads of genomic data. Those huge volumes of data require effective means to store them, provide quick access to any record and enable fast decompression.
Results: We present a specialized compression algorithm for genomic data in FASTQ format which dominates its competitor, G-SQZ, as is shown on a number of datasets from the 1000 Genomes Project (www.1000genomes.org).
Availability: DSRC is freely available at http://sun.aei.polsl.pl/dsrc.