REFRESH Bioinformatics Group

CoMSA

CoMSA—What is it?

Compressor of Multiple Sequence Alignments (CoMSA) is a tool that represents a family of aligned protein sequences (or a collection of families) in a highly compressed form. As input it takes FASTA files (a single aligned family) or Stockholm files (a collection of families).
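To make the FASTA input case concrete, here is a minimal sketch of what "a single aligned family" means in practice: every record has the same length once gap characters are counted. The function name `read_aligned_fasta` and the sample sequences are made up for illustration; this is not part of CoMSA and says nothing about how CoMSA parses its input.

```python
def read_aligned_fasta(text):
    """Parse FASTA text into (name, sequence) pairs and check they are aligned.

    Hypothetical helper for illustration only; not CoMSA code.
    """
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            if name is None:
                raise ValueError("sequence data found before the first header")
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))

    # In an aligned family all rows have equal length (gaps included).
    lengths = {len(seq) for _, seq in records}
    if len(lengths) > 1:
        raise ValueError(f"sequences are not aligned, lengths found: {sorted(lengths)}")
    return records


# Made-up two-sequence family; '-' marks an alignment gap.
example = """>seq1
MKV-LA
>seq2
MKI-LA
"""
family = read_aligned_fasta(example)
```

A Stockholm file, by contrast, can hold many such families in one file, which is why it is the format used for whole-database collections.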

How good is CoMSA?

A Pfam v. 31.0 Stockholm file of size 41.6 GB can be compressed to as little as 1.74 GB (compared with 5.6 GB for gzip). CoMSA is also significantly faster in compression than gzip. More details can be found in our paper listed below.
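The figures above translate into compression ratios as follows. This is only arithmetic on the three sizes quoted in the paragraph; the variable names are chosen for illustration.

```python
# Sizes in GB, as quoted above for the Pfam v. 31.0 Stockholm file.
original_gb = 41.6
comsa_gb = 1.74
gzip_gb = 5.6


def compression_ratio(original, compressed):
    """Return how many times smaller the compressed file is."""
    return original / compressed


comsa_ratio = compression_ratio(original_gb, comsa_gb)
gzip_ratio = compression_ratio(original_gb, gzip_gb)
# How much smaller the CoMSA output is than the gzip output.
advantage_over_gzip = gzip_gb / comsa_gb

print(f"CoMSA: {comsa_ratio:.1f}x")                      # CoMSA: 23.9x
print(f"gzip:  {gzip_ratio:.1f}x")                       # gzip:  7.4x
print(f"CoMSA vs gzip: {advantage_over_gzip:.1f}x smaller")  # 3.2x smaller
```

On this particular file the CoMSA output is therefore roughly 3.2 times smaller than the gzip output.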

Terms of use of CoMSA

CoMSA is, in general, a free compression program released as source code. More details can be found on the download page.

Publications

+ Deorowicz, S., Walczyszyn, J., Debudaj-Grabysz, A., CoMSA: compression of protein multiple sequence alignment files, Bioinformatics, 2018. Abstract:

Motivation: Bioinformatics databases grow rapidly and reach sizes that were hard to imagine a decade ago. Among the numerous bioinformatics processes generating hundreds of GB are multiple sequence alignments of protein families. The largest such database, i.e., Pfam, consumes 40–230 GB, depending on the variant. Storage and transfer of such massive data have become a challenge.
Results: We propose a novel compression algorithm, MSAC (Multiple Sequence Alignment Compressor), designed especially for aligned data. It is based on a generalisation of the positional Burrows–Wheeler transform for non-binary alphabets. MSAC handles FASTA as well as Stockholm files. It offers up to six times better compression ratio than other commonly used compressors, e.g., gzip. The experiments performed also yielded an analysis of the influence of protein family size on the compression ratio.
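For readers unfamiliar with the positional Burrows–Wheeler transform mentioned in the abstract, the sketch below shows the basic idea on a non-binary alphabet. It is a textbook-style illustration and not the CoMSA/MSAC implementation: the function names and the toy alignment are assumptions, and the real compressor adds further modelling and entropy-coding stages that are not shown. Each column is written out in the order given by sorting the sequences on their preceding columns (most recent column first), and the order is then updated with a stable sort on the current column. For binary data that update is a two-way partition; for a larger alphabet a stable sort by symbol plays the same role.

```python
def pbwt_forward(rows):
    """Permute each alignment column by the order of the preceding columns.

    rows: equal-length strings, one per aligned sequence.
    Returns one permuted string per column.
    """
    if not rows:
        return []
    width = len(rows[0])
    if any(len(r) != width for r in rows):
        raise ValueError("all sequences must have the same aligned length")

    order = list(range(len(rows)))  # column 0 is emitted in input order
    columns = []
    for k in range(width):
        columns.append("".join(rows[i][k] for i in order))
        # Stable sort by the current symbol: sequences that agree on recent
        # columns stay adjacent, which tends to produce long runs.
        order.sort(key=lambda i: rows[i][k])
    return columns


def pbwt_inverse(columns):
    """Rebuild the original rows from the permuted columns."""
    if not columns:
        return []
    n = len(columns[0])
    order = list(range(n))
    rows = [[] for _ in range(n)]
    for col in columns:
        for pos, seq_index in enumerate(order):
            rows[seq_index].append(col[pos])
        # Replay the same stable sort, keyed on the permuted symbols.
        ranked = sorted(range(n), key=lambda pos: col[pos])
        order = [order[pos] for pos in ranked]
    return ["".join(r) for r in rows]


# Toy alignment, made up for illustration.
alignment = ["MKV-LA", "MRI-LG", "MKV-LA", "MRI-LG", "MKI-LA"]
transformed = pbwt_forward(alignment)
restored = pbwt_inverse(transformed)
```

The transform is lossless: `pbwt_inverse` needs only the permuted columns, because the sort order at each step can be recomputed from the columns already decoded.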