Publications

Published papers

Preprints

Deorowicz, S., FQSqueezer: k-mer-based compression of sequencing data, bioRxiv.org, 2019; ():Abstract.

Motivation: The amount of genomic data that needs to be stored is huge. Therefore it is not surprising that a lot of work has been done in the field of specialized data compression of FASTQ files. The existing algorithms are, however, still imperfect and the best tools produce quite large archives.
Results: We present FQSqueezer, a novel compression algorithm for sequencing data able to process single- and paired-end reads of variable lengths. It is based on the ideas from the famous prediction by partial matching and dynamic Markov coder algorithms known from the general-purpose-compressors world. The compression ratios are often tens of percent better than offered by the state-of-the-art tools.

Deorowicz, S., Gudys, A., Whisper 2: indel-sensitive short read mapping, bioRxiv.org, 2019; :Abstract.

Summary: Whisper 2 is a short-read-mapping software providing superior quality of indel variant calling. Its running times place it among the fastest existing tools.

2020

2019

Deorowicz, S., Walczyszyn, J., Debudaj-Grabysz, A., CoMSA: compression of protein multiple sequence alignment files, Bioinformatics, 2019; 35(2):22–234, Abstract.

Motivation: Bioinformatics databases grow rapidly and achieve values hardly to imagine a decade ago. Among numerous bioinformatics processes generating hundreds of GB ismultiple sequence alignments of protein families. Its largest database, i.e., Pfam, consumes 40–230GB, depending of the variant. Storage and transfer of such massive data has become a challenge.
Results: We propose a novel compression algorithm, MSAC (Multiple Sequence Alignment Compressor), designed especially for aligned data. It is based on a generalisation of the positional Burrows–Wheeler transform for non-binary alphabets. MSAC handles FASTA, as well as Stockholm files. It offers up to six times better compression ratio than other commonly used compressors, i.e., gzip. Performed experiments resulted in an analysis of the influence of a protein family size on the compression ratio.

Kawulok, J., Kawulok, M., Deorowicz, S., Environmental metagenome classification for constructing a microbiome fingerprint, Biology Direct, 2019; 14(20):Abstract.

Background: Nowadays, not only are single genomes commonly analyzed, but also metagenomes, which are sets of, DNA fragments (reads) derived from microbes living in a given environment. Metagenome analysis is aimed at extracting crucial information on the organisms that have left their traces in an investigated environmental sample.In this study we focus on the MetaSUB Forensics Challenge (organized within the CAMDA 2018 conference) which consists in predicting the geographical origin of metagenomic samples. Contrary to the existing methods for environmental classification that are based on taxonomic or functional classification, we rely on the similarity between a sample and the reference database computed at a reads level.
Results: We report the results of our extensive experimental study to investigate the behavior of our method and its sensitivity to different parameters. In our tests, we have followed the protocol of the MetaSUB Challenge, which allowed us to compare the obtained results with the solutions based on taxonomic and functional classification.
Conclusions: The results reported in the paper indicate that our method is competitive with those based on taxonomic classification. Importantly, by measuring the similarity at the reads level, we avoid the necessity of using large databases with annotated gene sequences. Hence our main finding is that environmental classification of metagenomic data can be proceeded without using large databases required for taxonomic or functional classification.

Deorowicz, S., Danek, A., GTShark: Genotype compression in large projects, Bioinformatics, 2019; 35(22):4791–4793, Abstract.

Summary: Nowadays large sequencing projects handle tens of thousands of individuals. The huge files summarizing the findings definitely require compression. We propose a tool able to compress large collections of genotypes almost 30% better than the best tool to date, i.e. squeezing human genotype to less than 62 KB. Moreover, it can also compress single samples in reference to the existing database achieving comparable results.
Availability and implementation: https://github.com/refresh-bio/GTShark.

Deorowicz, S., Gudys, A., Dlugosz, M., Kokot, M., Danek, A., Kmer-db: instant evolutionary distance estimation, Bioinformatics, 2019; 35(1):133–136, Abstract.

Summary: Kmer-db is a new tool for estimating evolutionary relationship on the basis of k-mers extracted from genomes or sequencing reads. Thanks to an efficient data structure and parallel implementation, our software estimates distances between 40,715 pathogens in <7 min (on a modern workstation), 26 times faster than Mash, its main competitor.
Availability and implementation: https://github.com/refresh-bio/kmer-db.

Deorowicz, S., Debudaj-Grabysz, A., Gudys, A., Grabowski, Sz., Whisper: read sorting allows robust robust mapping of DNA sequencing data, Bioinformatics, 2019; 35(12):2043–2050, Abstract.

Motivation: Mapping reads to a reference genome is often the first step in a sequencing data analysis pipeline. The reduction of sequencing costs implies a need for algorithms able to process increasing amounts of generated data in reasonable time.
Results: We present Whisper, an accurate and high-performant mapping tool, based on the idea of sorting reads and then mapping them against suffix arrays for the reference genome and its reverse complement. Employing task and data parallelism as well as storing temporary data on disk result in superior time efficiency at reasonable memory requirements. Whisper excels at large NGS read collections, in particular Illumina reads with typical WGS coverage. The experiments with real data indicate that our solution works in about 15% of the time needed by the well-known BWA-MEM and Bowtie2 tools at a comparable accuracy, validated in a variant calling pipeline.

2018

Deorowicz, S., Grabowski, Sz., DeltaComp: Fast and efficient compression of astronomical timelines, New Astronomy, 2018; 65():59–66, Abstract.

DeltaComp compresses Planck spacecraft data tens of times at low quantization error. Abstract Astronomical instruments commonly generate large data series in tabular format. To efficiently store and transmit these data series, carefully designed compression formats are welcome. We propose DeltaComp, a free open-source program suited for (relatively) smooth streams of data. DeltaComp is based on rather simple mechanisms; essentially it is a tailored combination of delta coding and context-based modeling. Following the methodology of the preceding work presenting Polycomp (Tomasi, 2016), we compress (i) the ephemeris table of Ganymede, and (ii) the publicly available timelines recorded by LFI, an array of microwave radiometers aboard the ESA Planck spacecraft. In the former case (with small data) the compression ratio advantage of DeltaComp over Polycomp is by a factor of about 1.4. For the Planck data, of size 4.24TB, the archives produced by DeltaComp are almost six times smaller than those from Polycomp, at somewhat smaller median quantization error (and much smaller maximum error), which translates to only 66GB of required storage. Importantly, DeltaComp is over three orders of magnitude faster than Polycomp.

Roguski, L., Ochoa, I., Hernaez, M., Deorowicz, S., FaStore: a space-saving solution for raw sequencing data, Bioinformatics, 2018; 34(16):2748–2756, Abstract.

Motivation: The affordability of DNA sequencing has led to the generation of unprecedented volumes of raw sequencing data. These data must be stored, processed and transmitted, which poses significant challenges. To facilitate this effort, we introduce FaStore, a specialized compressor for FASTQ files. FaStore does not use any reference sequences for compression and permits the user to choose from several lossy modes to improve the overall compression ratio, depending on the specific needs.
Results: FaStore in the lossless mode achieves a significant improvement in compression ratio with respect to previously proposed algorithms. We perform an analysis on the effect that the different lossy modes have on variant calling, the most widely used application for clinical decision making, especially important in the era of precision medicine. We show that lossy compression can offer significant compression gains, while preserving the essential genomic information and without affecting the variant calling performance.
Availability and implementation: FaStore can be downloaded from https://github.com/refresh-bio/FaStore.

Danek, A., Deorowicz, S., GTC: how to maintain huge genotype collections in a compressed form, Bioinformatics, 2018; 34(11):1834–1840, Abstract.

Motivation: Nowadays, genome sequencing is frequently used in many research centers. In projects, such as the Haplotype Reference Consortium or the Exome Aggregation Consortium, huge databases of genotypes in large populations are determined. Together with the increasing size of these collections, the need for fast and memory frugal ways of representation and searching in them becomes crucial.
Results: We present GTC (GenoType Compressor), a novel compressed data structure for representation of huge collections of genetic variation data. It significantly outperforms existing solutions in terms of compression ratio and time of answering various types of queries. We show that the largest of publicly available database of about 60 000 haplotypes at about 40 million SNPs can be stored in <4 GB, while the queries related to variants are answered in a fraction of a second.
Availability and implementation: GTC can be downloaded from https://github.com/refresh-bio/GTC or http://sun.aei.polsl.pl/REFRESH/gtc.

Kokot, M., Deorowicz, S., Długosz, M., Even Faster Sorting of (Not Only) Integers, Chapter in book Man-Machine Interactions 5, Series Advances in Intelligent Systems and Computing, Springer International Publishing Switzerland, Gruca, A., Czachorski, T., Harezlak, K., Kozielski, S., Piotrowska, A., Editors; 2018; 659:481–491, Abstract.

In this paper we introduce RADULS2, the fastest parallel sorter based on radix algorithm. It is optimized to process huge amounts of data making use of modern multicore CPUs. The main novelties include: high performance algorithm for handling tiny arrays (up to about a hundred of records) that could appear even billions times as subproblems to handle and improved processing of larger subarrays with better use of non-temporal memory stores.

Długosz, M., Deorowicz, S., Kokot, M., Improvements in DNA Reads Correction, Chapter in book Man-Machine Interactions 5, Series Advances in Intelligent Systems and Computing, Springer International Publishing Switzerland, Gruca, A., Czachorski, T., Harezlak, K., Kozielski, S., Piotrowska, A., Editors; 2018; 659:115–124, Abstract.

We introduce an improved version of RECKONER, an error corrector for Illumina whole genome sequencing data. By modifying its workflow we reduce the computation time even 10 times. We also propose a new method of determination of k-mer length, the key parameter of k-spectrum-based family of correctors. The correction algorithms are examined on huge data sets, i.e., human and maize genomes for both Illumina HiSeq and MiSeq instruments.

2017

Kokot, M., Długosz, M., Deorowicz, S., KMC 3: counting and manipulating k -mer statistics, Bioinformatics, 2017; 33(17):2759–2761, Abstract.

Summary: Counting all k-mers in a given dataset is a standard procedure in many bioinformatics applications. We introduce KMC3, a significant improvement of the former KMC2 algorithm together with KMC tools for manipulating k-mer databases. Usefulness of the tools is shown on a few real problems.
Availability: Program is freely available at http://sun.aei.polsl.pl/REFRESH/kmc.

Gudys, A., Deorowicz, S., QuickProbs 2: towards rapid construction of high-quality alignments of large protein families, Scientific Reports, 2017; 7(41553):Abstract.

The ever-increasing size of sequence databases caused by the development of high throughput sequencing, poses to multiple alignment algorithms one of the greatest challenges yet. As we show, well-established techniques employed for increasing alignment quality, i.e., refinement and consistency, are ineffective when large protein families are investigated. We present QuickProbs 2, an algorithm for multiple sequence alignment. Based on probabilistic models, equipped with novel column-oriented refinement and selective consistency, it offers outstanding accuracy. When analysing hundreds of sequences, Quick-Probs 2 is noticeably better than ClustalOmega and MAFFT, the previous leaders for processing numerous protein families. In the case of smaller sets, for which consistency-based methods are the best performing, QuickProbs 2 is also superior to the competitors. Due to low computational requirements of selective consistency and utilization of massively parallel architectures, presented algorithm has similar execution times to ClustalOmega, and is orders of magnitude faster than full consistency approaches, like MSAProbs or PicXAA. All these make QuickProbs 2 an excellent tool for aligning families ranging from few, to hundreds of proteins.

Maciej Długosz, M., Deorowicz, S., RECKONER: read error corrector based on KMC, Bioinformatics, 2017; 33(7):1086–1089, Abstract.

Summary: Presence of sequencing errors in data produced by next-generation sequencers affects quality of downstream analyzes. Accuracy of them can be improved by performing error correction of sequencing reads. We introduce a new correction algorithm capable of processing eukaryotic close to 500 Mbp-genome-size, high error-rated data using less than 4GB of RAM in about 35min on 16-core computer.
Availability and Implementation: Program is freely available at http://sun.aei.polsl.pl/REFRESH/reckoner.
Contact: sebastian.deorowicz@polsl.pl
Supplementary information: Supplementary data are available at Bioinformatics online.

Grabowski, Sz., Raniszewski, M., Deorowicz, S., FM-index for Dummies, Chapter in book Beyond Databases, Architectures, and Structures, Series Communications in Computer and Information Science, Springer International Publishing Switzerland, Kozielski, S., Mrozek, D., Kasprowski, P., Malysiak-Mrozek, B., Kostrzewa, D., Editors; 2017; 718:189–201, Abstract.

Full-text search refers to techniques for searching a document, or a document collection, in a full-text database. To speed up such searches, the given text should be indexed. The FM-index is a celebrated compressed data structure for full-text pattern searching. After the first wave of interest in its theoretical developments, we can observe a surge of interest in practical FM-index variants in the last few years. These enhancements are often related to a bit-vector representation, augmented with an efficient rank-handling data structure. In this work, we propose a new, cache-friendly, implementation of the rank primitive and advocate for a very simple architecture of the FM-index, which trades compression ratio for speed. Experimental results show that our variants are 2–3 times faster than the fastest known ones, for the price of using typically 1.5–5 times more space.

Kokot, M., Deorowicz, S., Debudaj-Grabysz, A., Sorting Data on Ultra-Large Scale with RADULS: New Incarnation of Radix Sort, Chapter in book Beyond Databases, Architectures, and Structures, Series Communications in Computer and Information Science, Springer International Publishing Switzerland, Kozielski, S., Mrozek, D., Kasprowski, P., Malysiak-Mrozek, B., Kostrzewa, D., Editors; 2017; 718:235–245, Abstract.

The paper introduces RADULS, a new parallel sorter based on radix sort algorithm, intended to organize ultra-large data sets efficiently. For example 4G 16-byte records can be sorted with 16 threads in less than 15 seconds on Intel Xeon-based workstation. The implementation of RADULS is not only highly optimized to gain such an excellent performance, but also parallelized in a cache friendly manner to make the most of modern multicore architectures. Besides, our parallel scheduler launches a few different procedures at runtime, according to the current parameters of the execution, for proper workload management. All experiments show RADULS to be superior to competing algorithms.

2016

Deorowicz, S., Debudaj-Grabysz, A., Gudys, A., FAMSA: Fast and accurate multiple sequence alignment of huge protein families, Scientific Reports, 2016; 6(33964):Abstract.

Rapid development of modern sequencing platforms has contributed to the unprecedented growth of protein families databases. The abundance of sets containing hundreds of thousands of sequences is a formidable challenge for multiple sequence alignment algorithms. The article introduces FAMSA, a new progressive algorithm designed for fast and accurate alignment of thousands of protein sequences. Its features include the utilization of the longest common subsequence measure for determining pairwise similarities, a novel method of evaluating gap costs, and a new iterative refinement scheme. What matters is that its implementation is highly optimized and parallelized to make the most of modern computer platforms. Thanks to the above, quality indicators, i.e. sum-of-pairs and total-column scores, show FAMSA to be superior to competing algorithms, such as Clustal Omega or MAFFT for datasets exceeding a few thousand sequences. Quality does not compromise on time or memory requirements, which are an order of magnitude lower than those in the existing solutions. For example, a family of 415519 sequences was analyzed in less than two hours and required no more than 8?GB of RAM. FAMSA is available for free at http://sun.aei.polsl.pl/REFRESH/famsa.

Deorowicz, S., Grabowski, Sz., Ochoa, I., Hernaez, M., Weissman, T., Comment on: "ERGC: An efficient referential genome compression algorithm", Bioinformatics, 2016; 32(7):1115–1117, Abstract.

Motivation: Data compression is crucial in effective handling of genomic data. Among several recently published algorithms, ERGC seems to be surprisingly good, easily beating all of the competitors.
Results: We evaluated ERGC and the previously proposed algorithms GDC and iDoComp, which are the ones used in the original paper for comparison, on a wide data set including 12 assemblies of human genome (instead of only four of them in the original paper). ERGC wins only when one of the genomes (referential or target) contains mixed-cased letters (which is the case for only the two Korean genomes). In all other cases ERGC is on average an order of magnitude worse than GDC and iDoComp.

2015

Kabza, M., Kubiak, M.R., Danek, A., Rosikiewicz, W., Deorowicz, S., Polański, A., Makałowska, I., Inter-population Differences in Retrogene Loss and Expression in Humans, PLOS Genetics, 2015; 11(10):e1005579, Abstract.

Gene retroposition leads to considerable genetic variation between individuals. Recent studies revealed the presence of at least 208 retroduplication variations (RDVs), a class of polymorphisms, in which a retrocopy is present or absent from individual genomes. Most of these RDVs resulted from recent retroduplications. In this study, we used the results of Phase 1 from the 1000 Genomes Project to investigate the variation in loss of ancestral (i.e. shared with other primates) retrocopies among different human populations. In addition, we examined retrocopy expression levels using RNA-Seq data derived from the Ilumina BodyMap project, as well as data from lymphoblastoid cell lines provided by the Geuvadis Consortium. We also developed a new approach to detect novel retrocopies absent from the reference human genome. We experimentally confirmed the existence of the detected retrocopies and determined their presence or absence in the human genomes of 17 different populations. Altogether, we were able to detect 193 RDVs; the majority resulted from retrocopy deletion. Most of these RDVs had not been previously reported. We experimentally confirmed the expression of 11 ancestral retrogenes that underwent deletion in certain individuals. The frequency of their deletion, with the exception of one retrogene, is very low. The expression, conservation and low rate of deletion of the remaining 10 retrocopies may suggest some functionality. Aside from the presence or absence of expressed retrocopies, we also searched for differences in retrocopy expression levels between populations, finding 9 retrogenes that undergo statistically significant differential expression.

Kowalski, T., Grabowski, Sz., Deorowicz, S., Indexing arbitrary-length k-mers in sequencing reads, PLOS ONE, 2015; 10(7):1–16, Abstract.

We propose a lightweight data structure for indexing and querying collections of NGS reads data in main memory. The data structure supports the interface proposed in the pioneering work by Philippe et al. for counting and locating k-mers in sequencing reads. Our solution, PgSA (pseudogenome suffix array), based on finding overlapping reads, is competitive to the existing algorithms in the space use, query times, or both. The main applications of our index include variant calling, error correction and analysis of reads from RNA-seq experiments.

Deorowicz, S., Danek, A., Niemiec, M., GDC 2: Compression of large collections of genomes, Scientific Reports, 2015; 5(11565):1–12, Abstract.

The fall of prices of the high-throughput genome sequencing changes the landscape of modern genomics. A number of large scale projects aimed at sequencing many human genomes are in progress. Genome sequencing also becomes an important aid in the personalized medicine. One of the significant side effects of this change is a necessity of storage and transfer of huge amounts of genomic data. In this paper we deal with the problem of compression of large collections of complete genomic sequences. We propose an algorithm that is able to compress the collection of 1092 human diploid genomes about 9,500 times. This result is about 4 times better than what is offered by the other existing compressors. Moreover, our algorithm is very fast as it processes the data with speed 200 MB/s on a modern workstation. In a consequence the proposed algorithm allows storing the complete genomic collections at low cost, e.g., the examined collection of 1092 human genomes needs only about 700 MB when compressed, what can be compared to about 6.7 TB of uncompressed FASTA files.

Kawulok, J., Deorowicz, S., CoMeta: Classification of Metagenomes Using k-mers, PLOS ONE, 2015; 10(4):1–23, Abstract.

Nowadays, the study of environmental samples has been developing rapidly. Characterization of the environment composition broadens the knowledge about the relationship between species composition and environmental conditions. An important element of extracting the knowledge of the sample composition is to compare the extracted fragments of DNA with sequences derived from known organisms. In the presented paper, we introduce an algorithm called CoMeta (Classification of metagenomes), which assigns a query read (a DNA fragment) into one of the groups previously prepared by the user. Typically, this is one of the taxonomic rank (e.g., phylum, genus), however prepared groups may contain sequences having various functions. In CoMeta, we used the exact method for read classification using short subsequences (k-mers) and fast program for indexing large set of k-mers. In contrast to the most popular methods based on BLAST, where the query is compared with each reference sequence, we begin the classification from the top of the taxonomy tree to reduce the number of comparisons. The presented experimental study confirms that CoMeta outperforms other programs used in this context. CoMeta is available at https://github.com/jkawulok/cometa under a free GNU GPL 2 license.

Deorowicz, S., Kokot, M., Grabowski, Sz., Debudaj-Grabysz, A., KMC 2: Fast and resource-frugal k-mer counting, Bioinformatics, 2015; 31:1569–1576, Abstract.

Motivation: Building the histogram of occurrences of every k-symbol long substring of nucleotide data is a standard step in many bioinformatics applications, known under the name of k-mer counting. Its applications include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection. The tremendous amounts of NGS data require fast algorithms for k-mer counting, preferably using moderate amounts of memory.
Results: We present a novel method for k-mer counting, on large datasets about twice faster than the strongest competitors (Jellyfish 2, KMC 1), using about 12GB (or less) of RAM. Our disk-based method bears some resemblance to MSPKmerCounter, yet replacing the original minimizers with signatures (a carefully selected subset of all minimizers) and using (k; x)-mers allows to significantly reduce the I/O, and a highly parallel overall architecture allows to achieve unprecedented processing speeds. For example, KMC 2 counts the 28-mers of a human reads collection with 44-fold coverage (106GB of compressed size) in about 20 minutes, on a 6-core Intel i7 PC with an SSD.
Availability: KMC 2 is freely available at http://sun.aei.polsl.pl/kmc.

Grabowski, Sz., Deorowicz, S., Roguski, L., Disk-based compression of data from genome sequencing, Bioinformatics, 2015; 31:1389–1395, Abstract.

Motivation: High-coverage sequencing data have significant, yet hard to exploit, redundancy. Most FASTQ compressors cannot efficiently compress the DNA stream of large datasets, since the redundancy between overlapping reads cannot be easily captured in the (relatively small) main memory. More interesting solutions for this problem are disk-based (Yanovsky, 2011; Cox et al., 2012), where the better of these two, from Cox et al. (2012), is based on the BurrowsWheeler transform (BWT) and achieves 0.518 bits per base for a 134.0 Gb human genome sequencing collection with almost 45-fold coverage.
Results: We propose ORCOM (Overlapping Reads COmpression with Minimizers), a compression algorithm dedicated to sequencing reads (DNA only). Our method makes use of a conceptually simple and easily parallelizable idea of minimizers, to obtain 0.317 bits per base as the compression ratio, allowing to fit the 134.0 Gb dataset into only 5.31GB of space.
Availability: http://sun.aei.polsl.pl/orcom under a free license.
Supplementary data: available at Bioinformatics online.

2014

Danek, A., Deorowicz, S., Grabowski, Sz., Indexes of Large Genome Collections on a PC, PLOS ONE, 2014; 9(10):e109384, Abstract.

The availability of thousands of individual genomes of one species should boost rapid progress in personalized medicine or understanding of the interaction between genotype and phenotype, to name a few applications. A key operation useful in such analyses is aligning sequencing reads against a collection of genomes, which is costly with the use of existing algorithms due to their large memory requirements. We present MuGI, Multiple Genome Index, which reports all occurrences of a given pattern, in exact and approximate matching model, against a collection of thousand(s) genomes. Its unique feature is the small index size, which is customisable. It fits in a standard computer with 16–32 GB, or even 8 GB, of RAM, for the 1000GP collection of 1092 diploid human genomes. The solution is also fast. For example, the exact matching queries (of average length 150 bp) are handled in average time of 39 μs and with up to 3 mismatches in 373 μs on the test PC with the index size of 13.4 GB. For a smaller index, occupying 7.4 GB in memory, the respective times grow to 76 μs and 917 μs. Software is available at http://sun.aei.polsl.pl/mugi under a free license. Data S1 is available at PLOS One online.

Deorowicz, S., Grabowski, Sz., Efficient algorithms for the longest common subsequence in k-length substrings, Information Processing Letters, 2014; 114(11):634–638, Abstract.

Finding the longest common subsequence in k-length substrings (LCSk) is a recently proposed problem motivated by computational biology. This is a generalization of the well-known LCS problem in which matching symbols from two sequences A and B are replaced with matching non-overlapping substrings of length k from A and B. We propose several algorithms for LCSk, being non-trivial incarnations of the major concepts known from LCS research (dynamic programming, sparse dynamic programming, tabulation). Our algorithms make use of a linear-time and linear-space preprocessing finding the occurrences of all the substrings of length k from one sequence in the other sequence.

Roguski, L., Deorowicz, S., DSRC 2: Industry-oriented compression of FASTQ files, Bioinformatics, 2014; 30(15):2213–2215, Abstract.

Summary: Modern sequencing platforms produce huge amounts of data. Archiving them raises major problems but is crucial for reproducibility of results, one of the most fundamental principles of science. The widely used gzip compressor, used for reduction of storage and transfer costs, is not a perfect solution, so a few specialized FASTQ compressors were proposed recently. Unfortunately, they are often impractical due to slow processing, lack of support for some variants of FASTQ files, or instability. We propose DSRC 2 that offers compression ratios comparable to the best existing solutions, while being a few times faster and more flexible.
Availability and Implementation: DSRC 2 is freely available at http://sun.aei.polsl.pl/dsrc. The package contains: command-line compressor, C++ and Python libraries for easy integration with existing software, technical documentation with examples of usage.

Gudys, A., Deorowicz, S., QuickProbs – A Fast Multiple Sequence Alignment Algorithm Designed for Graphics Processors, PLOS ONE, 2014; 9(2):e88901, Abstract.

Multiple sequence alignment is a crucial task in a number of biological analyses like secondary structure prediction, domain searching, phylogeny, etc. MSAProbs is currently the most accurate alignment algorithm, but its effectiveness is obtained at the expense of computational time. In the paper we present QuickProbs, the variant of MSAProbs customised for graphics processors. We selected the two most time consuming stages of MSAProbs to be redesigned for GPU execution: the posterior matrices calculation and the consistency transformation. Experiments on three popular benchmarks (BAliBASE, PREFAB, OXBench-X) on quad-core PC equipped with high-end graphics card show QuickProbs to be 5.7 to 9.7 times faster than original CPU-parallel MSAProbs. Additional tests performed on several protein families from Pfam database give overall speed-up of 6.7. Compared to other algorithms like MAFFT, MUSCLE, or ClustalW, QuickProbs proved to be much more accurate at similar speed. Additionally we introduce a tuned variant of QuickProbs which is significantly more accurate on sets of distantly related sequences than MSAProbs without exceeding its computation time. The GPU part of QuickProbs was implemented in OpenCL, thus the package is suitable for graphics processors produced by all major vendors.

Wieczorek, B., Polomski, M., Pecka, P., Deorowicz, S., An Effective Way of Storing and Accessing Very Large Transition Matrices Using Multi-core CPU and GPU Architectures, Chapter in book Beyond Databases, Architectures, and Structures, Series Communications in Computer and Information Science, Springer International Publishing Switzerland, Kozielski, S., Mrozek, D., Kasprowski, P., Malysiak-Mrozek, B., Kostrzewa, D., Editors; 2014; 424:323–334, Abstract.

The topic of probabilistic model checking algorithms of Markov processes makes the problem of storing and processing of very large transition matrices crucial. While the symbolic approach utilises very effective storage and access methods, like Multi-terminal Binary Decision Diagrams, the traditional explicit method of representing and solving the model using sparse matrices has many advantages, especially when we concired the flexibility of computing of different properties. We show and examine a compact representation of a very large transition matrix in the sparse form. The technique employs an effective method of compression, storage, and accessing of the matrix and is suitable for use in mutli-core CPU and GPU environments. Such large transition matrices consume a lot of space and cannot be considered as typical types of data for storage in database systems. Nevertheless, their storage and efficient access to them could be beneficial for solving of Markov models, so constructing specialised compression methods is obviously the way to go.

Kawulok, J., Deorowicz, S., An Improved Algorithm for Fast and Accurate Classification of Sequences, Chapter in book Beyond Databases, Architectures, and Structures, Series Communications in Computer and Information Science, Springer International Publishing Switzerland, Kozielski, S., Mrozek, D., Kasprowski, P., Malysiak-Mrozek, B., Kostrzewa, D., Editors; 2014; 424:335–344, Abstract.

Understanding of biocenosis derived from environmental samples can help understanding the relationships between organisms and the environmental conditions of their occurrence. Therefore, the classification of DNA fragments that are selected from different places is an important issue in many studies. In this paper we report how to improve (in terms of speed and qualification accuracy) the algorithm of fast and accurate classification of sequences (FACS).

Danek, A., Deorowicz, S., Bit-Parallel Algorithm for the Block Variant of the Merged Longest Common Subsequence Problem, Chapter in book Man-Machine Interactions 3, Series Advances in Intelligent Systems and Computing, Springer-Verlag Berlin Heidelberg, Gruca, A., Czachórski, T., Kozielski, S., Editors; 2014; 242:173–181, Abstract.

The problem of comparison of genomic sequences is of great importance. There are various measures of similarity of sequences. One of the most popular is the length of the longest common subsequence (LCS). We propose the first bit-parallel algorithm for the variant of the LCS problem, block merged LCS, which was recently formulated in the studies on the whole genome duplication hypothesis. Practical experiments show that our proposal is from 10 to over 100 times faster than existing algorithms.

Deorowicz, S., Debudaj-Grabysz, A., Gudyś, A., Kalign-LCS – A More Accurate and Faster Variant of Kalign2 Algorithm for the Multiple Sequence Alignment Problem, Chapter in book Man-Machine Interactions 3, Series Advances in Intelligent Systems and Computing, Springer-Verlag Berlin Heidelberg, Gruca, A., Czachórski, T., Kozielski, S., Editors; 2014; 242:495–502, Abstract.

Determination of similarities between species is a crucial issue in life sciences. This task is usually done by comparing fragments of genomic or proteomic sequences of organisms subjected to analysis. The basic procedure which facilitates these comparisons is called multiple sequence alignment. There are a lot of algorithms aiming at this problem, which are either accurate or fast. We present Kalign-LCS, a variant of fast Kalign2 algorithm, that addresses the accuracy vs. speed trade-off. It employs the longest common subsequence measure and was thoroughly optimized. Experiments show that it is faster than Kalign2 and produces noticeably more accurate alignments.

Deorowicz, S., Grabowski, Sz., Subcubic Algorithms for the Sequence Excluded LCS Problem, Chapter in book Man-Machine Interactions 3, Series Advances in Intelligent Systems and Computing, Springer-Verlag Berlin Heidelberg, Gruca, A., Czachórski, T., Kozielski, S., Editors; 2014; 242:503–510, Abstract.

There are a number of generalizations of the longest common subsequence problem in which some additional constraining sequence enforces some properties of the result. We investigate one of such problems, defined recently, i.e., the problem of the longest common subsequence of two sequences in which a constraint is forbidden to be found as a subsequence of the result. The known algorithms solve it in O(nmr) time, where n and m are the two sequence lengths, and r is the length of the constraining sequence. We present two algorithms, with time complexities of O(nmr/log n) and O(nmr/log^3/2 n), respectively, where the better result is possible if m/r = log^O(1) n.

Susik, R., Grabowski, Sz., Deorowicz, S., Fast and Simple Circular Pattern Matching, Chapter in book Man-Machine Interactions 3, Series Advances in Intelligent Systems and Computing, Springer-Verlag Berlin Heidelberg, Gruca, A., Czachórski, T., Kozielski, S., Editors; 2014; 242:537–544, Abstract.

The problem of circular pattern matching is to find all rotations of a given pattern P in text T, both over a common alphabet. The pattern and any of its rotations are also called conjugates in the literature. For the online version of this problem we present a new general approach and use several matching techniques as components, based on bit-parallelism and filtering. The experimental results show the effectiveness of the method, with matching speeds reaching 7-8 GB/s for long patterns and natural language or protein data.

2013

Deorowicz, S., Danek, A., Bit-Parallel Algorithms for the Merged Longest Common Subsequence Problem, International Journal of Foundations of Computer Science, 2013; 24(8):1281–1298, Abstract.

It is often a necessity to compare some sequences to find out how similar they are. There are many similarity measures that can be used, e.g., longest common subse- quence, edit distance, sequence alignment. Recently a merged longest common subse- quence (MergedLCS) problem was formulated with applications in bioinformatics. We propose the bit-parallel algorithms for the MergedLCS problem and evaluate them in practice showing that they are usually tens times faster than the already published methods.

Deorowicz, S., Grabowski, Sz., Data compression for sequencing data, Algorithms for Molecular Biologgy, 2013; 8():Article no. 25, Abstract.

Post-Sanger sequencing methods produce tons of data, and there is a general agreement that the challenge to store and process them must be addressed with data compression. In this review we first answer the question `why compression' in a quantitative manner. Then we also answer the questions `what' and `how', by sketching the fundamental compression ideas, describing the main sequencing data types and formats, and comparing the specialized compression algorithms and tools. Finally, we go back to the question `why compression' and give other, perhaps surprising answers, demonstrating the pervasiveness of data compression techniques in computational biology.

Deorowicz, S., Debudaj-Grabysz, A., Grabowski, Sz., Disk-based k-mer counting on a PC, BMC Bioinformatics, 2013; 14():Article no. 160, Abstract.

Background The k-mer counting problem, which is to build the histogram of occurrences of every k-symbol long substring in a given text, is important for many bioinformatics applications. They include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection.
Results We propose a simple, yet efficient, parallel disk-based algorithm for counting k-mers. Experiments show that it usually offers the fastest solution to the considered problem, while demanding a relatively small amount of memory. In particular, it is capable of counting the statistics for short-read human genome data, in input gzipped FASTQ file, in less than 40 minutes on a PC with 16 GB of RAM and 6 CPU cores, and for long-read human genome data in less than 70 minutes. On a more powerful machine, using 32 GB of RAM and 32 CPU cores, the tasks are accomplished in less than half the time. No other algorithm for most tested settings of this problem and mammalian-size data can accomplish this task in comparable time. Our solution also belongs to memory-frugal ones; most competitive algorithms cannot efficiently work on a PC with 16 GB of memory for such massive data.
Conclusions By making use of cheap disk space and exploiting CPU and I/O parallelism we propose a very competitive k-mer counting procedure, called KMC. Our results suggest that judicious resource management may allow to solve at least some bioinformatics problems with massive data on a commodity personal computer.

Deorowicz, S., Danek, A., Grabowski, Sz., Genome compression: a novel approach for large collections, Bioinformatics, 2013; 29(20):2572–2578, Abstract.

Motivation: Genomic repositories are rapidly growing, as witnessed by the 1000 Genomes or the UK10K projects. Hence, compression of multiple genomes of the same species has become an active research area in the past years. The well-known large redundancy in human sequences is not easy to exploit because of huge memory requirements from traditional compression algorithms.
Results: We show how to obtain several times higher compression ratio than of the best reported results, on two large genome collections (1092 human and 775 plant genomes). Our inputs are variant call format files restricted to their essential fields. More precisely, our novel Ziv-Lempel-style compression algorithm squeezes a single human genome to ?400 KB. The key to high compression is to look for similarities across the whole collection, not just against one reference sequence, what is typical for existing solutions.
Availability: http://sun.aei.polsl.pl/tgc (also as Supplementary Material) under a free license.

Roguski, L., Deorowicz, S., Fluid motion modelling using vortex particle method on GPU, Studia Informatica, 2013; 34(1 (110)):31–48, Abstract.

In this paper we present the vortex-in-cell method aimed at graphic processor units. Inviscid fluid model is examined in domain with periodic boundary conditions. The leap-frogging vortex rings simulation results are presented with sample vortex rings collision voisualization. At the end the GPU solver performance advantage over CPU solver is presented.

2012

Szczesniak, M.W., Deorowicz, S., Gapski, J., Kaczynski, L., Makalowska, I., miRNEST database: an integrative approach in microRNA search and annotation, Nucleic Acids Research, 2012; 40(D1):D198–D204, Abstract.

Despite accumulating data on animal and plant microRNAs and their functions, existing public miRNA resources usually collect miRNAs from a very limited number of species. A lot of microRNAs, including those from model organisms, remain undiscovered. As a result there is a continuous need to search for new microRNAs. We present miRNEST (http://mirnest.amu.edu.pl), a comprehensive database of animal, plant and virus microRNAs. The core part of the database is built from our miRNA predictions conducted on Expressed Sequence Tags of 225 animal and 202 plant species. The miRNA search was performed based on sequence similarity and as many as 10 004 miRNA candidates in 221 animal and 199 plant species were discovered. Out of them only 299 have already been deposited in miRBase. Additionally, miRNEST has been integrated with external miRNA data from literature and 13 databases, which includes miRNA sequences, small RNA sequencing data, expression, polymorphisms and targets data as well as links to external miRNA resources, whenever applicable. All this makes miRNEST a considerable miRNA resource in a sense of number of species (544) that integrates a scattered miRNA data into a uniform format with a user-friendly web interface.

Deorowicz, S., Quadratic-time algorithm for a string constrained LCS problem, Information Processing Letters, 2012; 112(11):423–426, Abstract.

The problem of nding a longest common subsequence of two main sequences with some constraint that must be a substring of the result (STR-IC-LCS) was formulated recently. It is a variant of the constrained longest common subsequence problem. As the known algorithms for the STR-IC-LCS problem are cubic-time, the presented quadratic-time algorithm is signi cantly faster.

Gudys, A., Deorowicz, S., A Parallel Algorithm for the Constrained Multiple Sequence Alignment Problem Designed for GPUs, International Journal of Foundations of Computer Science, 2012; 23(4):877–902, Abstract.

Multiple sequence alignment (MSA) is one of the most important problems in computational biology. As availability of genomic and proteomic data constantly increases, new tools for processing this data in reasonable time are needed. One method of addressing this issue is parallelization. Nowadays, graphical processing units offer much more computational power than central processors, hence GPUs become more and more popular in computational-intensive tasks, including sequence alignment. We investigate the constrained multiple sequence alignment problem (CMSA) which allows some prior knowledge to be introduced to a final alignment. As a result we propose a GPU-parallel version of the Center Star algorithm which overtakes vastly its CPU-serial equivalent as well as the parallel version run on the quad-core processor. The speedups over CPU-serial algorithm were from 30 to 110 in the case of synthetic sets and from 55 to 75 for the real sequences obtained from the Pfam database.

Deorowicz, S., A Cover-Merging-Based Algorithm for the Longest Increasing Subsequence in a Sliding Window Problem, Computing and Informatics, 2012; 31(6):1217–1233, Abstract.

A longest increasing subsequence problem (LIS) is a well-known com- binatorial problem with applications mainly in bioinformatics, where it is used in various projects on DNA sequences. Recently, a number of generalisations of this problem were proposed. One of them is to find an LIS among all fixed-size windows of the input sequence (LISW). We propose an algorithm for the LISW problem ba- sed on cover representation of the sequence that outperforms the existing methods for some class of the input sequences.

2011

Deorowicz, S., Grabowski, Sz., Robust relative compression of genomes with random access, Bioinformatics, 2011; 27(21):2979–2986, Abstract.

Motivation: Storing, transferring and maintaining genomic databases becomes a major challenge because of the rapid technology progress in DNA sequencing and correspondingly growing pace at which the sequencing data are being produced. Efficient compression, with support for extraction of arbitrary snippets of any sequence, is the key to maintaining those huge amounts of data.
Results: We present an LZ77-style compression scheme for relative compression of multiple genomes of the same species. While the solution bears similarity to known algorithms, it offers significantly higher compression ratios at compression speed over an order of magnitude greater. In particular, 69 differentially encoded human genomes are compressed over 400 times at fast compression, or even 1000 times at slower compression (the reference genome itself needs much more space). Adding fast random access to text snippets decreases the ratio to ~300.
Availability: GDC is available at http://sun.aei.polsl.pl/gdc.

Deorowicz, S., Grabowski, Sz., Compression of DNA sequences in FASTQ format, Bioinformatics, 2011; 27(6):860–862, Abstract.

Motivation: Modern sequencing instruments are able to generate at least hundreds of millions short reads of genomic data. Those huge volumes of data require effective means to store them, provide quick access to any record and enable fast decompression.
Results: We present a specialized compression algorithm for genomic data in FASTQ format which dominates its competitor, G-SQZ, as is shown on a number of datasets from the 1000 Genomes Project (www.1000genomes.org).
Availability: DSRC is freely available at http://sun.aei.polsl.pl/dsrc.

Gudyś, A., Deorowicz, S., A Parallel GPU-Designed Algorithm for the Constrained Multiple Sequence Alignment Problem, Chapter in book Man-Machine Interactions 2, Series Advances in Intelligent and Soft Computing, Springer-Verlag Berlin Heidelberg, Czachórski, T., Kozielski, S., Stańczyk, U., Editors; 2011; 103:361–368, Abstract.

Modern graphical processing units (GPUs) offer much more computational power than modern CPUs, so it is natural that GPUs are often used for solving many computationally-intensive problems. One of the tasks of huge importance in bioinformatics is sequence alignment. We investigate its variant introduced a few years ago in which some additional requirement on the alignment is given. As a result we propose a parallel version of Center-Star algorithm computing the constrained multiple sequence alignment at the GPU. The obtained speedup over the serial CPU relative is in range [20, 200].

Pecka, P., Deorowicz, S., Nowak, M., Efficient Representation of Transition Matrix in the Markov Process Modeling of Computer Networks, Chapter in book Man-Machine Interactions 2, Series Advances in Intelligent and Soft Computing, Springer-Verlag Berlin Heidelberg, Czachórski, T., Kozielski, S., Stańczyk, U., Editors; 2011; 103:457–464, Abstract.

Markov chains are used in analysis in many fields. One of them is performance evaluation of computer systems, especially computer networks. For the analysis we use OLYMP object library that provides features to describe complex systems and find their statistical parameters. We present two compressed data structures for the most space-consuming parts of data processed by OLYMP. Then, we show how these improvements move the barrier of applicability of the utility.

2010

Deorowicz, S., Solving Longest Common Subsequence and Related Problems on Graphical Processing Units, Software–Practice and Experience, 2010; 40(8):673–700, Abstract.

Modern graphical processing units (GPUs) offer much more computational power than modern CPUs. Therefore, it is natural that GPUs are applied not only for their original purposes, but also for general processing (GPGPU). In the field of sequence processing, one of the most important problems is the measuring of sequence similarity. There are many sequence similarity measures, e.g., edit distance, longest common subsequence length, and their derivatives. We examine the possibility of speeding up the algorithms computing some of them. We chose three measures useful in different situations. The experimental results show that the GPU versions of the examined algorithms are faster than their serial counterparts by a factor between 4 and 65.

Deorowicz, S., Bit-Parallel Algorithm for the Constrained Longest Common Subsequence Problem, Fundamenta Informaticae, 2010; 99(4):409–433, Abstract.

The problem of finding a constrained longest common subsequence (CLCS) for the sequences A and B with respect to the sequence P was introduced recently. Its goal is to find a longest subsequence C of A and B such that P is a subsequence of C. Most of the algorithms solving the CLCS problem are based on dynamic programming. Bit-parallelism is a technique of using single bits in a machine word for concurrent computation. We propose the first bit-parallel algorithm computing a CLCS and/or its length which outperforms the other known algorithms in terms of speed.

Deorowicz, S., Obstój, J., Constrained Longest Common Subsequence Computing Algorithms in Practice, Computing and Informatics, 2010; 29(3):729–744, Abstract.

The problem of finding a constrained longest common subsequence (CLCS) for the sequences A and B with respect to the sequence P was introduced recently. Its goal is to find a longest subsequence C of A and B such that P is a subsequence of C. There are several algorithms solving the CLCS problem, but there is no real experimental comparison of them. The paper has two aims. Firstly, we propose an improvement to the algorithms by Chin et al. and Deorowicz based on an entry-exit points technique by He and Arslan. Secondly, we compare experimentally the existing algorithms for solving the CLCS problem.

2009

Deorowicz, S., On Some Variants of the Longest Increasing Subsequence Problem, Theoretical and Applied Informatics, 2009; 21(3–4):135–148, Abstract.

The problem of finding a longest increasing subsequence (LIS) is a well known task in sequence processing. We discuss a recently introduced variant of LIS, a minimal height longest increasing subsequence problem and propose a new algorithm for it, which improves its time complexity. Moreover, we define a family of similar problems and introduce algorithms solving them.

Deorowicz, S., Grabowski, Sz., A Hybrid Algorithm for the Longest Common Transposition-Invariant Subsequence Problem, Computing and Informatics, 2009; 28(5):729–744, Abstract.

The longest common transposition-invariant subsequence (LCTS) problem is a music information retrieval oriented variation of the classic LCS problem. There are basically only two known efficient approaches to calculate the length of the LCTS, one based on sparse dynamic programming and the other on bit-parallelism. In this work, we propose a hybrid algorithm picking the better of the two algorithms for individual subproblems. Experiments on music (MIDI), with 32-bit and 64-bit implementations, show that the proposed algorithm outperforms the faster of the two component algorithms by a factor of 1.4-2.0, depending on sequence lengths. Similar, if not better, improvements can be observed for random data with Gaussian distribution. Also for uniformly random data, the hybrid algorithm is the winner if the alphabet is neither too small (at least 32 symbols) nor too large (up to 128 symbols). Part of the success of our scheme is attributed to a quite robust component selection heuristic.

Deorowicz, S., Grabowski, Sz., On Two Variants of the Longest Increasing Subsequence Problem, Chapter in book Man-Machine Interactions, Series Advances in Intelligent and Soft Computing, Springer-Verlag Berlin Heidelberg, Cyran, K.A., Kozielski, S., Peters J.F., Stańczyk, U., Wakulicz-Deja, A., Editors; 2009; 59:541–549, Abstract.

Finding a longest increasing subsequence (LIS) of a given sequence is a classic problem in string matching, with applications mostly in computational biology. Recently, many variations of this problem have been introduced. We present new algorithms for two such problems: the longest increasing circular subsequence (LICS) and the slope-constrained longest increasing subsequence (SLIS). For LICS, our algorithm improves one of the most competitive techniques if the length of the output sequence is close to its expected value 2n^1/2+o(n^1/2). In the algorithm for SLIS, we show how to gain from modern successor search data structures, which is not trivial for this task.

Deorowicz, S., Computing the longest common transposition-invariant subsequence with GPU, Chapter in book Man-Machine Interactions, Series Advances in Intelligent and Soft Computing, Springer-Verlag Berlin Heidelberg, Cyran, K.A., Kozielski, S., Peters J.F., Stańczyk, U., Wakulicz-Deja, A., Editors; 2009; 59:551–559, Abstract.

Finding a longest common transposition-invariant subsequence (LCTS) of two given integer sequences A=a₁a₂...a_m and B=b₁b₂...b_n (a generalization of the well-known longest common subsequence problem (LCS)) has arisen in the field of music information retrieval. In the LCTS problem, we look for an LCS for the sequences A+t=(a₁+t)(a₂+t)...(a_m+t) and B where t is any integer. Performance of the top graphical processing units (GPUs) outgrew the performance of the top CPUs a few years ago and there is a surge of interest in recent years in using GPUs for general processing. We propose and evaluate a bit-parallel algorithm solving the LCTS problem on a GPU.

Deorowicz, S., An algorithm for solving the longest increasing circular subsequence problem, Information Processing Letters, 2009; 109(12):630–634, Abstract.

The problem of finding a longest increasing circular subsequence (LICS) consists in computing a longest increasing subsequence of any rotation of sequence S. We present an algorithm for with O(min(n l, n log n + l³ log n)) time complexity, where l is the LICS length, which is competitive to the best known ones. The main ideas of the proposed approach are to deal with the cover representation of the sequence, provide fast merging techniques of the covers, and to skip computation of longest increasing subsequence (LIS) for most rotations.

2008

Deorowicz, S., Algorithms solving the constrained longest common subsequence problem (in Polish), In Proceedings of SWD, 2008; 295–303, Abstract.

Problem wyznaczania najdłuższego wymuszonego wspólnego podciągu (NWWP) dla ciągów A, B, P polega na wyznaczeniu najdłuższego takiego podciągu C, który jest podciągiem A oraz B a także zawiera P jako podciąg. Problem ten jest uogólnieniem szeroko znanego problemu wyznaczania najdłuższego wspólnego podciągu (NWP) i pojawia się przede wszystkim przy porównywaniu sekwencji biologicznych (takich jak sekwencje DNA i białkowe). W niniejszej pracy omówiono istniejące algorytmy dla problemów NWP i NWWP, porównano praktyczną wydajność algorytmów dla problemu NWWP oraz przedstawiono perspektywy dalszych badań nad algorytmami dla problemu NWWP, w szczególności omówiono możliwość skonstruowania dla tego problemu algorytmu opartego na technice bitowej równoległości.

Grabowski Sz., Deorowicz, S., Efficient preprocessing for web log compression, International Scientific Journal of Computing, 2008; 7(1):35–42, Abstract.

Web log files, storing user activity on a server, may grow at the pace of hundreds of megabytes a day, or even more, on popular sites. They are usually archived, as it enables further analysis, e.g., for detecting attacks or other server abuse patterns. In this work we present a specialized lossless Apache web log preprocessor and test it with combination of several popular general-purpose compressors. Our method works on individual fields of log data (each storing such information like the client's IP, date/time, requested file or query, download size in bytes, etc.), and utilizes such compression techniques like finding and extracting common prefixes and suffixes, dictionary-based phrase sequence substitution, move-to-front coding, and more. The test results show the proposed transform improves the average compression ratios 2.70 times in case of gzip and 1.86 times in case of bzip2.

Deorowicz, S., Grabowski Sz., Sub-atomic field processing for improved web log compression, In Proceedings of 9th International Conference on Modern Problems of Radio Engineering, Telecommunications and Computer Science (TCSET'2008), 2008; 551–556, Abstract.

Web log files, storing user activity on a server, may grow at the pace of hundreds of megabytes a day, or even more, on popular sites. It makes sense to archive old logs, to analyze them further, e.g., for detecting attacks or other server abuse patterns. In this work we present a specialized lossless Apache web log preprocessor and test it with combination of several popular general-purpose compressors. Our method works on individual fields of log data (each storing such information like the client's IP, date/time, requested file or query, download size in bytes, etc.), and utilizes such compression techniques like finding and extracting common prefixes and suffixes, dictionary-based phrase sequence substitution, move-to-front coding, and more. The test results show the proposed transform improves the average compression ratios 2.64 times in case of gzip and 1.83 times in case of bzip2.
Full text in pdf

Grabowski Sz., Deorowicz, S., Nice To Be A Chimera: A Hybrid Algorithm For The Longest Common Transposition-Invariant Subsequence Problem, In Proceedings of 9th International Conference on Modern Problems of Radio Engineering, Telecommunications and Computer Science (TCSET'2008), 2008; 50–54, Abstract.

2007

Deorowicz, S., Fast algorithm for the constrained longest common subsequence problem, Theoretical and Applied Informatics, 2007; 19(2):91–102, Abstract.

The problem of finding the constrained longest common subsequence (CLCS) for the sequences A and B with respect to the sequence P was introduced recently. Its goal is to find the longest subsequence of A and B such that P is a subsequence of it. The best known algorithms for its solving require time of order of a product of the sequence lengths. We introduce a novel approach in which time and space complexities depend also on the number of matches between A, B, and P. The time complexity is better for typical parameter settings and never worse.
Full text in pdf

Grabowski Sz., Deorowicz, S., Web log compression, AGH Automatyka, 2007; 11(3):417–424, Abstract.

Web log data store client activity on a particular server, usually in form of one-line "hits" with information like the client's IP, date/time, requested file or query, download size in bytes etc. Web logs of popular sites may grow at the pace of hundreds of megabytes a day, or even more. It makes sense to archive old logs, to analyze them further, e.g. for detecting attacks or other server abuse patterns. In this work we present a specialized lossless Apache web log preprocessor and test it with combination of several popular general-purpose compressors. The test results show the proposed transform improves the compression efficiency of general purpose compressors on average by 65% in case of gzip and 52% in case of bzip2.
Full text in pdf

2006

Deorowicz, S., Speeding up Transposition-Invariant String Matching, Information Processing Letters, 2006; 100(1):14–20, Abstract.

Finding the longest common subsequence (LCS) of two given sequences A=a₀ a₁ ... a_m-1 and B=b₀ b₁ ... b_n-1 is an important and well studied problem. We consider its generalization, transposition-invariant LCS (LCTS), which has recently arisen in the field of music information retrieval. In LCTS, we look for the longest common subsequence between the sequences A+t = (a₀+t)(a₁+t)... (a_m-1+t), and B where t is any integer. This means that shifting all the symbols in the A sequence by some value is allowed. We introduce a family of algorithms (motivated by the Hunt-Szymanski scheme for LCS), improving the currently best known complexity from O(mn log log σ) to O(D log log σ + mn), where σ is the alphabet size and D < mn is the total number of dominant matches for all transpositions. Then, we demonstrate experimentally that some of our algorithms outperform the best ones from literature.
Full text in pdf

2005

Deorowicz, S., Ciura, M., Correcting spelling errors by modelling their causes, International Journal of Applied Mathematics and Computer Science, 2005; 15(2):275–285, Abstract.

This paper accounts for a new technique of correcting isolated words in typed texts. A language-dependent set of string substitutions reflects the surface form of errors that result from vocabulary incompetence, misspellings, or mistypings. Candidate corrections are formed by applying the substitutions to text words absent from the computer lexicon. A minimal acyclic deterministic finite automaton storing the lexicon allows for quick rejecting of nonsense corrections, while costs associated with the substitutions serve to rank the remaining ones. A comparison of the correction lists generated by several spellcheckers for two corpora of English spelling errors shows that our technique suggests the right words more accurately than the others.
Full text in pdf
Substitution rules used in the paper (txt)
Wikipedia data set used in the work (txt)
Aspell data set used in the work (txt)

Deorowicz, S., Context exhumation after the Burrows–Wheeler transform, Information Processing Letters, 2005; 95(1):313–320, Abstract.

The essence of compression algorithms based on the Burrows–Wheeler transform is their first stage. In this stage, the information about the symbol contexts in the original sequence is lost and cannot be used in the rest of the algorithm. We show how to obtain some knowledge of the symbol contexts after the BWT. Using this information makes the prediction of symbol occurrence in further stages more accurate, which is confirmed by experiments.

Skibiński, P., Grabowski, Sz., Deorowicz, S., Revisiting dictionary-based compression, Software–Practice and Experience, 2005; 35(15):1455–1476, Abstract.

One of the attractive ways to increase text compression is to replace words with references to a text dictionary given in advance. Although there exist a few works in this area, they do not fully exploit the compression possibilities or consider alternative preprocessing variants for various compressors in the latter phase. In this paper, we discuss several aspects of dictionary-based compression, including compact dictionary representation, and present a PPM/BWCA oriented scheme, word replacing transformation, achieving compression ratios higher by 2–6% than state-of-the-art StarNT (2003) text preprocessor, working at a greater speed. We also present an alternative scheme designed for LZ77 compressors, with the advantage over StarNT reaching up to 14% in combination with gzip.
Full text in pdf
WRT source code and Win executable
WRT-LZ source code and Win executable

2002

Deorowicz, S., Second step algorithms in the Burrows–Wheeler compression algorithm, Software–Practice and Experience, 2002; 32(2):99–111, Abstract.

In this paper we fix our attention on the second step algorithms of the Burrows–Wheeler compression algorithm, which in the original version is the Move To Front algorithm. We discuss many of its replacements presented so far, and compare compression results obtained using them. Then we propose a new algorithm that yields a better compression ratio than the previous ones.
Full text in pdf

2001

Ciura, M.G., Deorowicz, S., How to squeeze a lexicon, Software–Practice and Experience, 2001; 31(11):1077–1090, Abstract.

Minimal acyclic deterministic finite automata (DFAs) can be used as a compact representation of string sets with fast access time. Creating them with traditional algorithms of DFA minimization consumes lots of resources when a large number of strings is involved. The goal of this paper is to popularize an efficient but little known algorithm for creating minimal acyclic DFAs recognizing a finite language, invented independently by several authors. We present this algorithm for three variants of DFAs, discuss its minor improvements, and compare DFAs and other data structures.
Full text in PostScript
Source code in ANSI C

2000

Deorowicz, S., Improvements to Burrows–Wheeler compression algorithm, Software–Practice and Experience, 2000; 30(13):1465–1483, Abstract.

In 1994 Burrows and Wheeler presented a new algorithm for lossless data compression. The compression ratio, which can be achieved using their algorithm is comparable with the best known other algorithms, whilst its complexity is relatively small. In this paper we explain the internals of this algorithm and discuss its various modifications which have been presented so far. Next we offer new improvements for its effectiveness. They allow us for obtaining the compression ratio equal to 2.271 bpc for Calgary Corpus files, which is the best result in the class of Burrows–Wheeler Transform based algorithms.
Full text in PostScript

1997

Deorowicz, S., Solving the Delivery Problem Using k-Optimal Algorithms (in Polish), Zeszyty Naukowe Politechniki Śląskiej, Seria Informatyka, 1997; 33(1):139–146, Abstract.

The Delivery Problem is a difficult combinatorial optimization problem that belongs to the NP-complete group. Since k-optimal algorithms are suitable for solving the problems of this class, we have tried to adopt these algorithms to the delivery problem. In the paper the computational results of some modifications of the k-optimal algorithm for solving the delivery problem for sets of various sizes are presented.

Books

2007

Czech, Z.J., Fabian, P., Deorowicz, S., Algorytmy i struktury danych. Wybrane zagadnienia (in Polish), Wydawnictwo Politechniki Śląskiej, 2007; Abstract.

Skrypt nr 2400.

DSc dissertation

2011

Deorowicz, S., Sequential and parallel algorithms for subsequence finding, Silesian University of Technology, 2011; Abstract.

PhD dissertation

2003

Deorowicz, S., Universal lossless data compression algorithms, Silesian University of Technology, 2003; Abstract.

The disseration concerns universal lossless data compression algorithms such as LZ, PPM, and BWCA methods. A new algorithm based on the Burrows–Wheeler transform is proposed. Its most important features are improved Itoh–Tanaka method for computing BWT, Weighted Frequency Count transform (instead of the MTF), and weighted probability estimation. The performance of the algorithm is evaluated on the Calgary and Silesia corpora.
Full text in pdf
Errata in pdf

Master's thesis

1998

Deorowicz, S., Heurystyczne i genetyczne algorytmy rozwiązywania problemu dostawy (in Polish), Silesian University of Technology, 1998; Abstract.

Technical reports

2009

Deorowicz, S., Obstój, J., Constrained Longest Common Subsequence Computing Algorithms in Practice, 2009; Abstract.

The problem of finding a constrained longest common subsequence (CLCS) for the sequences A and B with respect to the sequence P was introduced recently. Its goal is to find a longest subsequence C of A and B such that P is a subsequence of C. There are several algorithms solving the CLCS problem, but there is no real experimental comparison of them. The paper has two aims. Firstly, we propose an improvement to the algorithms by Chin et al. and Deorowicz based on an entryexit points technique by He and Arslan. Secondly, we compare experimentally the existing algorithms for solving the CLCS problem.
Full text in pdf

2005

Deorowicz, S., Fast algorithm for constrained longest common subsequence problem, 2005; Abstract.

The problem of finding the longest constrained common subsequence (CLCS) for the sequences A and B with respect to the sequence P was introduced recently. The best known algorithms for its solving requires time of order of a product of the sequences length. We introduce a novel approach in which time and memory complexities depends on the number of matches between A, B, and P. The time complexity is never worse than the one of the known algorithms, but typically is better. Experiments show that our method is faster than the known ones and requires less memory.
Full text in pdf

Deorowicz, S., Speeding up Transposition-Invariant String Matching, 2005; Abstract.

Finding the longest common subsequence (LCS) of two given sequences A=a₀ a₁ ... a_m-1 and B=b₀ b₁ ... b_n-1 is an important and well studied problem. We consider its generalization, transposition-invariant LCS (LCTS), which has recently arisen in the field of music information retrieval. In LCTS, we look for the longest common subsequence between the sequences A+t = (a₀+t)(a₁+t)... (a_m-1+t), and B where t is some integer. This means that shifting all the symbols in the A sequence by some value is allowed. We present two new algorithms, matching the currently best known complexity O(mn log log σ), where σ is the alphabet size. Then, we show in the experiments that our algorithms outperform the best ones from literature.
Full text in pdf

2004

Navarro, G., Grabowski, Sz., Mäkinen, V., Deorowicz, S., Improved Time and Space Complexities for Transposition Invariant String Matching, 2004; Abstract.

Given strings A=a₁ a₂ ... a_m and B=b₁ b₂ ... b_n over a finite alphabet Σ⊂Z of size O(σ), and a distance d() defined among strings, the transposition invariant version of d() is d^t(A, B)=min_t∈Z(A+ t, B), where A+t=(a₁+t)(a₂+t)...(a_m+t). Distances d() of most interest are Levenshtein distance and indel distance (the dual of the Longest Common Subsequence), which can be computed in O(mn) time. Recent algorithms compute d^t(A, B) in O(mn log log min(m,n)) time for those distances. In this paper we show how those complexities can be reduced to O(mn log log σ). Furthermore, we reduce the space requirements from O(mn) to O(σ²+min(m,n)).
Full text in pdf