
Deorowicz, S., *FQSqueezer: k-mer-based compression of sequencing data*,
bioRxiv.org,
2019, Abstract.

**Motivation**: The amount of genomic data that needs to be stored is huge. Therefore it is not surprising that a lot of work has been done in the field of specialized data compression of FASTQ files. The existing algorithms are, however, still imperfect and the best tools produce quite large archives.

**Results**: We present FQSqueezer, a novel compression algorithm for sequencing data, able to process single- and paired-end reads of variable lengths. It is based on ideas from the famous prediction by partial matching and dynamic Markov coder algorithms known from the world of general-purpose compressors. The compression ratios are often tens of percent better than those offered by the state-of-the-art tools.
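The prediction-by-partial-matching idea mentioned above can be illustrated with a toy order-2 context model (a minimal sketch only; the function names are hypothetical and this is not FQSqueezer's actual model):

```python
from collections import Counter, defaultdict

def train_context_model(sequence, order=2):
    # Count which symbol follows each length-`order` context,
    # in the spirit of prediction by partial matching (PPM).
    counts = defaultdict(Counter)
    for i in range(order, len(sequence)):
        counts[sequence[i - order:i]][sequence[i]] += 1
    return counts

def predict(counts, context):
    # Most probable next symbol for `context`, or None if unseen.
    if context not in counts:
        return None
    return counts[context].most_common(1)[0][0]

model = train_context_model("ACGTACGTACGA")
```

A real PPM coder feeds such per-context counts into an arithmetic coder, falling back to shorter contexts when the current one is unseen.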

Deorowicz, S., Gudys, A., *Whisper 2: indel-sensitive short read mapping*,
bioRxiv.org,
2019, Abstract.

**Summary**: Whisper 2 is a short-read-mapping software providing superior quality of indel variant calling. Its running times place it among the fastest existing tools.

Deorowicz, S., Walczyszyn, J., Debudaj-Grabysz, A., *CoMSA: compression of protein multiple sequence alignment files*,
Bioinformatics,
2019; 35(2):227–234, Abstract.

**Motivation**: Bioinformatics databases grow rapidly, reaching sizes hard to imagine a decade ago.
Among the numerous bioinformatics processes generating hundreds of gigabytes are multiple sequence alignments of
protein families. The largest database of them, Pfam, consumes 40–230 GB, depending on the variant. Storage
and transfer of such massive data have become a challenge.

**Results**: We propose CoMSA, a novel compression algorithm
designed especially for aligned data. It is based on a generalisation of the positional Burrows–Wheeler
transform to non-binary alphabets. CoMSA handles FASTA as well as Stockholm files. It offers up to six
times better compression ratio than commonly used general-purpose compressors such as gzip. The performed
experiments also analyse the influence of protein family size on the compression ratio.
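The non-binary generalisation of the positional Burrows–Wheeler transform can be sketched as follows (a toy illustration under the assumption that a stable per-column sort refines the row order; this is not CoMSA's implementation):

```python
def pbwt_columns(alignment):
    # Emit each alignment column permuted by a row order that is refined,
    # column by column, with a stable sort on the symbols seen so far
    # (a non-binary generalisation of the positional BWT).
    order = list(range(len(alignment)))
    out = []
    for col in range(len(alignment[0])):
        out.append("".join(alignment[i][col] for i in order))
        # A stable sort keeps rows with equal column prefixes adjacent,
        # so the emitted columns tend to form long runs of equal symbols.
        order = sorted(order, key=lambda i: alignment[i][col])
    return out

cols = pbwt_columns(["ACCA", "GCCA", "ACGA", "GCGA"])
```

The permuted columns then compress well with run-length or entropy coding.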

Kawulok, J., Kawulok, M., Deorowicz, S., *Environmental metagenome classification for constructing a microbiome fingerprint*,
Biology Direct,
2019; 14: Article no. 20, Abstract.

**Background**: Nowadays, not only single genomes are commonly analyzed, but also metagenomes: sets of DNA fragments (reads) derived from microbes living in a given environment. Metagenome analysis is aimed at extracting crucial information on the organisms that have left their traces in an investigated environmental sample. In this study we focus on the MetaSUB Forensics Challenge (organized within the CAMDA 2018 conference), which consists in predicting the geographical origin of metagenomic samples. Contrary to the existing methods for environmental classification, which are based on taxonomic or functional classification, we rely on the similarity between a sample and the reference database computed at the read level.

**Results**: We report the results of our extensive experimental study to investigate the behavior of our method and its sensitivity to different parameters. In our tests, we have followed the protocol of the MetaSUB Challenge, which allowed us to compare the obtained results with the solutions based on taxonomic and functional classification.

**Conclusions**: The results reported in the paper indicate that our method is competitive with those based on taxonomic classification. Importantly, by measuring the similarity at the read level, we avoid the necessity of using large databases with annotated gene sequences. Hence, our main finding is that environmental classification of metagenomic data can be performed without the large databases required for taxonomic or functional classification.

Deorowicz, S., Danek, A., *GTShark: Genotype compression in large projects*,
Bioinformatics,
2019; 35(22):4791–4793, Abstract.

**Summary**:
Nowadays, large sequencing projects handle tens of thousands of individuals, and the huge files summarizing their findings definitely require compression. We propose a tool able to compress large collections of genotypes almost 30% better than the best tool to date, squeezing a single human genotype to less than 62 KB. Moreover, it can also compress single samples against an existing database, achieving comparable results.

**Availability and implementation**: https://github.com/refresh-bio/GTShark.

Deorowicz, S., Gudys, A., Dlugosz, M., Kokot, M., Danek, A., *Kmer-db: instant evolutionary distance estimation*,
Bioinformatics,
2019; 35(1):133–136, Abstract.

**Summary**:
Kmer-db is a new tool for estimating evolutionary relationships on the basis of k-mers
extracted from genomes or sequencing reads.
Thanks to an efficient data structure and parallel implementation,
our software estimates distances between 40,715 pathogens in <7 min (on a modern workstation),
26 times faster than Mash, its main competitor.
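A k-mer-based distance of the kind Kmer-db estimates can be illustrated with a minimal Mash-style sketch (the Jaccard-to-distance formula is Mash's; the function names are hypothetical):

```python
import math

def kmer_set(seq, k):
    # All distinct k-mers of a sequence.
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def mash_distance(a, b, k):
    # Jaccard index of the two k-mer sets, converted to an
    # evolutionary distance estimate with Mash's formula.
    sa, sb = kmer_set(a, k), kmer_set(b, k)
    j = len(sa & sb) / len(sa | sb)
    if j == 0:
        return 1.0  # no shared k-mers: maximal distance
    return -math.log(2 * j / (1 + j)) / k

d = mash_distance("ACGTACGTAC", "ACGTACGTAC", 4)  # identical sequences
```

Kmer-db's speed comes from computing such set intersections for thousands of genome pairs at once over a compact shared k-mer index.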

**Availability and implementation**:
https://github.com/refresh-bio/kmer-db.

Deorowicz, S., Debudaj-Grabysz, A., Gudys, A., Grabowski, Sz., *Whisper: read sorting allows robust mapping of DNA sequencing data*,
Bioinformatics,
2019; 35(12):2043–2050, Abstract.

**Motivation**: Mapping reads to a reference genome is often the first step in a sequencing data analysis pipeline. The reduction of sequencing costs implies a need for algorithms able to process increasing amounts of generated data in reasonable time.

**Results**: We present Whisper, an accurate and high-performance mapping tool based on the idea of sorting reads and then mapping them against suffix arrays for the reference genome and its reverse complement. Employing task and data parallelism, as well as storing temporary data on disk, results in superior time efficiency at reasonable memory requirements. Whisper excels at large NGS read collections, in particular Illumina reads with typical WGS coverage. The experiments with real data indicate that our solution works in about 15% of the time needed by the well-known BWA-MEM and Bowtie2 tools at a comparable accuracy, validated in a variant calling pipeline.

Deorowicz, S., Grabowski, Sz., *DeltaComp: Fast and efficient compression of astronomical timelines*,
New Astronomy,
2018; 65:59–66, Abstract.

Roguski, L., Ochoa, I., Hernaez, M., Deorowicz, S., *FaStore: a space-saving solution for raw sequencing data*,
Bioinformatics,
2018; 34(16):2748–2756, Abstract.

**Motivation**:
The affordability of DNA sequencing has led to the generation of unprecedented volumes of raw sequencing data. These data must be stored, processed and transmitted, which poses significant challenges. To facilitate this effort, we introduce FaStore, a specialized compressor for FASTQ files. FaStore does not use any reference sequences for compression and permits the user to choose from several lossy modes to improve the overall compression ratio, depending on the specific needs.

**Results**:
FaStore in the lossless mode achieves a significant improvement in compression ratio with respect to previously proposed algorithms. We perform an analysis on the effect that the different lossy modes have on variant calling, the most widely used application for clinical decision making, especially important in the era of precision medicine. We show that lossy compression can offer significant compression gains, while preserving the essential genomic information and without affecting the variant calling performance.

**Availability and implementation**: FaStore can be downloaded from
https://github.com/refresh-bio/FaStore.

Danek, A., Deorowicz, S., *GTC: how to maintain huge genotype collections in a compressed form*,
Bioinformatics,
2018; 34(11):1834–1840, Abstract.

**Motivation**:
Nowadays, genome sequencing is frequently used in many research centers. In projects such as the Haplotype Reference Consortium or the Exome Aggregation Consortium, huge databases of genotypes in large populations are determined. Together with the increasing size of these collections, the need for fast and memory-frugal ways of representing and searching them becomes crucial.

**Results**:
We present GTC (GenoType Compressor), a novel compressed data structure for representation of huge collections of genetic variation data. It significantly outperforms existing solutions in terms of compression ratio and time of answering various types of queries. We show that the largest publicly available database, of about 60,000 haplotypes at about 40 million SNPs, can be stored in <4 GB, while the queries related to variants are answered in a fraction of a second.

**Availability and implementation**: GTC can be downloaded from
https://github.com/refresh-bio/GTC or
http://sun.aei.polsl.pl/REFRESH/gtc.

Kokot, M., Deorowicz, S., Długosz, M., *Even Faster Sorting of (Not Only) Integers*,
Chapter in book *Man-Machine Interactions 5*,
Series *Advances in Intelligent Systems and Computing*,
Springer International Publishing Switzerland,
Gruca, A., Czachorski, T., Harezlak, K., Kozielski, S., Piotrowska, A., Editors; 2018; 659:481–491, Abstract.

Długosz, M., Deorowicz, S., Kokot, M., *Improvements in DNA Reads Correction*,
Chapter in book *Man-Machine Interactions 5*,
Series *Advances in Intelligent Systems and Computing*,
Springer International Publishing Switzerland,
Gruca, A., Czachorski, T., Harezlak, K., Kozielski, S., Piotrowska, A., Editors; 2018; 659:115–124, Abstract.

Kokot, M., Długosz, M., Deorowicz, S., *KMC 3: counting and manipulating k-mer statistics*,
Bioinformatics,
2017; 33(17):2759–2761, Abstract.

**Summary**: Counting all *k*-mers in a given dataset is a standard procedure in many bioinformatics applications.
We introduce KMC 3, a significant improvement of the former KMC 2 algorithm, together with KMC tools for manipulating *k*-mer databases.
The usefulness of the tools is shown on a few real problems.
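The core task, counting all *k*-mers, amounts to building a histogram; a minimal in-memory sketch (the real tool is disk-based and parallel, so the names below are illustrative only):

```python
from collections import Counter

def count_kmers(reads, k):
    # Histogram of every k-symbol substring across all reads.
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

hist = count_kmers(["ACGTACG", "CGTACGT"], 3)
```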

**Availability**: Program is freely available at http://sun.aei.polsl.pl/REFRESH/kmc.

Gudys, A., Deorowicz, S., *QuickProbs 2: towards rapid construction of high-quality alignments of large protein families*,
Scientific Reports,
2017; 7: Article no. 41553, Abstract.

Długosz, M., Deorowicz, S., *RECKONER: read error corrector based on KMC*,
Bioinformatics,
2017; 33(7):1086–1089, Abstract.

**Summary**: The presence of sequencing errors in data produced by next-generation sequencers affects the quality of downstream
analyses.
Their accuracy can be improved by performing error correction of sequencing reads.
We introduce a new correction algorithm capable of processing eukaryotic data of close to 500 Mbp genome size
with high error rates, using less than 4 GB of RAM, in about 35 min on a 16-core computer.

**Availability and Implementation**: Program is freely available at http://sun.aei.polsl.pl/REFRESH/reckoner.

**Contact**: sebastian.deorowicz@polsl.pl

**Supplementary information**: Supplementary data are available at Bioinformatics online.

Grabowski, Sz., Raniszewski, M., Deorowicz, S., *FM-index for Dummies*,
Chapter in book *Beyond Databases, Architectures, and Structures*,
Series *Communications in Computer and Information Science*,
Springer International Publishing Switzerland,
Kozielski, S., Mrozek, D., Kasprowski, P., Malysiak-Mrozek, B., Kostrzewa, D., Editors; 2017; 718:189–201, Abstract.

Kokot, M., Deorowicz, S., Debudaj-Grabysz, A., *Sorting Data on Ultra-Large Scale with RADULS: New Incarnation of Radix Sort*,
Chapter in book *Beyond Databases, Architectures, and Structures*,
Series *Communications in Computer and Information Science*,
Springer International Publishing Switzerland,
Kozielski, S., Mrozek, D., Kasprowski, P., Malysiak-Mrozek, B., Kostrzewa, D., Editors; 2017; 718:235–245, Abstract.

Deorowicz, S., Debudaj-Grabysz, A., Gudys, A., *FAMSA: Fast and accurate multiple sequence alignment of huge protein families*,
Scientific Reports,
2016; 6: Article no. 33964, Abstract.

Deorowicz, S., Grabowski, Sz., Ochoa, I., Hernaez, M., Weissman, T., *Comment on: "ERGC: An efficient referential genome compression algorithm"*,
Bioinformatics,
2016; 32(7):1115–1117, Abstract.

**Motivation**: Data compression is crucial in effective handling of genomic data.
Among several recently published algorithms, ERGC seems to be surprisingly good, easily beating all of the competitors.

**Results**: We evaluated ERGC and the previously proposed algorithms GDC and iDoComp,
which are the ones used in the original paper for comparison, on a wide data set including 12 assemblies
of the human genome (instead of only four in the original paper).
ERGC wins only when one of the genomes (referential or target) contains mixed-case letters (which is the case
for only the two Korean genomes).
In all other cases, ERGC is on average an order of magnitude worse than GDC and iDoComp.

Kabza, M., Kubiak, M.R., Danek, A., Rosikiewicz, W., Deorowicz, S., Polański, A., Makałowska, I., *Inter-population Differences in Retrogene Loss and Expression in Humans*,
PLOS Genetics,
2015; 11(10):e1005579, Abstract.

Kowalski, T., Grabowski, Sz., Deorowicz, S., *Indexing arbitrary-length k-mers in sequencing reads*,
PLOS ONE,
2015; 10(7):1–16, Abstract.

Deorowicz, S., Danek, A., Niemiec, M., *GDC 2: Compression of large collections of genomes*,
Scientific Reports,
2015; 5(11565):1–12, Abstract.

Kawulok, J., Deorowicz, S., *CoMeta: Classification of Metagenomes Using k-mers*,
PLOS ONE,
2015; 10(4):1–23, Abstract.

CoMeta classifies reads by exact matching of short subsequences (*k*-mers) and uses a fast program for indexing large sets of *k*-mers.
In contrast to the most popular methods based on BLAST, where the query is compared with each reference sequence, we begin the classification from the top of the taxonomy tree to reduce the number of comparisons.
The presented experimental study confirms that CoMeta outperforms other programs used in this context.
CoMeta is available at https://github.com/jkawulok/cometa under a free GNU GPL 2 license.

Deorowicz, S., Kokot, M., Grabowski, Sz., Debudaj-Grabysz, A., *KMC 2: Fast and resource-frugal k-mer counting*,
Bioinformatics,
2015; 31:1569–1576, Abstract.

**Motivation**: Building the histogram of occurrences of every *k*-symbol
long substring of nucleotide data is a standard step in many bioinformatics
applications, known under the name of *k*-mer counting. Its
applications include developing de Bruijn graph genome assemblers,
fast multiple sequence alignment and repeat detection. The tremendous
amounts of NGS data require fast algorithms for *k*-mer counting,
preferably using moderate amounts of memory.

**Results**: We present a novel method for *k*-mer counting, on large
datasets about twice as fast as the strongest competitors (Jellyfish
2, KMC 1), using about 12 GB (or less) of RAM. Our disk-based
method bears some resemblance to MSPKmerCounter, yet replacing
the original minimizers with signatures (a carefully selected subset
of all minimizers) and using (*k*, *x*)-mers significantly reduces
the I/O, and a highly parallel overall architecture allows
unprecedented processing speeds to be achieved. For example, KMC 2 counts the
28-mers of a human reads collection with 44-fold coverage (106 GB
of compressed size) in about 20 minutes, on a 6-core Intel i7 PC with
an SSD.
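The minimizer idea underlying the signatures can be sketched as follows (a toy version using plain lexicographic minimizers, not KMC 2's filtered signatures; the function names are hypothetical):

```python
def minimizer(kmer, m):
    # The lexicographically smallest m-mer inside a k-mer.
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))

def bucket_by_minimizer(kmers, m):
    # Distribute k-mers into bins keyed by their minimizer; overlapping
    # k-mers often share a minimizer, so they land in the same bin,
    # which is what reduces the disk I/O in a disk-based counter.
    buckets = {}
    for kmer in kmers:
        buckets.setdefault(minimizer(kmer, m), []).append(kmer)
    return buckets

bins = bucket_by_minimizer(["ACGTAC", "CGTACA", "TTTTGA"], 3)
```

KMC 2's signatures exclude minimizers that would create very uneven bins (e.g. runs like `AAA…`), balancing the bucket sizes.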

**Availability**: KMC 2 is freely available at http://sun.aei.polsl.pl/kmc.

Grabowski, Sz., Deorowicz, S., Roguski, L., *Disk-based compression of data from genome sequencing*,
Bioinformatics,
2015; 31:1389–1395, Abstract.

**Motivation**: High-coverage sequencing data have significant, yet hard to exploit, redundancy.
Most FASTQ compressors cannot efficiently compress the DNA stream of large datasets, since the redundancy
between overlapping reads cannot be easily captured in the (relatively small) main memory.
More interesting solutions for this problem are disk-based (Yanovsky, 2011; Cox et al., 2012),
where the better of the two, from Cox et al. (2012), is based on the Burrows–Wheeler transform (BWT)
and achieves 0.518 bits per base for a 134.0 Gb human genome sequencing collection with almost 45-fold coverage.

**Results**: We propose ORCOM (Overlapping Reads COmpression with Minimizers), a compression algorithm dedicated to sequencing reads (DNA only).
Our method makes use of a conceptually simple and easily parallelizable idea of minimizers to obtain a compression ratio of 0.317 bits per base, allowing the 134.0 Gb dataset to fit into only 5.31 GB of space.
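A quick sanity check of the quoted figures, converting bits per base into archive size:

```python
# 0.317 bits per base over a 134.0 Gb (gigabase) collection:
bases = 134.0e9
bits_per_base = 0.317
size_gb = bases * bits_per_base / 8 / 1e9  # bits -> bytes -> GB
```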

**Availability**: http://sun.aei.polsl.pl/orcom under a free license.

**Supplementary data**: available at Bioinformatics online.

Danek, A., Deorowicz, S., Grabowski, Sz., *Indexes of Large Genome Collections on a PC*,
PLOS ONE,
2014; 9(10):e109384, Abstract.

Deorowicz, S., Grabowski, Sz., *Efficient algorithms for the longest common subsequence in k-length substrings*,
Information Processing Letters,
2014; 114(11):634–638, Abstract.

The longest common subsequence in *k*-length substrings
(LCS*k*) is a recently proposed problem motivated by computational
biology. This is a generalization of the well-known LCS problem
in which matching symbols from two sequences *A* and *B* are replaced
with matching non-overlapping substrings of length *k* from *A* and *B*. We
propose several algorithms for LCS*k*, being non-trivial incarnations of
the major concepts known from LCS research (dynamic programming,
sparse dynamic programming, tabulation). Our algorithms make use of
a linear-time and linear-space preprocessing finding the occurrences of
all the substrings of length *k* from one sequence in the other sequence.
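A baseline dynamic programming formulation of LCS*k* (the simplest of the concepts the paper builds on, not the paper's refined algorithms) can be written as:

```python
def lcsk(a, b, k):
    # dp[i][j]: maximum number of matching, non-overlapping length-k
    # substrings using the prefixes a[:i] and b[:j].
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
            # Extend with one more matching k-length substring, if possible.
            if i >= k and j >= k and a[i - k:i] == b[j - k:j]:
                dp[i][j] = max(dp[i][j], dp[i - k][j - k] + 1)
    return dp[n][m]
```

For k = 1 this degenerates to the classic LCS recurrence.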

Roguski, L., Deorowicz, S., *DSRC 2: Industry-oriented compression of FASTQ files*,
Bioinformatics,
2014; 30(15):2213–2215, Abstract.

**Summary:**
Modern sequencing platforms produce huge amounts of data.
Archiving them raises major problems but is crucial for reproducibility of results, one of the most fundamental principles of science.
The widely used gzip compressor, used for reduction of storage and transfer costs, is not a perfect solution,
so a few specialized FASTQ compressors were proposed recently.
Unfortunately, they are often impractical due to slow processing, lack of support for some variants of FASTQ files, or instability.
We propose DSRC 2 that offers compression ratios comparable to the best existing solutions, while being a few times faster and more flexible.

**Availability and Implementation:**
DSRC 2 is freely available at http://sun.aei.polsl.pl/dsrc.
The package contains a command-line compressor, C++ and Python libraries for easy integration with existing software,
and technical documentation with examples of usage.

Gudys, A., Deorowicz, S., *QuickProbs – A Fast Multiple Sequence Alignment
Algorithm Designed for Graphics Processors*,
PLOS ONE,
2014; 9(2):e88901, Abstract.

Wieczorek, B., Polomski, M., Pecka, P., Deorowicz, S., *An Effective Way of Storing and Accessing Very Large Transition Matrices Using Multi-core CPU and GPU Architectures*,
Chapter in book *Beyond Databases, Architectures, and Structures*,
Series *Communications in Computer and Information Science*,
Springer International Publishing Switzerland,
Kozielski, S., Mrozek, D., Kasprowski, P., Malysiak-Mrozek, B., Kostrzewa, D., Editors; 2014; 424:323–334, Abstract.

Kawulok, J., Deorowicz, S., *An Improved Algorithm for Fast and Accurate Classification of Sequences*,
Chapter in book *Beyond Databases, Architectures, and Structures*,
Series *Communications in Computer and Information Science*,
Springer International Publishing Switzerland,
Kozielski, S., Mrozek, D., Kasprowski, P., Malysiak-Mrozek, B., Kostrzewa, D., Editors; 2014; 424:335–344, Abstract.

Danek, A., Deorowicz, S., *Bit-Parallel Algorithm for the Block Variant of the Merged Longest Common Subsequence Problem*,
Chapter in book *Man-Machine Interactions 3*,
Series *Advances in Intelligent Systems and Computing*,
Springer-Verlag Berlin Heidelberg,
Gruca, A., Czachórski, T., Kozielski, S., Editors; 2014; 242:173–181, Abstract.

Deorowicz, S., Debudaj-Grabysz, A., Gudyś, A., *Kalign-LCS – A More Accurate and Faster Variant of Kalign2 Algorithm for the Multiple Sequence Alignment Problem*,
Chapter in book *Man-Machine Interactions 3*,
Series *Advances in Intelligent Systems and Computing*,
Springer-Verlag Berlin Heidelberg,
Gruca, A., Czachórski, T., Kozielski, S., Editors; 2014; 242:495–502, Abstract.

Deorowicz, S., Grabowski, Sz., *Subcubic Algorithms for the Sequence Excluded LCS Problem*,
Chapter in book *Man-Machine Interactions 3*,
Series *Advances in Intelligent Systems and Computing*,
Springer-Verlag Berlin Heidelberg,
Gruca, A., Czachórski, T., Kozielski, S., Editors; 2014; 242:503–510, Abstract.

The straightforward dynamic programming algorithm for this problem runs in *O*(*nmr*) time, where *n* and *m* are the two sequence lengths, and *r* is the length of the constraining sequence. We present two algorithms, with time complexities of *O*(*nmr*/log *n*) and *O*(*nmr*/log^{3/2} *n*), respectively, where the better result is possible if *m*/*r* = log^{O(1)} *n*.

Susik, R., Grabowski, Sz., Deorowicz, S., *Fast and Simple Circular Pattern Matching*,
Chapter in book *Man-Machine Interactions 3*,
Series *Advances in Intelligent Systems and Computing*,
Springer-Verlag Berlin Heidelberg,
Gruca, A., Czachórski, T., Kozielski, S., Editors; 2014; 242:537–544, Abstract.

The circular pattern matching problem is to find all occurrences of rotations of a pattern *P* in a text *T*, both over a common alphabet. The pattern and any of its rotations are also called conjugates in the literature. For the online version of this problem we present a new general approach and use several matching techniques as components, based on bit-parallelism and filtering. The experimental results show the effectiveness of the method, with matching speeds reaching 7–8 GB/s for long patterns and natural language or protein data.
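A naive baseline for circular pattern matching (checking every rotation directly, without the paper's bit-parallelism and filtering) looks like:

```python
def circular_matches(pattern, text):
    # Positions in `text` where any rotation (conjugate) of `pattern` occurs.
    rotations = {pattern[i:] + pattern[:i] for i in range(len(pattern))}
    m = len(pattern)
    return [i for i in range(len(text) - m + 1) if text[i:i + m] in rotations]
```

The filtering approach in the paper avoids comparing every text window against every rotation, which is what makes multi-GB/s speeds possible.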

Deorowicz, S., Danek, A., *Bit-Parallel Algorithms for the Merged Longest Common Subsequence Problem*,
International Journal of Foundations of Computer Science,
2013; 24(8):1281–1298, Abstract.

Deorowicz, S., Grabowski, Sz., *Data compression for sequencing data*,
Algorithms for Molecular Biology,
2013; 8: Article no. 25, Abstract.

Deorowicz, S., Debudaj-Grabysz, A., Grabowski, Sz., *Disk-based k-mer counting on a PC*,
BMC Bioinformatics,
2013; 14: Article no. 160, Abstract.

**Background**
The k-mer counting problem, which is to build the histogram of occurrences of every k-symbol long substring in a given text, is important for many bioinformatics applications. They include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection.

**Results**
We propose a simple, yet efficient, parallel disk-based algorithm for counting k-mers. Experiments show that it usually offers the fastest solution to the considered problem, while demanding a relatively small amount of memory. In particular, it is capable of counting the statistics for short-read human genome data, from an input gzipped FASTQ file, in less than 40 minutes on a PC with 16 GB of RAM and 6 CPU cores, and for long-read human genome data in less than 70 minutes. On a more powerful machine, using 32 GB of RAM and 32 CPU cores, the tasks are accomplished in less than half the time. No other algorithm for most tested settings of this problem and mammalian-size data can accomplish this task in comparable time. Our solution also belongs to the memory-frugal ones; most competitive algorithms cannot efficiently work on a PC with 16 GB of memory for such massive data.

**Conclusions**
By making use of cheap disk space and exploiting CPU and I/O parallelism we propose a very competitive k-mer counting procedure, called KMC. Our results suggest that judicious resource management may make it possible to solve at least some bioinformatics problems with massive data on a commodity personal computer.

Deorowicz, S., Danek, A., Grabowski, Sz., *Genome compression: a novel approach for large collections*,
Bioinformatics,
2013; 29(20):2572–2578, Abstract.

**Motivation**:
Genomic repositories are rapidly growing, as witnessed by the 1000 Genomes or the UK10K projects. Hence, compression of multiple genomes of the same species has become an active research area in the past years. The well-known large redundancy in human sequences is not easy to exploit because of huge memory requirements from traditional compression algorithms.

**Results**:
We show how to obtain a several times higher compression ratio than the best reported results, on two large genome collections (1092 human and 775 plant genomes). Our inputs are variant call format files restricted to their essential fields. More precisely, our novel Ziv–Lempel-style compression algorithm squeezes a single human genome to ~400 KB. The key to high compression is to look for similarities across the whole collection, not just against one reference sequence, as is typical for existing solutions.

**Availability**: http://sun.aei.polsl.pl/tgc (also as Supplementary Material) under a free license.

Roguski, L., Deorowicz, S., *Fluid motion modelling using vortex particle method on GPU*,
Studia Informatica, 2013; 34(1 (110)):31–48, Abstract.

Szczesniak, M.W., Deorowicz, S., Gapski, J., Kaczynski, L., Makalowska, I., *miRNEST database: an integrative approach in microRNA search and annotation*,
Nucleic Acids Research,
2012; 40(D1):D198–D204, Abstract.

Deorowicz, S., *Quadratic-time algorithm for a string constrained LCS problem*,
Information Processing Letters,
2012; 112(11):423–426, Abstract.

Gudys, A., Deorowicz, S., *A Parallel Algorithm for the Constrained Multiple Sequence Alignment Problem Designed for GPUs*,
International Journal of Foundations of Computer Science, 2012; 23(4):877–902, Abstract.

Deorowicz, S., *A Cover-Merging-Based Algorithm for the Longest Increasing Subsequence in a Sliding Window Problem*,
Computing and Informatics, 2012; 31(6):1217–1233, Abstract.

Deorowicz, S., Grabowski, Sz., *Robust relative compression of genomes with random access*,
Bioinformatics,
2011; 27(21):2979–2986, Abstract.

**Motivation**: Storing, transferring and maintaining genomic databases becomes a major challenge because of the rapid progress of DNA sequencing technology and the correspondingly growing pace at which sequencing data are being produced. Efficient compression, with support for extraction of arbitrary snippets of any sequence, is the key to maintaining those huge amounts of data.

**Results**: We present an LZ77-style compression scheme for relative compression of multiple genomes of the same species. While the solution bears similarity to known algorithms, it offers significantly higher compression ratios at compression speed over an order of magnitude greater. In particular, 69 differentially encoded human genomes are compressed over 400 times at fast compression, or even 1000 times at slower compression (the reference genome itself needs much more space). Adding fast random access to text snippets decreases the ratio to ~300.
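The LZ77-style relative compression idea can be sketched with a naive greedy encoder (a toy illustration, not GDC itself; the helper name is hypothetical):

```python
def relative_lz(target, reference, min_match=4):
    # Greedy LZ77-style relative encoding: emit (pos, len) matches against
    # the reference, or literal symbols where no long enough match exists.
    out, i = [], 0
    while i < len(target):
        best_pos, best_len = -1, 0
        for j in range(len(reference)):  # real encoders use an index here
            l = 0
            while (j + l < len(reference) and i + l < len(target)
                   and reference[j + l] == target[i + l]):
                l += 1
            if l > best_len:
                best_pos, best_len = j, l
        if best_len >= min_match:
            out.append(("match", best_pos, best_len))
            i += best_len
        else:
            out.append(("lit", target[i]))
            i += 1
    return out

code = relative_lz("ACGTTTACGT", "ACGTACGT")
```

Genomes of the same species are nearly identical, so almost the whole target collapses into a few long matches; random access is then a matter of decoding only the phrases covering the requested snippet.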

**Availability**: GDC is available at http://sun.aei.polsl.pl/gdc.

Deorowicz, S., Grabowski, Sz., *Compression of DNA sequences in FASTQ format*,
Bioinformatics,
2011; 27(6):860–862, Abstract.

**Motivation**: Modern sequencing instruments are able to generate at least hundreds of millions short reads of genomic data. Those huge volumes of data require effective means to store them, provide quick access to any record and enable fast decompression.

**Results**: We present a specialized compression algorithm for genomic data in FASTQ format which dominates its competitor, G-SQZ, as is shown on a number of datasets from the 1000 Genomes Project (www.1000genomes.org).

**Availability**: DSRC is freely available at http://sun.aei.polsl.pl/dsrc.

Gudyś, A., Deorowicz, S., *A Parallel GPU-Designed Algorithm for the Constrained Multiple Sequence Alignment Problem*,
Chapter in book *Man-Machine Interactions 2*,
Series *Advances in Intelligent and Soft Computing*,
Springer-Verlag Berlin Heidelberg,
Czachórski, T., Kozielski, S., Stańczyk, U., Editors; 2011; 103:361–368, Abstract.

Pecka, P., Deorowicz, S., Nowak, M., *Efficient Representation of Transition Matrix in the Markov Process Modeling of Computer Networks*,
Chapter in book *Man-Machine Interactions 2*,
Series *Advances in Intelligent and Soft Computing*,
Springer-Verlag Berlin Heidelberg,
Czachórski, T., Kozielski, S., Stańczyk, U., Editors; 2011; 103:457–464, Abstract.

Deorowicz, S., *Solving Longest Common Subsequence and Related Problems on Graphical Processing Units*,
Software–Practice and Experience,
2010; 40(8):673–700, Abstract.

Deorowicz, S., *Bit-Parallel Algorithm for the Constrained Longest Common Subsequence Problem*,
Fundamenta Informaticae,
2010; 99(4):409–433, Abstract.

The problem of finding a constrained longest common subsequence (CLCS) of two sequences *A* and *B* with respect to the sequence *P* was introduced recently.
Its goal is to find a longest subsequence *C* of *A* and *B* such that *P* is a subsequence of *C*.
Most of the algorithms solving the CLCS problem are based on dynamic programming.
Bit-parallelism is a technique of using single bits in a machine word for concurrent computation.
We propose the first bit-parallel algorithm computing a CLCS and/or its length which outperforms the other known algorithms in terms of speed.
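Bit-parallelism itself is easiest to see on the classic Shift-And exact matching algorithm, where one machine word tracks all pattern-prefix states at once (an illustration of the technique only, not the paper's CLCS algorithm):

```python
def shift_and(pattern, text):
    # Bit i of `state` is set iff pattern[:i+1] matches a suffix of the
    # text read so far; one shift-and-mask step advances all prefixes
    # simultaneously, one machine word per text symbol.
    m = len(pattern)
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    state, hits = 0, []
    for pos, c in enumerate(text):
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & (1 << (m - 1)):
            hits.append(pos - m + 1)
    return hits
```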

Deorowicz, S., Obstój, J., *Constrained Longest Common Subsequence Computing Algorithms in Practice*,
Computing and Informatics,
2010; 29(3):729–744, Abstract.

The problem of finding a constrained longest common subsequence (CLCS) of two sequences *A* and *B* with respect to the sequence *P* was introduced recently.
Its goal is to find a longest subsequence *C* of *A* and *B* such that *P* is a subsequence of *C*.
There are several algorithms solving the CLCS problem, but there is no real experimental comparison of them.
The paper has two aims.
Firstly, we propose an improvement to the algorithms by Chin *et al.* and Deorowicz based on an entry-exit points technique by He and Arslan.
Secondly, we compare experimentally the existing algorithms for solving the CLCS problem.

Deorowicz, S., *On Some Variants of the Longest Increasing Subsequence Problem*,
Theoretical and Applied Informatics,
2009; 21(3–4):135–148, Abstract.

Deorowicz, S., Grabowski, Sz., *A Hybrid Algorithm for the Longest Common Transposition-Invariant Subsequence Problem*,
Computing and Informatics,
2009; 28(5):729–744, Abstract.

Deorowicz, S., Grabowski, Sz., *On Two Variants of the Longest Increasing Subsequence Problem*,
Chapter in book *Man-Machine Interactions*,
Series *Advances in Intelligent and Soft Computing*,
Springer-Verlag Berlin Heidelberg,
Cyran, K.A., Kozielski, S., Peters J.F., Stańczyk, U., Wakulicz-Deja, A., Editors; 2009; 59:541–549, Abstract.

It is known that the expected length of a longest increasing subsequence of a random sequence is 2*n*^{1/2}+o(*n*^{1/2}).
In the algorithm for SLIS, we show how to gain from modern successor search data
structures, which is not trivial for this task.

Deorowicz, S., *Computing the longest common transposition-invariant subsequence with GPU*,
Chapter in book *Man-Machine Interactions*,
Series *Advances in Intelligent and Soft Computing*,
Springer-Verlag Berlin Heidelberg,
Cyran, K.A., Kozielski, S., Peters J.F., Stańczyk, U., Wakulicz-Deja, A., Editors; 2009; 59:551–559, Abstract.


Deorowicz, S., *An algorithm for solving the longest increasing circular subsequence problem*,
Information Processing Letters,
2009; 109(12):630–634, Abstract.

The longest increasing circular subsequence (LICS) problem is to find a longest increasing subsequence over all rotations of a sequence *S*.
We present an algorithm with O(min(n *l*, n log n + *l*^{3} log n)) time complexity, where *l* is the LICS length, which is competitive with the best known algorithms.
The main ideas of the proposed approach are to deal with the cover representation of the sequence, provide fast merging techniques of the covers, and to skip computation of longest increasing subsequence (LIS) for most rotations.
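The cover representation mentioned above is the structure behind patience sorting; a minimal sketch of LIS via covers, plus a naive all-rotations LICS baseline (the paper's algorithm avoids recomputing the LIS for most rotations):

```python
import bisect

def lis_length(seq):
    # Cover / patience-sorting representation: tails[i] is the smallest
    # possible tail of an increasing subsequence of length i + 1.
    tails = []
    for x in seq:
        i = bisect.bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)  # start a new, longer cover
        else:
            tails[i] = x     # improve the tail of an existing cover
    return len(tails)

def lics_length(seq):
    # Naive baseline: take the best LIS over every rotation.
    return max(lis_length(seq[i:] + seq[:i]) for i in range(len(seq)))
```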

Deorowicz, S., *Algorithms solving the constrained longest common subsequence problem* (in Polish),
In Proceedings of SWD, 2008; 295–303, Abstract.

The constrained longest common subsequence (CLCS) problem for sequences *A*, *B*, *P* consists in finding the longest subsequence *C* that is a subsequence of both *A* and *B*
and also contains *P* as a subsequence. This problem is a generalisation of the widely
known longest common subsequence (LCS) problem and arises primarily in the comparison
of biological sequences (such as DNA and protein sequences). This paper surveys the
existing algorithms for the LCS and CLCS problems, compares the practical performance
of algorithms for the CLCS problem, and discusses prospects for further research on
CLCS algorithms, in particular the possibility of constructing a bit-parallel
algorithm for this problem.
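For illustration, the CLCS problem admits a textbook dynamic-programming formulation in the spirit of the recurrence of Chin et al. The sketch below is a didactic O(mnr) Python version, not any of the algorithms discussed in the paper:

```python
def clcs_length(A, B, P):
    """Length of the longest common subsequence of A and B that
    contains P as a subsequence, or None if no such subsequence exists."""
    # D[i][j][k]: length of the longest common subsequence of A[:i] and
    # B[:j] containing P[:k] as a subsequence (-inf when infeasible)
    NEG = float("-inf")
    m, n, r = len(A), len(B), len(P)
    D = [[[NEG] * (r + 1) for _ in range(n + 1)] for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0][0] = 0
    for j in range(n + 1):
        D[0][j][0] = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            for k in range(r + 1):
                best = max(D[i - 1][j][k], D[i][j - 1][k])
                if A[i - 1] == B[j - 1]:
                    # take the common symbol without advancing in P ...
                    best = max(best, D[i - 1][j - 1][k] + 1)
                    # ... or use it to embed the next symbol of P
                    if k and A[i - 1] == P[k - 1]:
                        best = max(best, D[i - 1][j - 1][k - 1] + 1)
                D[i][j][k] = best
    return D[m][n][r] if D[m][n][r] != NEG else None
```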

Grabowski Sz., Deorowicz, S., *Efficient preprocessing for web log compression*,
International Scientific Journal of Computing,
2008; 7(1):35–42, Abstract.

Deorowicz, S., Grabowski Sz., *Sub-atomic field processing for improved web log compression*,
In Proceedings of 9th International Conference on Modern Problems of Radio Engineering, Telecommunications and Computer Science (TCSET'2008),
2008; 551–556, Abstract.

Full text in pdf

Grabowski Sz., Deorowicz, S., *Nice To Be A Chimera: A Hybrid Algorithm For The Longest Common Transposition-Invariant Subsequence Problem*,
In Proceedings of 9th International Conference on Modern Problems of Radio Engineering, Telecommunications and Computer Science (TCSET'2008),
2008; 50–54, Abstract.

Full text in pdf

Deorowicz, S., *Fast algorithm for the constrained longest common subsequence problem*,
Theoretical and Applied Informatics,
2007; 19(2):91–102, Abstract.

The problem of finding a constrained longest common subsequence (CLCS) of sequences *A* and *B* with respect to the sequence *P* was introduced recently.
Its goal is to find the longest common subsequence of *A* and *B* such that *P* is a subsequence of it.
The best known algorithms for solving it require time proportional to the product of the sequence lengths.
We introduce a novel approach in which the time and space complexities also depend on the number of matches between
*A*, *B*, and *P*.
The time complexity is better for typical parameter settings and never worse.
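The match-based idea, letting the complexity depend on the number of match points rather than on the product of lengths alone, can be illustrated by the classical Hunt–Szymanski-style reduction of plain LCS to LIS over match points. This is a sketch of that general technique, not the paper's CLCS algorithm:

```python
from bisect import bisect_left

def lcs_by_matches(A, B):
    """LCS length of A and B via LIS over the match points (i, j)
    with A[i] == B[j], Hunt–Szymanski style."""
    # for each symbol, the columns of B where it occurs
    occ = {}
    for j, b in enumerate(B):
        occ.setdefault(b, []).append(j)
    # scan rows of A in order; visit each row's match columns in
    # decreasing order so at most one match per row is used, then
    # run the patience/cover LIS on the column values
    tails = []
    for a in A:
        for j in reversed(occ.get(a, [])):
            i = bisect_left(tails, j)
            if i == len(tails):
                tails.append(j)
            else:
                tails[i] = j
    return len(tails)
```

The work is proportional to the number of match points (times a logarithmic factor), which is why such algorithms win when matches are sparse.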

Full text in pdf

Grabowski Sz., Deorowicz, S., *Web log compression*,
AGH Automatyka,
2007; 11(3):417–424, Abstract.

Full text in pdf

Deorowicz, S., *Speeding up Transposition-Invariant String Matching*,
Information Processing Letters,
2006; 100(1):14–20, Abstract.

Full text in pdf

Deorowicz, S., Ciura, M., *Correcting spelling errors by modelling their causes*,
International Journal of Applied Mathematics and Computer Science,
2005; 15(2):275–285, Abstract.

Full text in pdf

Substitution rules used in the paper (txt)

Wikipedia data set used in the work (txt)

Aspell data set used in the work (txt)

Deorowicz, S., *Context exhumation after the Burrows–Wheeler transform*,
Information Processing Letters,
2005; 95(1):313–320, Abstract.

Skibiński, P., Grabowski, Sz., Deorowicz, S., *Revisiting dictionary-based compression*,
Software–Practice and Experience,
2005; 35(15):1455–1476, Abstract.

Full text in pdf

WRT source code and Win executable

WRT-LZ source code and Win executable

Deorowicz, S., *Second step algorithms in the Burrows–Wheeler compression algorithm*,
Software–Practice and Experience,
2002; 32(2):99–111, Abstract.

Full text in pdf

Ciura, M.G., Deorowicz, S., *How to squeeze a lexicon*,
Software–Practice and Experience,
2001; 31(11):1077–1090, Abstract.

Full text in PostScript

Source code in ANSI C

Deorowicz, S., *Improvements to Burrows–Wheeler compression algorithm*,
Software–Practice and Experience,
2000; 30(13):1465–1483, Abstract.

Full text in PostScript

Deorowicz, S., *Solving the Delivery Problem Using k-Optimal Algorithms* (in Polish),
Zeszyty Naukowe Politechniki Śląskiej, Seria Informatyka, 1997; 33(1):139–146, Abstract.

Czech, Z.J., Fabian, P., Deorowicz, S., *Algorytmy i struktury danych. Wybrane zagadnienia* (in Polish),
Wydawnictwo Politechniki Śląskiej,
2007; Abstract.

Deorowicz, S., *Sequential and parallel algorithms for subsequence finding*,
Silesian University of Technology, 2011; Abstract.

Deorowicz, S., *Universal lossless data compression algorithms*,
Silesian University of Technology, 2003; Abstract.

Full text in pdf

Errata in pdf

Deorowicz, S., *Heurystyczne i genetyczne algorytmy rozwiązywania problemu dostawy* (in Polish),
Silesian University of Technology, 1998; Abstract.

Deorowicz, S., Obstój, J., *Constrained Longest Common Subsequence Computing Algorithms in Practice*,
2009; Abstract.

The problem of finding a constrained longest common subsequence (CLCS) of sequences *A* and *B* with respect to the sequence *P* was introduced recently. Its
goal is to find a longest subsequence *C* of *A* and *B* such that *P* is a subsequence
of *C*. There are several algorithms solving the CLCS problem, but no real
experimental comparison of them exists. The paper has two aims. Firstly, we propose an
improvement to the algorithms by Chin et al. and Deorowicz based on an entry–exit
points technique by He and Arslan. Secondly, we experimentally compare the
existing algorithms for solving the CLCS problem.

Full text in pdf

Deorowicz, S., *Fast algorithm for constrained longest common subsequence problem*,
2005; Abstract.

The problem of finding a constrained longest common subsequence (CLCS) of sequences *A* and *B*
with respect to the sequence *P* was introduced recently.
The best known algorithms for solving it require time proportional to the product of the sequence lengths.
We introduce a novel approach in which the time and memory complexities also depend on the number of matches between *A*,
*B*, and *P*.
The time complexity is never worse than that of the known algorithms, but is typically better.
Experiments show that our method is faster than the known ones and requires less memory.

Full text in pdf

Deorowicz, S., *Speeding up Transposition-Invariant String Matching*,
2005; Abstract.

Full text in pdf

Navarro, G., Grabowski, Sz., Mäkinen, V., Deorowicz, S., *Improved Time and Space Complexities for Transposition Invariant String Matching*,
2004; Abstract.

Full text in pdf