Deorowicz, S., FQSqueezer: k-mer-based compression of sequencing data, bioRxiv.org, 2019; ():Abstract.
Results: We present FQSqueezer, a novel compression algorithm for sequencing data able to process single- and paired-end reads of variable lengths. It is based on the ideas from the famous prediction by partial matching and dynamic Markov coder algorithms known from the general-purpose-compressors world. The compression ratios are often tens of percent better than offered by the state-of-the-art tools.
Deorowicz, S., Gudys, A., Whisper 2: indel-sensitive short read mapping, bioRxiv.org, 2019; :Abstract.
Deorowicz, S., Walczyszyn, J., Debudaj-Grabysz, A., CoMSA: compression of protein multiple sequence alignment files, Bioinformatics, 2019; 35(2):22–234, Abstract.
Results: We propose a novel compression algorithm, MSAC (Multiple Sequence Alignment Compressor),
designed especially for aligned data. It is based on a generalisation of the positional Burrows–Wheeler
transform for non-binary alphabets. MSAC handles FASTA, as well as Stockholm files. It offers up to six
times better compression ratio than other commonly used compressors, i.e., gzip. Performed experiments
resulted in an analysis of the influence of a protein family size on the compression ratio.
Kawulok, J., Kawulok, M., Deorowicz, S., Environmental metagenome classification for constructing a microbiome fingerprint, Biology Direct, 2019; 14(20):Abstract.
Results: We report the results of our extensive experimental study to investigate the behavior of our method and its sensitivity to different parameters. In our tests, we have followed the protocol of the MetaSUB Challenge, which allowed us to compare the obtained results with the solutions based on taxonomic and functional classification.
Conclusions: The results reported in the paper indicate that our method is competitive with those based on taxonomic classification. Importantly, by measuring the similarity at the reads level, we avoid the necessity of using large databases with annotated gene sequences. Hence our main finding is that environmental classification of metagenomic data can be proceeded without using large databases required for taxonomic or functional classification.
Deorowicz, S., Danek, A., GTShark: Genotype compression in large projects, Bioinformatics, 2019; 35(22):4791–4793, Abstract.
Availability and implementation: https://github.com/refresh-bio/GTShark.
Deorowicz, S., Gudys, A., Dlugosz, M., Kokot, M., Danek, A., Kmer-db: instant evolutionary distance estimation, Bioinformatics, 2019; 35(1):133–136, Abstract.
Availability and implementation:
https://github.com/refresh-bio/kmer-db.
Deorowicz, S., Debudaj-Grabysz, A., Gudys, A., Grabowski, Sz., Whisper: read sorting allows robust robust mapping of DNA sequencing data, Bioinformatics, 2019; 35(12):2043–2050, Abstract.
Results: We present Whisper, an accurate and high-performant mapping tool, based on the idea of sorting reads and then mapping them against suffix arrays for the reference genome and its reverse complement. Employing task and data parallelism as well as storing temporary data on disk result in superior time efficiency at reasonable memory requirements. Whisper excels at large NGS read collections, in particular Illumina reads with typical WGS coverage. The experiments with real data indicate that our solution works in about 15% of the time needed by the well-known BWA-MEM and Bowtie2 tools at a comparable accuracy, validated in a variant calling pipeline.
Deorowicz, S., Grabowski, Sz., DeltaComp: Fast and efficient compression of astronomical timelines, New Astronomy, 2018; 65():59–66, Abstract.
Roguski, L., Ochoa, I., Hernaez, M., Deorowicz, S., FaStore: a space-saving solution for raw sequencing data, Bioinformatics, 2018; 34(16):2748–2756, Abstract.
Results:
FaStore in the lossless mode achieves a significant improvement in compression ratio with respect to previously proposed algorithms. We perform an analysis on the effect that the different lossy modes have on variant calling, the most widely used application for clinical decision making, especially important in the era of precision medicine. We show that lossy compression can offer significant compression gains, while preserving the essential genomic information and without affecting the variant calling performance.
Availability and implementation: FaStore can be downloaded from
https://github.com/refresh-bio/FaStore.
Danek, A., Deorowicz, S., GTC: how to maintain huge genotype collections in a compressed form, Bioinformatics, 2018; 34(11):1834–1840, Abstract.
Results:
We present GTC (GenoType Compressor), a novel compressed data structure for representation of huge collections of genetic variation data. It significantly outperforms existing solutions in terms of compression ratio and time of answering various types of queries. We show that the largest of publicly available database of about 60 000 haplotypes at about 40 million SNPs can be stored in <4 GB, while the queries related to variants are answered in a fraction of a second.
Availability and implementation: GTC can be downloaded from
https://github.com/refresh-bio/GTC or
http://sun.aei.polsl.pl/REFRESH/gtc.
Kokot, M., Deorowicz, S., Długosz, M., Even Faster Sorting of (Not Only) Integers, Chapter in book Man-Machine Interactions 5, Series Advances in Intelligent Systems and Computing, Springer International Publishing Switzerland, Gruca, A., Czachorski, T., Harezlak, K., Kozielski, S., Piotrowska, A., Editors; 2018; 659:481–491, Abstract.
Długosz, M., Deorowicz, S., Kokot, M., Improvements in DNA Reads Correction, Chapter in book Man-Machine Interactions 5, Series Advances in Intelligent Systems and Computing, Springer International Publishing Switzerland, Gruca, A., Czachorski, T., Harezlak, K., Kozielski, S., Piotrowska, A., Editors; 2018; 659:115–124, Abstract.
Kokot, M., Długosz, M., Deorowicz, S., KMC 3: counting and manipulating k -mer statistics, Bioinformatics, 2017; 33(17):2759–2761, Abstract.
Availability: Program is freely available at http://sun.aei.polsl.pl/REFRESH/kmc.
Gudys, A., Deorowicz, S., QuickProbs 2: towards rapid construction of high-quality alignments of large protein families, Scientific Reports, 2017; 7(41553):Abstract.
Maciej Długosz, M., Deorowicz, S., RECKONER: read error corrector based on KMC, Bioinformatics, 2017; 33(7):1086–1089, Abstract.
Availability and Implementation: Program is freely available at http://sun.aei.polsl.pl/REFRESH/reckoner.
Contact: sebastian.deorowicz@polsl.pl
Supplementary information: Supplementary data are available at Bioinformatics online.
Grabowski, Sz., Raniszewski, M., Deorowicz, S., FM-index for Dummies, Chapter in book Beyond Databases, Architectures, and Structures, Series Communications in Computer and Information Science, Springer International Publishing Switzerland, Kozielski, S., Mrozek, D., Kasprowski, P., Malysiak-Mrozek, B., Kostrzewa, D., Editors; 2017; 718:189–201, Abstract.
Kokot, M., Deorowicz, S., Debudaj-Grabysz, A., Sorting Data on Ultra-Large Scale with RADULS: New Incarnation of Radix Sort, Chapter in book Beyond Databases, Architectures, and Structures, Series Communications in Computer and Information Science, Springer International Publishing Switzerland, Kozielski, S., Mrozek, D., Kasprowski, P., Malysiak-Mrozek, B., Kostrzewa, D., Editors; 2017; 718:235–245, Abstract.
Deorowicz, S., Debudaj-Grabysz, A., Gudys, A., FAMSA: Fast and accurate multiple sequence alignment of huge protein families, Scientific Reports, 2016; 6(33964):Abstract.
Deorowicz, S., Grabowski, Sz., Ochoa, I., Hernaez, M., Weissman, T., Comment on: "ERGC: An efficient referential genome compression algorithm", Bioinformatics, 2016; 32(7):1115–1117, Abstract.
Results: We evaluated ERGC and the previously proposed algorithms GDC and iDoComp,
which are the ones used in the original paper for comparison, on a wide data set including 12 assemblies
of human genome (instead of only four of them in the original paper).
ERGC wins only when one of the genomes (referential or target) contains mixed-cased letters (which is the case
for only the two Korean genomes).
In all other cases ERGC is on average an order of magnitude worse than GDC and iDoComp.
Kabza, M., Kubiak, M.R., Danek, A., Rosikiewicz, W., Deorowicz, S., Polański, A., Makałowska, I., Inter-population Differences in Retrogene Loss and Expression in Humans, PLOS Genetics, 2015; 11(10):e1005579, Abstract.
Kowalski, T., Grabowski, Sz., Deorowicz, S., Indexing arbitrary-length k-mers in sequencing reads, PLOS ONE, 2015; 10(7):1–16, Abstract.
Deorowicz, S., Danek, A., Niemiec, M., GDC 2: Compression of large collections of genomes, Scientific Reports, 2015; 5(11565):1–12, Abstract.
Kawulok, J., Deorowicz, S., CoMeta: Classification of Metagenomes Using k-mers, PLOS ONE, 2015; 10(4):1–23, Abstract.
Deorowicz, S., Kokot, M., Grabowski, Sz., Debudaj-Grabysz, A., KMC 2: Fast and resource-frugal k-mer counting, Bioinformatics, 2015; 31:1569–1576, Abstract.
Results: We present a novel method for k-mer counting, on large
datasets about twice faster than the strongest competitors (Jellyfish
2, KMC 1), using about 12GB (or less) of RAM. Our disk-based
method bears some resemblance to MSPKmerCounter, yet replacing
the original minimizers with signatures (a carefully selected subset
of all minimizers) and using (k; x)-mers allows to significantly reduce
the I/O, and a highly parallel overall architecture allows to achieve
unprecedented processing speeds. For example, KMC 2 counts the
28-mers of a human reads collection with 44-fold coverage (106GB
of compressed size) in about 20 minutes, on a 6-core Intel i7 PC with
an SSD.
Availability: KMC 2 is freely available at http://sun.aei.polsl.pl/kmc.
Grabowski, Sz., Deorowicz, S., Roguski, L., Disk-based compression of data from genome sequencing, Bioinformatics, 2015; 31:1389–1395, Abstract.
Results: We propose ORCOM (Overlapping Reads COmpression with Minimizers), a compression algorithm dedicated to sequencing reads (DNA only).
Our method makes use of a conceptually simple and easily parallelizable idea of minimizers, to obtain 0.317 bits per base as the compression ratio, allowing to fit the 134.0 Gb dataset into only 5.31GB of space.
Availability: http://sun.aei.polsl.pl/orcom under a free license.
Supplementary data: available at Bioinformatics online.
Danek, A., Deorowicz, S., Grabowski, Sz., Indexes of Large Genome Collections on a PC, PLOS ONE, 2014; 9(10):e109384, Abstract.
Deorowicz, S., Grabowski, Sz., Efficient algorithms for the longest common subsequence in k-length substrings, Information Processing Letters, 2014; 114(11):634–638, Abstract.
Roguski, L., Deorowicz, S., DSRC 2: Industry-oriented compression of FASTQ files, Bioinformatics, 2014; 30(15):2213–2215, Abstract.
Availability and Implementation:
DSRC 2 is freely available at http://sun.aei.polsl.pl/dsrc.
The package contains: command-line compressor, C++ and Python libraries for easy integration with existing software,
technical documentation with examples of usage.
Gudys, A., Deorowicz, S., QuickProbs – A Fast Multiple Sequence Alignment Algorithm Designed for Graphics Processors, PLOS ONE, 2014; 9(2):e88901, Abstract.
Wieczorek, B., Polomski, M., Pecka, P., Deorowicz, S., An Effective Way of Storing and Accessing Very Large Transition Matrices Using Multi-core CPU and GPU Architectures, Chapter in book Beyond Databases, Architectures, and Structures, Series Communications in Computer and Information Science, Springer International Publishing Switzerland, Kozielski, S., Mrozek, D., Kasprowski, P., Malysiak-Mrozek, B., Kostrzewa, D., Editors; 2014; 424:323–334, Abstract.
Kawulok, J., Deorowicz, S., An Improved Algorithm for Fast and Accurate Classification of Sequences, Chapter in book Beyond Databases, Architectures, and Structures, Series Communications in Computer and Information Science, Springer International Publishing Switzerland, Kozielski, S., Mrozek, D., Kasprowski, P., Malysiak-Mrozek, B., Kostrzewa, D., Editors; 2014; 424:335–344, Abstract.
Danek, A., Deorowicz, S., Bit-Parallel Algorithm for the Block Variant of the Merged Longest Common Subsequence Problem, Chapter in book Man-Machine Interactions 3, Series Advances in Intelligent Systems and Computing, Springer-Verlag Berlin Heidelberg, Gruca, A., Czachórski, T., Kozielski, S., Editors; 2014; 242:173–181, Abstract.
Deorowicz, S., Debudaj-Grabysz, A., Gudyś, A., Kalign-LCS – A More Accurate and Faster Variant of Kalign2 Algorithm for the Multiple Sequence Alignment Problem, Chapter in book Man-Machine Interactions 3, Series Advances in Intelligent Systems and Computing, Springer-Verlag Berlin Heidelberg, Gruca, A., Czachórski, T., Kozielski, S., Editors; 2014; 242:495–502, Abstract.
Deorowicz, S., Grabowski, Sz., Subcubic Algorithms for the Sequence Excluded LCS Problem, Chapter in book Man-Machine Interactions 3, Series Advances in Intelligent Systems and Computing, Springer-Verlag Berlin Heidelberg, Gruca, A., Czachórski, T., Kozielski, S., Editors; 2014; 242:503–510, Abstract.
Susik, R., Grabowski, Sz., Deorowicz, S., Fast and Simple Circular Pattern Matching, Chapter in book Man-Machine Interactions 3, Series Advances in Intelligent Systems and Computing, Springer-Verlag Berlin Heidelberg, Gruca, A., Czachórski, T., Kozielski, S., Editors; 2014; 242:537–544, Abstract.
Deorowicz, S., Danek, A., Bit-Parallel Algorithms for the Merged Longest Common Subsequence Problem, International Journal of Foundations of Computer Science, 2013; 24(8):1281–1298, Abstract.
Deorowicz, S., Grabowski, Sz., Data compression for sequencing data, Algorithms for Molecular Biologgy, 2013; 8():Article no. 25, Abstract.
Deorowicz, S., Debudaj-Grabysz, A., Grabowski, Sz., Disk-based k-mer counting on a PC, BMC Bioinformatics, 2013; 14():Article no. 160, Abstract.
Results
We propose a simple, yet efficient, parallel disk-based algorithm for counting k-mers. Experiments show that it usually offers the fastest solution to the considered problem, while demanding a relatively small amount of memory. In particular, it is capable of counting the statistics for short-read human genome data, in input gzipped FASTQ file, in less than 40 minutes on a PC with 16 GB of RAM and 6 CPU cores, and for long-read human genome data in less than 70 minutes. On a more powerful machine, using 32 GB of RAM and 32 CPU cores, the tasks are accomplished in less than half the time. No other algorithm for most tested settings of this problem and mammalian-size data can accomplish this task in comparable time. Our solution also belongs to memory-frugal ones; most competitive algorithms cannot efficiently work on a PC with 16 GB of memory for such massive data.
Conclusions
By making use of cheap disk space and exploiting CPU and I/O parallelism we propose a very competitive k-mer counting procedure, called KMC. Our results suggest that judicious resource management may allow to solve at least some bioinformatics problems with massive data on a commodity personal computer.
Deorowicz, S., Danek, A., Grabowski, Sz., Genome compression: a novel approach for large collections, Bioinformatics, 2013; 29(20):2572–2578, Abstract.
Results:
We show how to obtain several times higher compression ratio than of the best reported results, on two large genome collections (1092 human and 775 plant genomes). Our inputs are variant call format files restricted to their essential fields. More precisely, our novel Ziv-Lempel-style compression algorithm squeezes a single human genome to ?400 KB. The key to high compression is to look for similarities across the whole collection, not just against one reference sequence, what is typical for existing solutions.
Availability: http://sun.aei.polsl.pl/tgc (also as Supplementary Material) under a free license.
Roguski, L., Deorowicz, S., Fluid motion modelling using vortex particle method on GPU, Studia Informatica, 2013; 34(1 (110)):31–48, Abstract.
Szczesniak, M.W., Deorowicz, S., Gapski, J., Kaczynski, L., Makalowska, I., miRNEST database: an integrative approach in microRNA search and annotation, Nucleic Acids Research, 2012; 40(D1):D198–D204, Abstract.
Deorowicz, S., Quadratic-time algorithm for a string constrained LCS problem, Information Processing Letters, 2012; 112(11):423–426, Abstract.
Gudys, A., Deorowicz, S., A Parallel Algorithm for the Constrained Multiple Sequence Alignment Problem Designed for GPUs, International Journal of Foundations of Computer Science, 2012; 23(4):877–902, Abstract.
Deorowicz, S., A Cover-Merging-Based Algorithm for the Longest Increasing Subsequence in a Sliding Window Problem, Computing and Informatics, 2012; 31(6):1217–1233, Abstract.
Deorowicz, S., Grabowski, Sz., Robust relative compression of genomes with random access, Bioinformatics, 2011; 27(21):2979–2986, Abstract.
Results: We present an LZ77-style compression scheme for relative compression of multiple genomes of the same species. While the solution bears similarity to known algorithms, it offers significantly higher compression ratios at compression speed over an order of magnitude greater. In particular, 69 differentially encoded human genomes are compressed over 400 times at fast compression, or even 1000 times at slower compression (the reference genome itself needs much more space). Adding fast random access to text snippets decreases the ratio to ~300.
Availability: GDC is available at http://sun.aei.polsl.pl/gdc.
Deorowicz, S., Grabowski, Sz., Compression of DNA sequences in FASTQ format, Bioinformatics, 2011; 27(6):860–862, Abstract.
Results: We present a specialized compression algorithm for genomic data in FASTQ format which dominates its competitor, G-SQZ, as is shown on a number of datasets from the 1000 Genomes Project (www.1000genomes.org).
Availability: DSRC is freely available at http://sun.aei.polsl.pl/dsrc.
Gudyś, A., Deorowicz, S., A Parallel GPU-Designed Algorithm for the Constrained Multiple Sequence Alignment Problem, Chapter in book Man-Machine Interactions 2, Series Advances in Intelligent and Soft Computing, Springer-Verlag Berlin Heidelberg, Czachórski, T., Kozielski, S., Stańczyk, U., Editors; 2011; 103:361–368, Abstract.
Pecka, P., Deorowicz, S., Nowak, M., Efficient Representation of Transition Matrix in the Markov Process Modeling of Computer Networks, Chapter in book Man-Machine Interactions 2, Series Advances in Intelligent and Soft Computing, Springer-Verlag Berlin Heidelberg, Czachórski, T., Kozielski, S., Stańczyk, U., Editors; 2011; 103:457–464, Abstract.
Deorowicz, S., Solving Longest Common Subsequence and Related Problems on Graphical Processing Units, Software–Practice and Experience, 2010; 40(8):673–700, Abstract.
Deorowicz, S., Bit-Parallel Algorithm for the Constrained Longest Common Subsequence Problem, Fundamenta Informaticae, 2010; 99(4):409–433, Abstract.
Deorowicz, S., Obstój, J., Constrained Longest Common Subsequence Computing Algorithms in Practice, Computing and Informatics, 2010; 29(3):729–744, Abstract.
Deorowicz, S., On Some Variants of the Longest Increasing Subsequence Problem, Theoretical and Applied Informatics, 2009; 21(3–4):135–148, Abstract.
Deorowicz, S., Grabowski, Sz., A Hybrid Algorithm for the Longest Common Transposition-Invariant Subsequence Problem, Computing and Informatics, 2009; 28(5):729–744, Abstract.
Deorowicz, S., Grabowski, Sz., On Two Variants of the Longest Increasing Subsequence Problem, Chapter in book Man-Machine Interactions, Series Advances in Intelligent and Soft Computing, Springer-Verlag Berlin Heidelberg, Cyran, K.A., Kozielski, S., Peters J.F., Stańczyk, U., Wakulicz-Deja, A., Editors; 2009; 59:541–549, Abstract.
Deorowicz, S., Computing the longest common transposition-invariant subsequence with GPU, Chapter in book Man-Machine Interactions, Series Advances in Intelligent and Soft Computing, Springer-Verlag Berlin Heidelberg, Cyran, K.A., Kozielski, S., Peters J.F., Stańczyk, U., Wakulicz-Deja, A., Editors; 2009; 59:551–559, Abstract.
Deorowicz, S., An algorithm for solving the longest increasing circular subsequence problem, Information Processing Letters, 2009; 109(12):630–634, Abstract.
Deorowicz, S., Algorithms solving the constrained longest common subsequence problem (in Polish), In Proceedings of SWD, 2008; 295–303, Abstract.
Grabowski Sz., Deorowicz, S., Efficient preprocessing for web log compression, International Scientific Journal of Computing, 2008; 7(1):35–42, Abstract.
Deorowicz, S., Grabowski Sz., Sub-atomic field processing for improved web log compression, In Proceedings of 9th International Conference on Modern Problems of Radio Engineering, Telecommunications and Computer Science (TCSET'2008), 2008; 551–556, Abstract.
Full text in pdf
Grabowski Sz., Deorowicz, S., Nice To Be A Chimera: A Hybrid Algorithm For The Longest Common Transposition-Invariant Subsequence Problem, In Proceedings of 9th International Conference on Modern Problems of Radio Engineering, Telecommunications and Computer Science (TCSET'2008), 2008; 50–54, Abstract.
Full text in pdf
Deorowicz, S., Fast algorithm for the constrained longest common subsequence problem, Theoretical and Applied Informatics, 2007; 19(2):91–102, Abstract.
Full text in pdf
Grabowski Sz., Deorowicz, S., Web log compression, AGH Automatyka, 2007; 11(3):417–424, Abstract.
Full text in pdf
Deorowicz, S., Speeding up Transposition-Invariant String Matching, Information Processing Letters, 2006; 100(1):14–20, Abstract.
Full text in pdf
Deorowicz, S., Ciura, M., Correcting spelling errors by modelling their causes, International Journal of Applied Mathematics and Computer Science, 2005; 15(2):275–285, Abstract.
Full text in pdf
Substitution rules used in the paper (txt)
Wikipedia data set used in the work (txt)
Aspell data set used in the work (txt)
Deorowicz, S., Context exhumation after the Burrows–Wheeler transform, Information Processing Letters, 2005; 95(1):313–320, Abstract.
Skibiński, P., Grabowski, Sz., Deorowicz, S., Revisiting dictionary-based compression, Software–Practice and Experience, 2005; 35(15):1455–1476, Abstract.
Full text in pdf
WRT source code and Win executable
WRT-LZ source code and Win executable
Deorowicz, S., Second step algorithms in the Burrows–Wheeler compression algorithm, Software–Practice and Experience, 2002; 32(2):99–111, Abstract.
Full text in pdf
Ciura, M.G., Deorowicz, S., How to squeeze a lexicon, Software–Practice and Experience, 2001; 31(11):1077–1090, Abstract.
Full text in PostScript
Source code in ANSI C
Deorowicz, S., Improvements to Burrows–Wheeler compression algorithm, Software–Practice and Experience, 2000; 30(13):1465–1483, Abstract.
Full text in PostScript
Deorowicz, S., Solving the Delivery Problem Using k-Optimal Algorithms (in Polish), Zeszyty Naukowe Politechniki Śląskiej, Seria Informatyka, 1997; 33(1):139–146, Abstract.
Czech, Z.J., Fabian, P., Deorowicz, S., Algorytmy i struktury danych. Wybrane zagadnienia (in Polish), Wydawnictwo Politechniki Śląskiej, 2007; Abstract.
Deorowicz, S., Sequential and parallel algorithms for subsequence finding, Silesian University of Technology, 2011; Abstract.
Deorowicz, S., Universal lossless data compression algorithms, Silesian University of Technology, 2003; Abstract.
Full text in pdf
Errata in pdf
Deorowicz, S., Heurystyczne i genetyczne algorytmy rozwiązywania problemu dostawy (in Polish), Silesian University of Technology, 1998; Abstract.
Deorowicz, S., Obstój, J., Constrained Longest Common Subsequence Computing Algorithms in Practice, 2009; Abstract.
Full text in pdf
Deorowicz, S., Fast algorithm for constrained longest common subsequence problem, 2005; Abstract.
Full text in pdf
Deorowicz, S., Speeding up Transposition-Invariant String Matching, 2005; Abstract.
Full text in pdf
Navarro, G., Grabowski, Sz., Mäkinen, V., Deorowicz, S., Improved Time and Space Complexities for Transposition Invariant String Matching, 2004; Abstract.
Full text in pdf