KMC—K-mer Counter is a utility designed for counting k-mers (sequences of consecutive k symbols) in a set of reads from genome sequencing projects. K-mer counting is important for many bioinformatics applications, e.g., developing de Bruijn graph assemblers. Building de Bruijn graphs is a commonly used approach for genome assembly with data from second-generation sequencer. Unfortunately, sequencing errors (frequent in practice) results in huge memory requirements for de Bruijn graphs, as well as long build time. One of the popular approaches to handle this problem is filtering the input reads in such a way that unique k-mers (very likely obtained as a result of an error) are discarded.
Thus, KMC scans the raw reads and produces a compact representation of all non-unique reads accompanied with number of their occurrences. The algorithm implemented in KMC makes use mostly of disk space rather than RAM, which allows to use KMC even on rather typical personal computers. When run at high-end server (what is necessary for KMC competitors) it outperforms them in both memory requirements and speed of computation. The disk space necessary for computation is in order of the size of input data (usually it is smaller).
KMC is designed as a C++ library that can be used by various applications. This class was used to provide also an executable for Linux and Windows systems. There is also some API to access a database of counter k-mers.
Experiments show that it offers both faster and much more memory frugal solution to the k-mer counting problem than recently proposed methods. In particular, it is capable to count the statistics for human genome data, in input gzipped FASTQ file over 106 GB in size (44-fold coverage), in about 20 minutes on a PC with 6-core CPU and SSD.
KMC tools is a package that allows to perform a number of operations on databases of k-mers. Some examples are: set operations, filtering, new dump. More details can be found in the documentation of KMC tools.
Results: We present a novel method for k-mer counting, on large datasets about twice faster than the strongest competitors (Jellyfish 2, KMC 1), using about 12GB (or less) of RAM. Our disk-based method bears some resemblance to MSPKmerCounter, yet replacing the original minimizers with signatures (a carefully selected subset of all minimizers) and using (k; x)-mers allows to significantly reduce the I/O, and a highly parallel overall architecture allows to achieve unprecedented processing speeds. For example, KMC 2 counts the 28-mers of a human reads collection with 44-fold coverage (106GB of compressed size) in about 20 minutes, on a 6-core Intel i7 PC with an SSD.
Availability: KMC 2 is freely available at http://sun.aei.polsl.pl/kmc.
Results We propose a simple, yet efficient, parallel disk-based algorithm for counting k-mers. Experiments show that it usually offers the fastest solution to the considered problem, while demanding a relatively small amount of memory. In particular, it is capable of counting the statistics for short-read human genome data, in input gzipped FASTQ file, in less than 40 minutes on a PC with 16 GB of RAM and 6 CPU cores, and for long-read human genome data in less than 70 minutes. On a more powerful machine, using 32 GB of RAM and 32 CPU cores, the tasks are accomplished in less than half the time. No other algorithm for most tested settings of this problem and mammalian-size data can accomplish this task in comparable time. Our solution also belongs to memory-frugal ones; most competitive algorithms cannot efficiently work on a PC with 16 GB of memory for such massive data.
Conclusions By making use of cheap disk space and exploiting CPU and I/O parallelism we propose a very competitive k-mer counting procedure, called KMC. Our results suggest that judicious resource management may allow to solve at least some bioinformatics problems with massive data on a commodity personal computer.