REFRESH Bioinformatics Group


KMC—What is it?

KMC—K-mer Counter is a utility designed for counting k-mers (sequences of consecutive k symbols) in a set of reads from genome sequencing projects. K-mer counting is important for many bioinformatics applications, e.g., developing de Bruijn graph assemblers. Building de Bruijn graphs is a commonly used approach for genome assembly with data from second-generation sequencer. Unfortunately, sequencing errors (frequent in practice) results in huge memory requirements for de Bruijn graphs, as well as long build time. One of the popular approaches to handle this problem is filtering the input reads in such a way that unique k-mers (very likely obtained as a result of an error) are discarded.

Thus, KMC scans the raw reads and produces a compact representation of all non-unique reads accompanied with number of their occurrences. The algorithm implemented in KMC makes use mostly of disk space rather than RAM, which allows to use KMC even on rather typical personal computers. When run at high-end server (what is necessary for KMC competitors) it outperforms them in both memory requirements and speed of computation. The disk space necessary for computation is in order of the size of input data (usually it is smaller).

The architecture of KMC

KMC is designed as a C++ library that can be used by various applications. This class was used to provide also an executable for Linux and Windows systems. There is also some API to access a database of counter k-mers.

How good is KMC?

Experiments show that it offers both faster and much more memory frugal solution to the k-mer counting problem than recently proposed methods. In particular, it is capable to count the statistics for human genome data, in input gzipped FASTQ file over 106 GB in size (44-fold coverage), in about 20 minutes on a PC with 6-core CPU and SSD.

KMC tools

KMC tools is a package that allows to perform a number of operations on databases of k-mers. Some examples are: set operations, filtering, new dump. More details can be found in the documentation of KMC tools.


+ Deorowicz, S., Kokot, M., Grabowski, Sz., Debudaj-Grabysz, A., KMC 2: Fast and resource-frugal k-mer counting, Bioinformatics, 2015; doi: 10.1093/bioinformatics/btv022():Abstract.

Motivation: Building the histogram of occurrences of every k-symbol long substring of nucleotide data is a standard step in many bioinformatics applications, known under the name of k-mer counting. Its applications include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection. The tremendous amounts of NGS data require fast algorithms for k-mer counting, preferably using moderate amounts of memory.
Results: We present a novel method for k-mer counting, on large datasets about twice faster than the strongest competitors (Jellyfish 2, KMC 1), using about 12GB (or less) of RAM. Our disk-based method bears some resemblance to MSPKmerCounter, yet replacing the original minimizers with signatures (a carefully selected subset of all minimizers) and using (k; x)-mers allows to significantly reduce the I/O, and a highly parallel overall architecture allows to achieve unprecedented processing speeds. For example, KMC 2 counts the 28-mers of a human reads collection with 44-fold coverage (106GB of compressed size) in about 20 minutes, on a 6-core Intel i7 PC with an SSD.
Availability: KMC 2 is freely available at

+ Deorowicz, S., Debudaj-Grabysz, A., Grabowski, Sz., Disk-based k-mer counting on a PC, BMC Bioinformatics, 2013; 14():Article no. 160, Abstract.

Background The k-mer counting problem, which is to build the histogram of occurrences of every k-symbol long substring in a given text, is important for many bioinformatics applications. They include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection.
Results We propose a simple, yet efficient, parallel disk-based algorithm for counting k-mers. Experiments show that it usually offers the fastest solution to the considered problem, while demanding a relatively small amount of memory. In particular, it is capable of counting the statistics for short-read human genome data, in input gzipped FASTQ file, in less than 40 minutes on a PC with 16 GB of RAM and 6 CPU cores, and for long-read human genome data in less than 70 minutes. On a more powerful machine, using 32 GB of RAM and 32 CPU cores, the tasks are accomplished in less than half the time. No other algorithm for most tested settings of this problem and mammalian-size data can accomplish this task in comparable time. Our solution also belongs to memory-frugal ones; most competitive algorithms cannot efficiently work on a PC with 16 GB of memory for such massive data.
Conclusions By making use of cheap disk space and exploiting CPU and I/O parallelism we propose a very competitive k-mer counting procedure, called KMC. Our results suggest that judicious resource management may allow to solve at least some bioinformatics problems with massive data on a commodity personal computer.