-Kontakt -Materiały do zajęć -Materiały arch. -Dyplomy

Silesia compression corpus

Standard lossless data compression corpora

Three well-known data sets are used by researchers in the universal lossless data compression field. The first one, the Calgary corpus, was introduced in 1989 by Bell et al.. The files in the corpus were chosen to cover up the typical types of data used in computer processing. This corpus is rather old, and it contains some types of data which went out of use, but the corpus is still a good benchmark used by many authors. The corpora proposed later, the Canterbury corpus and the large Canterbury corpus, contain files of similar sizes, less than 5 MB.

Over the years of using of these corpora some observations have proven their important disadvantages. The most important in our opinion are:

Silesia corpus

The intention of the Silesia corpus is to provide a data set of files that covers the typical data types used nowadays. The sizes of the files are between 6 MB and 51 MB.

The chosen files are of different types and come from several sources. In our opinion, nowadays the two fastest growing types of data are multimedia and databases. The former are typically compressed with lossy methods so we do not include them in the corpus. The database files, osdb, sao, nci, come from three different fields. The first one is a sample database from an open source project that is intended to be used as a standard, free database benchmark. The second one, sao, is one of the astronomical star catalogues. This is a binary database composed of records of complex structure. The last one, nci, is a part of the chemical database of structures.

The sizes of computer programs are also growing rapidly. The standard corpora include only single, small routines, both in source and object code. Today it is almost impractical to compress every single source code file separately. The projects are composed of hundreds or thousands files, so it is a common habit to compress it all together. We often can achieve a better compression ratio if we compress a concatenated file of similar contents than the small separate ones. This trend is reflected in including a samba file. Besides the source codes, there is also a need to store the executables. We decided to include two files: ooffice and mozilla. The first one is a single medium-sized executable for the Windows system. The second is a concatenation of the whole application for Tru64 Unix system composed of executables, archives, texts, HTML files, and other.

There are also types of images that cannot be compressed loosely—the medical images. The sizes of such files are also huge and we include two examples of them in the corpus. The first file, x-ray, is an X-ray picture of a child's hand. The second file, mr, is a magnetic resonance, three dimensional image of a head.

The standard corpora contain text files. Moreover, these files are typically the largest files of them, but in our opinion there is a need to test the compression efficiency also on the larger ones stored in different file types. We propose three such files. The first, dickens, is a collection of some works by Charles Dickens that can be found in the Project Gutenberg. This is a plain text file. The second one, reymont, is a book Chłopi by Władysław Reymont stored in a PDF file. The PDF files can be internally-compressed but the quality of this build-in compression is rather poor, and much better results can be obtained when we compress an uncompressed PDF file. Because of this we enclose the uncompressed version. The last text file, webster, is an electronic version of The 1913 Webster Unabridged Dictionary taken from the Project Gutenberg. The file is stored in the HTML format. The last file of the new corpus, xml, is a concatenation of 21 XML files. The XML standard is designed to be a universal file format for storing documents, so we decided to enclose it.

Silesia Corpus contents

Filename Description Type Source Raw size [B] Bzipped size [B]
dickens Collected works of Charles Dickens English text Project Gutenberg 10,192,446 2,799,528
mozilla Tarred executables of Mozilla 1.0 (Tru64 UNIX edition) exe Mozilla Project 51,220,480 17,914,392
mr Medical magnetic resonanse image picture Hospital image 9,970,564 2,441,280
nci Chemical database of structures database CACTVS Chemical Information Services at LMC/NCI 33,553,445 1,812,734
ooffice A dll from Open Office.org 1.01 exe Open Office 6,152,192 2,862,526
osdb Sample database in MySQL format from Open Source Database Benchmark database Open Source Database Benchmark Project 10,085,684 2,802,792
reymont Text of the book Chłopi by Władysław Reymont Polish pdf Virtual Library of Polish Literature 6,627,202 1,246,230
samba Tarred source code of Samba 2-2.3 src Samba Project 21,606,400 4,549,790
sao The SAO star catalog bin data Astronomical Catalogs and Catalog Formats 7,251,944 4,940,524
webster The 1913 Webster Unabridged Dictionary html Project Gutenberg 41,458,703 8,644,714
xml Collected XML files html XMLPPM: XML-Conscious PPM Compression 5,345,280 441,186
x-ray X-ray medical picture Hospital image 8,474,240 4,051,112
Total 211,938,580 54,506,808

Silesia Corpus details


Charles Dickens wrote many novels. The file is a concatenation of some, fourteen, of his works that can be found in the Project Gutenberg (A Child's History Of England, All The Year Round: Contributions, American Notes, The Battle Of Life, Bleak House, A Christmas Carol, David Copperfield, Dombey And Son, Doctor Marigold, Going Into Society, George Silverman's Explanation, Barnaby Rudge: a tale of the Riots of 'eighty, The Chimes, The Cricket On The Hearth). The file is a simple text.

MD5: 88334708559f6db57d79096bc0aca07e


A Mozilla 1.0 open source web browser was installed on the Tru64 UNIX operating system and then the contents of the Mozilla.org directory were tarred. There are 525 files of such types as: executables, jar archives, HTML, XML, text, and others.

MD5: c7789a2097f1ff944b0c737430a339b3


A magnetic resonanse medical picture of a head. This file is stored is DICOM format and contains 19 planes.

MD5: 38e623e3093b7bf2003ca4b1bbc19927


The chemical databases of strucures contain information of structures, their components, 2D and/or 3D coordinates, properites, etc. The file is a part of the August 2000 2D File stored in an SDF format which is common file format developped to handle a list of molecular structures associated with properties. The original database is of size 982MB so we had to truncate it to be suitable for a part of the corpus. The 32MB piece (rounded down to the nearest end of the record), we have chosen, is taken from the middle of the original file (starting at the first record after leaving 400MB of data).

MD5: 31f85bc8706f3c921104e7c169e2e2e1


An Open Office is an open source project, which is composed of the word processor, spreadsheet program, presentation maker, and graphical program. The file is a dynamic linked linbrary from version 1.01.

MD5: 573c4ae915e36631d8f2dcffb9b9b66d


An Open Source Database Benchmark is a project invented to provide a free test for database systems. One of the parts of the project are sample databases. The 40MB benchmark was run on the MySQL 3.23 server. The file is one of the MySQL database files, hundred.med.

MD5: e734b0c48e6a982adfb5802da3032ecd


A book Chłopi by Władysław Reymont was honoured the Nobel Price in 1924. The text of the book was taken from the Virtual Library of Polish Literature. Then it was converted to the LaTeX files from which the uncompressed PDF file was produced. The file is uncompressed due to the fact the built-in compression in PDF format is rather poor, and much better results can be obtained when we compress the uncompressed PDF files.

MD5: d8f54d78105079775f32d76dc55fc671


Samba is an open souce project that is intended to be a free alternative to the SMB/CIFS clients. The file contains tarred source code (also documentation, graphics) of the Samba 2.2-3 version.

MD5: 154eaea7ea70e89f6339ff0abf4112ca


There are many star catalogs containing the data of sky objects. The chosen one, SAO catalog, is suitable especially for the amator astronomers. It contains the information about 258,996 stars, and is composed of binary records.

MD5: 79e95a22e18cd82b7e42bf91b380d30b


The 1913 Webster Unabridged Dictionary is an English dictionary stored in a rather simple HTML. The file is a concatenation of files that can be obtained from Project Gutenberg.

MD5: 474931ad907ac27bf962c75ded46c069


The XML is an incomming standard of document format. The importance of XML is still growing. The file is a corpus prepared for XMLPPM: XML-Conscious PPM Compression. This is a concatenation of all, eleven, files.

MD5: 9b09c0c80104adb8aae910b7d7db003e


An X-ray medical picture of child's hand. This is a 12-bit gray scaled image.

MD5: 9baec32ad14ec3eff487d254382cb91c