To repeat the experiments in the paper, please follow the steps below.

**************************************************************************
1. Downloading the datasets

Please download and unpack the following files:

a) E. coli
   ftp://webdata:webdata@ussd-ftp.illumina.com/Data/SequencingRuns/MG1655/MiSeq_Ecoli_MG1655_110721_PF_R1.fastq.gz
   ftp://webdata:webdata@ussd-ftp.illumina.com/Data/SequencingRuns/MG1655/MiSeq_Ecoli_MG1655_110721_PF_R2.fastq.gz
b) Reads from the human genome
   http://crac.inria.fr/reads/GRCh37-mutated-75-simulated-errors-42M.7za
c) C. elegans
   http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR065/SRR065390/SRR065390_1.fastq.gz
   http://ftp.sra.ebi.ac.uk/vol1/fastq/SRR065/SRR065390/SRR065390_2.fastq.gz

**************************************************************************
2. Downloading and installing the examined programs

2.1. PgSA
a) Download pgsa.tar.gz from
   http://sun.aei.polsl.pl/REFRESH/pgsa/downloads/1.0/pgsa.tar.gz
b) Unpack the archive.
c) Compile it using:
   make all
d) Paths to the programs used in the experiments:
   dist/Release-PgSAgen/GNU-Linux-x86/PgSAgen
   dist/Release-PgSAtest/GNU-Linux-x86/PgSAtest

2.2. CGkA (benchmark)
a) Download cgka-test.tar.gz from
   http://sun.aei.polsl.pl/REFRESH/pgsa/downloads/cgka-test.tar.gz
b) Unpack the archive.
c) Compile it using:
   make
d) Paths to the programs used in the experiments:
   ./builder
   ./cgk-test

2.3. GkArrays (benchmark)
a) Download libGkArrays-2.1.0.tar.gz from
   https://gforge.inria.fr/frs/download.php/33249/libGkArrays-2.1.0.tar.gz
b) Install the library on your machine according to the steps in the
   'Installation from the source code' section.
c) Download gka-test.tar.gz from
   http://sun.aei.polsl.pl/REFRESH/pgsa/downloads/gka-test.tar.gz
d) Unpack the archive.
e) Compile it using:
   make
f) Path to the program used in the experiments:
   ./bin/release/gk-test

2.4. CRAC
a) Download the CRAC 1.3.2 source code from
   http://crac.gforge.inria.fr/index.php?post=crac-132-released
b) Follow the instructions at
   http://crac.gforge.inria.fr/index.php?id=start
   to compile it.
c) For the experiments, download the CRAC index from
   http://crac.gforge.inria.fr/index.php?id=genomes-reads
   Human genome (GRCh37): GRCh37.ssa (1.9 GiB), GRCh37.conf (295 B)

2.5. CRAC with PgSA
a) Download and extract CRAC 1.3.2 as described above.
b) Download pgsa_patch_for_crac1.3.2.tar.gz from
   http://sun.aei.polsl.pl/REFRESH/pgsa/downloads/pgsa_patch_for_crac1.3.2.tar.gz
c) Extract pgsa_patch_for_crac1.3.2.tar.gz into the CRAC/s directory,
   overwriting some of CRAC's files.
d) To compile, use:
   ./configure
   make

**************************************************************************
3. Performing experiments

3.1. Preprocessing the datasets (with Python 2.x scripts)
(i.e. conversion from fasta and fastq to an EOL-terminated sequence of
reads; EOL = chr(0x0A))
a) Download fast2dna.tar.gz from
   http://sun.aei.polsl.pl/REFRESH/pgsa/downloads/fast2dna.tar.gz
b) Paths to the programs used for conversion:
   ./fastq2dna/fastq2dna.py
   ./fasta2dna/fasta2dna.py
c) Join the E. coli datasets into one EOL-terminated sequence of reads:
   cp MiSeq_Ecoli_MG1655_110721_PF_R1.fastq ecoli_r.fastq
   cat MiSeq_Ecoli_MG1655_110721_PF_R2.fastq >> ecoli_r.fastq
   python fastq2dna.py ecoli_r.fastq ecoli.reads
d) Convert the reads from the human genome into an EOL-terminated
   sequence of reads:
   python fasta2dna.py GRCh37-mutated-75-simulated-errors-42M.fasta human.reads
e) Join the C. elegans datasets into one EOL-terminated sequence of reads:
   cp SRR065390_1.fastq elegans.fastq
   cat SRR065390_2.fastq >> elegans.fastq
   python fastq2dna.py elegans.fastq elegans.reads

3.2. PgSA
a) Generate the PgSA indexes using the 'PgSAgen' program:
   ./PgSAgen ecoli.reads ecoli
   ./PgSAgen -r 4 ecoli.pgsa ecoli_s4
   ./PgSAgen -r 6 ecoli.pgsa ecoli_s6
   ./PgSAgen human.reads human
   ./PgSAgen -r 4 human.pgsa human_s4
   ./PgSAgen -r 6 human.pgsa human_s6
   (NOTE: C. elegans was tested with cache for queries Q2, Q4, and Q6)
   ./PgSAgen -c elegans.reads elegans
   ./PgSAgen -r 2 elegans.pgsa elegans_s2
   ./PgSAgen -r 3 elegans.pgsa elegans_s3
   ./PgSAgen -r 4 elegans.pgsa elegans_s4
   ./PgSAgen -r 5 elegans.pgsa elegans_s5
   ./PgSAgen -r 6 elegans.pgsa elegans_s6
b) Run the benchmarks using the 'PgSAtest' program:
   (NOTE: queries by position (-p switch); the -f switch is used for
   compatibility with CGkA because of a problem with supporting
   TTTT..TTT queries)
   ./PgSAtest -p -f ecoli.pgsa
   ./PgSAtest -p -f ecoli_s4.pgsa
   ./PgSAtest -p -f ecoli_s6.pgsa
   ./PgSAtest -p -f human.pgsa
   ./PgSAtest -p -f human_s4.pgsa
   ./PgSAtest -p -f human_s6.pgsa
   (NOTE: queries by sequence)
   ./PgSAtest -f ecoli.pgsa
   ./PgSAtest -f ecoli_s4.pgsa
   ./PgSAtest -f ecoli_s6.pgsa
   ./PgSAtest -f human.pgsa
   ./PgSAtest -f human_s4.pgsa
   ./PgSAtest -f human_s6.pgsa
   (NOTE: queries for C. elegans by sequence using cache (-c switch))
   ./PgSAtest -c elegans.pgc elegans.pgsa
   ./PgSAtest -c elegans.pgc elegans_s2.pgsa
   ./PgSAtest -c elegans.pgc elegans_s3.pgsa
   ./PgSAtest -c elegans.pgc elegans_s4.pgsa
   ./PgSAtest -c elegans.pgc elegans_s5.pgsa
   ./PgSAtest -c elegans.pgc elegans_s6.pgsa

3.3. CGkA
a) Generate the CGkA indexes using the 'builder' program:
   ./builder -v -k 11 -s 6 ecoli.reads ecoli_s6_k11
   ./builder -v -k 16 -s 6 ecoli.reads ecoli_s6_k16
   ./builder -v -k 22 -s 6 ecoli.reads ecoli_s6_k22
   ./builder -v -k 11 -s 25 ecoli.reads ecoli_s25_k11
   ./builder -v -k 16 -s 25 ecoli.reads ecoli_s25_k16
   ./builder -v -k 22 -s 25 ecoli.reads ecoli_s25_k22
   ./builder -v -k 11 -s 512 ecoli.reads ecoli_s512_k11
   ./builder -v -k 16 -s 512 ecoli.reads ecoli_s512_k16
   ./builder -v -k 22 -s 512 ecoli.reads ecoli_s512_k22
   ./builder -v -k 11 -s 6 human.reads human_s6_k11
   ./builder -v -k 16 -s 6 human.reads human_s6_k16
   ./builder -v -k 22 -s 6 human.reads human_s6_k22
   ./builder -v -k 11 -s 22 human.reads human_s22_k11
   ./builder -v -k 16 -s 22 human.reads human_s22_k16
   ./builder -v -k 22 -s 22 human.reads human_s22_k22
   ./builder -v -k 11 -s 512 human.reads human_s512_k11
   ./builder -v -k 16 -s 512 human.reads human_s512_k16
   ./builder -v -k 22 -s 512 human.reads human_s512_k22
b) Run the benchmarks using the 'cgk-test' program:
   ./cgk-test -k 11,16,22 -l 151 ecoli.reads ecoli_s6_k
   ./cgk-test -k 11,16,22 -l 151 ecoli.reads ecoli_s25_k
   ./cgk-test -k 11,16,22 -l 151 ecoli.reads ecoli_s512_k
   ./cgk-test -k 11,16,22 -l 151 human.reads human_s6_k
   ./cgk-test -k 11,16,22 -l 151 human.reads human_s22_k
   ./cgk-test -k 11,16,22 -l 151 human.reads human_s512_k

3.4. GkArrays
a) Prepare the data. Join the E. coli fastq datasets into a single file:
   cp MiSeq_Ecoli_MG1655_110721_PF_R1.fastq ecoli_r.fastq
   cat MiSeq_Ecoli_MG1655_110721_PF_R2.fastq >> ecoli_r.fastq
b) Run the benchmarks using the 'gk-test' program:
   ./gk-test 75 GRCh37-mutated-75-simulated-errors-42M.fasta
   ./gk-test 151 ecoli_r.fastq

3.5. CRAC
a) Run CRAC:
   ./crac -i GRCh37 -k 22 -r GRCh37-mutated-75-simulated-errors-42M.fasta -m 75 --sam GRCh37k22.sam --nb-threads 1

3.6. CRAC with PgSA
a) Generate the PgSA index:
   ./PgSAgen GRCh37-mutated-75-simulated-errors-42M.fasta human
   ./PgSAgen -r 4 human.pgsa human_s4
   ./PgSAgen -r 6 human.pgsa human_s6
b) Run CRAC with PgSA:
   ./crac -i GRCh37 -p human.pgsa -k 22 -r GRCh37-mutated-75-simulated-errors-42M.fasta -m 75 --sam GRCh37PgSAk22.sam
c) To run CRAC with PgSA for other SA sparsities, use:
   ./PgSAgen -r 4 human.pgsa human_s4
   ./crac -i GRCh37 -p human_s4.pgsa -k 22 -r GRCh37-mutated-75-simulated-errors-42M.fasta -m 75 --sam GRCh37PgSAk22.sam
   ./PgSAgen -r 6 human.pgsa human_s6
   ./crac -i GRCh37 -p human_s6.pgsa -k 22 -r GRCh37-mutated-75-simulated-errors-42M.fasta -m 75 --sam GRCh37PgSAk22.sam
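
The per-sparsity runs of CRAC with PgSA repeat the same two commands with only the sparsity value changed, so they can be generated in a loop. The following is a minimal dry-run sketch, not part of the original package: the function name print_crac_pgsa_runs and the READS variable are hypothetical, and the per-sparsity SAM file name (GRCh37PgSAk22_s<N>.sam) is an assumption introduced here so that successive runs do not write to the same output file. The script only prints the commands; it does not execute them.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the PgSAgen and crac commands
# for each requested suffix-array sparsity.
# Assumption: a distinct SAM file name per sparsity is used here; the
# steps in this README write every run to GRCh37PgSAk22.sam.

READS=GRCh37-mutated-75-simulated-errors-42M.fasta

# Hypothetical helper: one PgSAgen line and one crac line per argument.
print_crac_pgsa_runs() {
    for s in "$@"; do
        echo "./PgSAgen -r ${s} human.pgsa human_s${s}"
        echo "./crac -i GRCh37 -p human_s${s}.pgsa -k 22 -r ${READS} -m 75 --sam GRCh37PgSAk22_s${s}.sam"
    done
}

print_crac_pgsa_runs 4 6
```

After reviewing the printed commands, they can be copied into a terminal or piped to sh from the directory that contains PgSAgen, crac, and human.pgsa.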