Bisulfite alignment benchmark

This benchmark contains simulated DNA reads, which were used to test several methods for aligning bisulfite-converted DNA.

The data

The DNA reads in fastq-sanger format:

The true alignments of the reads to the genome, in seg format:

The genome in fasta format:

In the fastq title lines, the first number is the read's unique serial number, which serves as the read's name and is cross-referenced in the seg files. You can ignore everything else in the title lines.
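To pull out just those serial numbers, something like the following may be enough. It is a minimal sketch: the function name read_serials is hypothetical, and it assumes each fastq record occupies exactly four lines (no wrapped sequence or quality lines) and that the serial number is the first run of digits in the title line.

```shell
# Print the first number found in each fastq title line.
# Assumption: 4 lines per record, so title lines are lines 1, 5, 9, ...
# (Testing for a leading "@" alone is unsafe, since quality lines
# may also begin with "@".)
read_serials() {
    awk 'NR % 4 == 1 && match($0, /[0-9]+/) {
        print substr($0, RSTART, RLENGTH)
    }' "$@"
}

# Example usage (file name taken from the commands in this document):
#   read_serials dataset-a.fastq | head
```

The resulting numbers can then be looked up in the corresponding seg file to find each read's true alignment.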

How it was created

The datasets were made with DNemulator. Dataset A was made as follows:

fasta-methyl-sim hg19-female/chr*.fa > meth.fa

fasta-polymorph -a poly.seg snp132Common.txt meth.fa meth.fa > poly.fa

fasta-random-chunks -n 1000000 -s 87 poly.fa |
fasta-bisulf-sim > dataset-a.fasta

fastq-sim dataset-a.fasta SRR019072.fastq > dataset-a.fastq

Dataset B was made in the same way, but using -s 85 instead of -s 87 and SRR094461 instead of SRR019072.
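Spelled out, the two substitutions for Dataset B might look like the sketch below. The function name make_dataset and the output names dataset-b.fasta and dataset-b.fastq are assumptions, chosen by analogy with Dataset A. The sketch also assumes that the same poly.fa is reused, which the description above does not state explicitly.

```shell
# make_dataset NAME S_VALUE SRA_FASTQ
# Runs the last two steps of the Dataset A recipe with a different
# -s value and a different real fastq file passed to fastq-sim.
make_dataset() {
    name=$1
    s_value=$2
    sra_fastq=$3

    fasta-random-chunks -n 1000000 -s "$s_value" poly.fa |
    fasta-bisulf-sim > "dataset-$name.fasta"

    fastq-sim "dataset-$name.fasta" "$sra_fastq" > "dataset-$name.fastq"
}

# Dataset A as described above:
#   make_dataset a 87 SRR019072.fastq
# Dataset B, with the two substitutions:
#   make_dataset b 85 SRR094461.fastq
```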

The read-to-genome alignments were obtained using seg-suite, like this:

seg-swap poly.seg | seg-sort > p.seg

awk '/>/ {print $3, $4, $5, $2, $6 == "+" ? 0 : -$3}' dataset-a.fasta |
seg-sort > a.seg

seg-join p.seg a.seg | cut -f1,4-7 | seg-sort > dataset-a.seg

Data sources

The genome and snp132Common.txt were obtained from the UCSC Genome Database. SRR019072 and SRR094461 were obtained from the NCBI SRA, using their fastq-dump tool.

Reference

A mostly traditional approach improves alignment of bisulfite-converted DNA. MC Frith, R Mori, K Asai. Nucleic Acids Research 2012 40(13):e100.