This benchmark contains simulated DNA reads, which were used to test several methods for aligning bisulfite-converted DNA.
The DNA reads in fastq-sanger format:
The true alignments of the reads to the genome, in seg format:
The genome in fasta format:
In the fastq title lines, the first number is the read's unique name/serial-number, which is cross-referenced in the seg files. You can ignore everything else in the title lines.
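As a minimal sketch of pulling those serial numbers out of a reads file, the shell function below prints the first run of digits found on each title line. The name fastq_serials is hypothetical. The function assumes that every record occupies exactly four lines (no wrapped sequence or quality strings) and that "the first number" means the first run of digits on the title line.

```shell
# Hypothetical helper: print the serial number of each read in a fastq file.
# Assumes 4-line records, so the title lines are lines 1, 5, 9, ...
fastq_serials() {
    awk 'NR % 4 == 1 {
        # Take the first run of digits on the title line; ignore the rest.
        if (match($0, /[0-9]+/)) print substr($0, RSTART, RLENGTH)
    }' "$@"
}
```

It could be used as, for example, fastq_serials dataset-a.fastq | head. Selecting title lines by position rather than by a leading "@" avoids mistaking a quality line that starts with "@" for a title line.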
The datasets were made with DNemulator. Dataset A was made as follows:
fasta-methyl-sim hg19-female/chr*.fa > meth.fa
fasta-polymorph -a poly.seg snp132Common.txt meth.fa meth.fa > poly.fa
fasta-random-chunks -n 1000000 -s 87 poly.fa | fasta-bisulf-sim > dataset-a.fasta
fastq-sim dataset-a.fasta SRR019072.fastq > dataset-a.fastq
Dataset B was made in the same way, but using -s 85 instead of -s 87 and SRR094461 instead of SRR019072.
The read-to-genome alignments were obtained using seg-suite, like this:
seg-swap poly.seg | seg-sort > p.seg
awk '/>/ {print $3, $4, $5, $2, $6 == "+" ? 0 : -$3}' dataset-a.fasta |
seg-sort > a.seg
seg-join p.seg a.seg | cut -f1,4-7 | seg-sort > dataset-a.seg
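To illustrate what the awk step emits, the sketch below applies the same field rearrangement to two made-up header lines. The header layout used here (read name in field 2, then length, sequence name, start, and strand) is an assumption inferred from the awk command itself, not a documented format, and headers_to_seg is a hypothetical name. The output has the read's length first, then the source sequence and start, then the read name and its start within the read. A reverse-strand read gets a start of minus its length, which appears to follow the seg convention of negative coordinates for reverse strands.

```shell
# Hypothetical wrapper around the awk step above (seg-sort omitted).
# The ternary is parenthesized only for readability.
headers_to_seg() {
    awk '/>/ {print $3, $4, $5, $2, ($6 == "+" ? 0 : -$3)}' "$@"
}

# Made-up header lines, for illustration only.
printf '>x 1 87 chr1 1000 +\nACGT\n>x 2 87 chr2 500 -\nACGT\n' | headers_to_seg
# prints:
# 87 chr1 1000 1 0
# 87 chr2 500 2 -87
```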
The genome and snp132Common.txt were obtained from the UCSC Genome Database. SRR019072 and SRR094461 were obtained from the NCBI SRA, using their fastq-dump tool.
A mostly traditional approach improves alignment of bisulfite-converted DNA. MC Frith, R Mori, K Asai. Nucleic Acids Research 2012 40(13):e100.