tantan

tantan is a tool to mask simple regions (low complexity and short-period tandem repeats) in DNA, RNA, and protein sequences.

Get the latest version: tantan-13.zip

The aim of tantan is to prevent false predictions when searching for homologous regions between two sequences. Simple repeats often align strongly to each other, causing false homology predictions.

There are many older repeat-masking tools, so why make yet another one? Because the standard tools do not always prevent these false predictions.

For example, we masked the genomes of Caenorhabditis elegans and Pristionchus pacificus with DustMasker, and then searched for similar regions between the C. elegans sequence and the reversed P. pacificus sequence. We found many strong similarities, for instance:

caagaactgaagcaccagagactgtaactacaaggactgagccaccaaggactgaacctccaaggactatacctccgaggactgaagctccaaggactga
|| | ||  ||| ||||  |||     || || ||||  ||  |||| ||||  ||   ||| |||| |   ||||  ||||  | |  ||| ||||  |
catggaccaaagtaccatggaccaaggctccatggaccaaggtaccatggaccaaagtaccatggaccaaggctccatggaccaaggtaccatggaccaa

accaccacgaactgaaccaccaaagaccgaagctccaagaactgtgcggccaaagaccgaagctccaatgactgtaccaccacggactgaaccgccaatg
|  |||| | ||  |   ||||  ||||   |||||| | ||       |||  |||| |||  |||  |||     | ||| ||||  |  | |||  |
agtaccatggaccaaggtaccatggaccatggctccatggaccaaagtaccatggaccaaagtaccatggaccaaggctccatggaccaaggctccatgg

actgaagctccaagaactgaagtgccaatgactgaaccaccaaagactgaaccaccaaggacggcgccaccgagaactgaagtatcaatgactttaccac
||  |||  ||| | ||  |||| |||  |||  |  | |||  |||  ||  |||| ||||   |  |||  | ||  ||||| ||  |||     | |
accaaagtaccatggaccaaagtaccatggaccaaggctccatggaccaaagtaccatggaccaaggtaccatggaccaaagtaccatggaccaaggctc

cagagactgtacctccaaatactgaagctccaaggactgaagtaccaatgactgtaccaccacggactgaaccaccaaagaccgaagctccaaggactgt
||  |||   |   |||   ||  |||  ||| ||||  | |  |||  |||       ||| ||||  ||  ||||  |||| |||  ||| ||||   
catggaccaaagtaccatggaccaaagtaccatggaccaaggctccatggaccaaggttccatggaccaaagtaccatggaccaaagtaccatggaccaa

gccgccaaagaccgaagctccaatgactgaagtaccaatgactggaccttccaggactgaagtaccaatgactgaaccaccaaagactgaacaaccaagg
| | |||  |||| |||  |||  |||  | ||||||  |||   |  | |  ||||  ||||||||  |||| |  |||||  |||  |    ||| ||
ggctccatggaccaaagtaccatggaccaaggtaccatggaccatagttccatggaccaaagtaccatggactaaggcaccatggaccaaggctccatgg

acggcgccaccaagaactgaagtatcaatgactttaccaccagagactgtacctccaaagactgaagctccaaggac                       
||   |   ||| | |   ||||| ||  |||   |  ||||  |||     ||||| ||||  | |  ||| ||||                       
accaaggtgccatgtaacaaagtaccatggaccaaagtaccatggaccaaggctccatagaccaaggtaccatggac                       

What is going on here? It is not immediately obvious, but these sequences are tandem repeats. In the reversed P. pacificus sequence (the lower row of each block), for example, the repeat unit is about 15 bases long: catggaccaaagtac, catggaccaaggctc, catggaccaaggtac, and so on. The similarity between the C. elegans repeat unit and the reversed P. pacificus repeat unit is (presumably) pure coincidence, but the repetition amplifies it into a strong similarity. Thus, DustMasker does not eliminate spurious alignments.
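One way to see the repetition in these sequences is to compare a sequence with a shifted copy of itself: at a shift equal to the repeat period, most positions match. The Python sketch below does this for the first row of the reversed P. pacificus sequence shown above. It is only an illustration, not the method tantan uses, and the function names match_fraction and best_period are hypothetical.

```python
def match_fraction(seq: str, period: int) -> float:
    """Fraction of positions i where seq[i] == seq[i + period]."""
    pairs = len(seq) - period
    if pairs <= 0:
        return 0.0
    matches = sum(1 for i in range(pairs) if seq[i] == seq[i + period])
    return matches / pairs


def best_period(seq: str, max_period: int) -> int:
    """Shift in 1..max_period with the highest self-match fraction.

    Ties are broken in favour of the shortest shift.
    """
    return max(range(1, max_period + 1),
               key=lambda k: (match_fraction(seq, k), -k))


# First row of the reversed P. pacificus sequence from the alignment above.
row = ("catggaccaaagtaccatggaccaaggctccatggaccaaggtaccatggaccaaagtac"
       "catggaccaaggctccatggaccaaggtaccatggaccaa")

# Print the self-match fraction at a few shifts for comparison.
for k in (7, 15, 30):
    print(k, round(match_fraction(row, k), 2))
```

Shifts that are multiples of the repeat period should score much higher than other shifts, which is the signature of a tandem repeat. A real masking tool has to cope with insertions, deletions and unknown repeat boundaries, none of which this sketch handles.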

The situation with proteins is similar. We masked some animal and plant proteins with SegMasker, and then looked for similar regions between the animal sequences and the reversed plant sequences. We again found many strong similarities, including this one:

FQDIPSQKTPSQGTPYQDILSQKTPSEAYQDIPSQKTPSQGTPYQDILSQKTPSEAYQDIPSQRTPSQGTPYQDTLSQKTPSEAYQDIPSQKTPSQGTPY
||   |     |    |   || |     |   ||    |   |    || |     |   ||||| |   |    || |    |     | |  |   |
FQNAMSEGSSPQ--KFQNAMSQRTSQQKFQKVMSQRNSPQ--MYENTTSQQTSPQKFQNAMSQRTPLQ--MYENPASQRTSPQMYEYAAFQRTSPQ--MY

QDTLSQKTPSEAYQDIPSQKTSSQGIPYQDIPSQKTSSQGTPYQDIPSQKTPSQETPYQDTLSQKTSSQ                               
    || |    |    || || |   |    || || |   |    || |  |   |    || || |                               
EIATSQRTSPQMYENATSQRTSPQ--IYDNAISQRTSPQ--IYDNAISQRTSPQ--IYDNAISQRTSPQ                               

So SegMasker does not eliminate spurious alignments either.

When we repeated these tests with tantan, however, we found no strong similarities. So tantan does seem to eliminate spurious alignments more reliably than previous methods, though it is impossible to be sure that tantan will always work.
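The tests above compare one set of sequences against the reversal of another. The presumed rationale is that a reversed (not reverse-complemented) sequence is not truly homologous to any real sequence, so any strong similarity found against it is spurious. A minimal Python sketch of preparing such reversed sequences follows. It assumes FASTA-formatted input, and the function name reverse_fasta and the " reversed" header suffix are hypothetical choices, not part of tantan.

```python
def reverse_fasta(text: str) -> str:
    """Reverse each sequence in FASTA-formatted text, without complementing."""
    records = []  # list of (header line, list of sequence lines)
    for line in text.splitlines():
        if line.startswith(">"):
            records.append((line, []))
        elif records and line.strip():
            records[-1][1].append(line.strip())

    out = []
    for header, parts in records:
        out.append(header + " reversed")   # mark the record as a control
        out.append("".join(parts)[::-1])   # reverse only: no complement
    return "\n".join(out) + "\n"


# Tiny made-up record, for illustration only.
print(reverse_fasta(">example\nacg\ntt\n"), end="")
```

The reversed sequences would then be masked and searched in the same way as the real ones, and any strong hits counted as false positives.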

For more details, please see: A new repeat-masking method enables specific detection of homologous sequences, MC Frith, Nucleic Acids Research 2011 39(4):e23.