Efficient New Algorithms for Optimally Aligning Short Motifs

Paul Horton

In this paper we introduce two key concepts and corresponding algorithms for the problem of local multiple alignment of motifs (substrings) from a collection of sequences. The first algorithm is independent of the scoring function while the second assumes an information theoretic entropy scoring function.

The first key concept is the notion of a consistent local alignment. We show that enumerating the consistent local alignments of a set of sequences is sufficient to optimally solve the local alignment problem for any scoring function. Thus the algorithm can be used for efficient parametric local alignment. Moreover, in the limit as the number of sequences n grows, while all other parameters, in particular alphabet size σ and motif width w, are held fixed, the number of consistent local alignments does not depend on n. Thus for short motifs with small alphabets our algorithm computes optimal alignments in O ( ln ) time, where l is the length of the sequences. This compares favorably to the na\"{\i}ve algorithm which requires O ( l n ) time.

The second algorithm uses a property of a specific scoring function, namely the entropy scoring function. We prove that for this scoring function any consistent partial alignment which scores worse than a lower bound on the overall optimal score can be discarded without fear of missing the optimal alignment.
The second algorithm is particularly fast in practice; our current implementation can find the optimal motif alignment for 60 DNA sequences with a motif length of 4. The best previously known algorithm took more time to align the first 30 sequences alone.

In the appendix we also briefly describe an alternative algorithm for local multiple sequence alignment which requires O ( nw(σ - 1)nl ) time.


Real World Computing Partnership