Main claim
- An exact local alignment algorithm that is much faster than Smith-Waterman-Gotoh. Wow!
Find the highest-scoring local alignment of two sequences:
Smith-Waterman-Gotoh algorithm. This is a dynamic programming (DP) algorithm, which calculates a score matrix of size m*n. So the time requirement is O(m*n). This is too slow for genome-size data.
To explain this, let's start with a simple but very slow algorithm, and then improve it.
Align P to every suffix of T. For every suffix S, we calculate a DP matrix of size m*|S|. This obviously works, but the time requirement is O(m*n*n).
Only consider alignments that begin at the start of S. This is OK because alignments that do not begin at the start of S will be picked up when we consider other suffixes. To do this, whenever the DP score becomes <= 0, reset it to -infinity. In other words, we are not allowed to re-start alignments past the beginning of S.
"Pruning": Calculate the DP matrix row-by-row, and quit early if all the scores in one row are -infinity. (Each row has m entries.)
Suffix merging: Many of the suffixes probably begin with the same sequence. For example, many suffixes begin with aggct... The first few rows of the DP matrix will be identical for these suffixes. We can avoid repeating these identical calculations.
Example: Suppose two suffixes are tggcac and tggctggcac. First, we calculate the DP matrix for P versus tggcac. Then, we keep the first four rows of the matrix, and fill in the remaining rows for P versus tggctggcac.
When we merge suffixes like this, it looks like a tree, so this is called a "suffix tree".
It's hard to tell. Probably, it works better when the mismatch and gap costs are high compared to the match score, so that pruning happens earlier.
The article reports that it is "at least 1000 times faster" than Smith-Waterman-Gotoh, with a match score of +1 and a mismatch score of -3. It also claims that, for gapless alignment (match score = +1, mismatch score = -3, gap score = -infinity), the time requirement is O(m * n^0.628).
Standard data structures for suffix trees take a lot of memory, making genomic applications difficult.
A suffix array holds the suffixes of the text in alphabetical order. If we take the suffixes in alphabetical order, the "suffix merging" described above will work efficiently. For example:
Text: BANANA$
| Suffixes: | Sorted suffixes: |
|---|---|
| 0:BANANA$ | 1:ANANA$ |
| 1:ANANA$ | 3:ANA$ |
| 2:NANA$ | 5:A$ |
| 3:ANA$ | 0:BANANA$ |
| 4:NA$ | 2:NANA$ |
| 5:A$ | 4:NA$ |
| 6:$ | 6:$ |
Suffix array: 1,3,5,0,2,4,6
Memory requirement: 4n bytes (assuming each number requires 4 bytes). This is less than for traditional suffix tree data-structures, but it is still a bit high.
A Burrows-Wheeler transform (BWT) is closely related to a suffix array. Instead of using numbers, it uses the preceding letter of each suffix. For example:
Text: BANANA$
| Suffixes: | Sorted suffixes: |
|---|---|
| $:BANANA$ | B:ANANA$ |
| B:ANANA$ | N:ANA$ |
| A:NANA$ | N:A$ |
| N:ANA$ | $:BANANA$ |
| A:NA$ | A:NANA$ |
| N:A$ | A:NA$ |
| A:$ | A:$ |
BWT: BNN$AAA
How can we use the BWT? It's a bit complicated. Suppose we know that suffixes starting with gcca are at positions [i..j) within the suffix array. As explained below, we can deduce the positions [p..q) where suffixes starting with Zgcca are in the suffix array. (Z = one letter.)
Since this method goes backwards (gcca -> Zgcca), we make a Burrows-Wheeler transform of the reversed sequence. Using this method, we can get all the suffixes that exist in the text in alphabetical order (a -> aa, aa -> aaa, etc.)
Method detail: How can we deduce [p..q) from [i..j) ?
Memory requirement: 2n bits (assuming each letter requires 2 bits). Also, some memory for C(Z) and Occ(Z, i). Note: we do not need the suffix array, and we do not even need the original sequence!
This BWT algorithm is attributed to: P Ferragina & G Manzini, "Opportunistic data structures with applications", FOCS (2000) 390-398.