Project Ideas
These are indications of my current interests, rather than things I
will definitely work on. I have not thoroughly checked previous
research in all of these areas - some may be moot.
Genome codes
- Infer codes from substitutions
- Can we re-discover the genetic code from substitution patterns
between homologous sequences? Can we discover any new codes?
Functional Non-Protein-Coding Sequence
- Nucleotide interactions
- Identify trans-interacting nucleotide sequences from
interaction-preserving substitutions between species.
Genome Evolution
- Freezing
- Evolution of greater complexity seems to be accompanied by
sequential "freezing" of characters (e.g. nucleosomes, body plan,
number of limbs, brain elaboration). Can we identify corresponding
freezes in genome sequence evolution?
- Gene decay
- Catalog cases of gene decay and try to correlate them with changes
in phenotype / environment / lifestyle. One example is the loss of a
functional vitamin C synthesis enzyme (L-gulonolactone oxidase) in
primates and guinea pig.
- "Interesting" protein evolution
- Assess the extent of non-standard protein evolution, meaning
anything other than duplication and small insertions, deletions, and
substitutions. For instance: changes in reading frame, fusion, and
fission. Which human proteins have interesting histories?
- Tandem repeat TFBSs
- Genomes are full of fast-evolving tandem repeats, and in some
cases the repeat unit happens to match a transcription factor binding
motif. Repeated binding sites are known to affect transcription
strongly. So how have tandem repeats contributed to evolution of gene
expression?
- CpG island evolution
- How have CpG islands evolved (e.g. appearance and disappearance),
and how has this altered gene expression?
- Transposable element effects on transcription
- Have transposable element insertions near promoter regions altered
transcription?
- Differing modes of evolution
- An alignment of 70% identical insect genomes looks different from
an alignment of 70% identical vertebrate genomes. Can we quantify and
explain this?
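As a possible starting point for the CpG island question above, here is a minimal sketch in Python that scores windows of a sequence by the standard observed/expected CpG ratio, (number of CG dinucleotides x length) / (number of C x number of G). The function names, window size, and step are illustrative assumptions, not an established tool; comparing such scores between orthologous regions of two species would be one crude way to flag candidate appearances or disappearances.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio of a DNA string (0.0 if it has no C or no G)."""
    seq = seq.upper()
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # CG cannot overlap itself, so count() is exact
    if c == 0 or g == 0:
        return 0.0
    return cpg * len(seq) / (c * g)


def windowed_cpg(seq, window, step):
    """Return (start, ratio) for each full window along the sequence."""
    return [(start, cpg_obs_exp(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, step)]


if __name__ == "__main__":
    # Toy sequence: a CpG-rich half followed by an AT-only half.
    for start, ratio in windowed_cpg("CGCGATAT", window=4, step=4):
        print(start, ratio)
```

Real CpG island definitions also use thresholds on GC content and length, which this sketch leaves out.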
Sequence Alignment
An old topic, but fundamental, and I believe there is room for
improvement. New sequencing technologies like 454 are producing ever
more data that needs to be aligned.
- Adaptive seeds
- Genome-scale sequence alignment remains difficult. Seed-and-extend
methods could be improved by choosing seeds based on multiplicity
rather than size (e.g. use longer seeds for repetitive sequence).
- Better extension
- The extension phase of seed-and-extend methods like BLAST receives
little attention, and it tends to be kludgy. Can we develop an
extension algorithm with useful theoretical guarantees? For example, it
could guarantee to return a subset of the alignments defined by some
reasonable optimality criterion.
- Orthologs vs outparalogs
- Genome-scale alignment tends to produce too many results. Can we
filter outparalog alignments accurately?
- DNA multiple alignment
- Anecdotally, multiple alignment algorithms such as Clustal give
poor results with DNA sequences (e.g. promoters), perhaps because they
were designed more for proteins. Does any existing method work well?
If not, can we do better?
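The adaptive seed idea above can be illustrated with a small sketch in Python. For each start position in a query, it takes the shortest substring that occurs at least once but no more than max_count times in the target, so repetitive regions automatically get longer seeds. The names adaptive_seeds and max_count are hypothetical, and the brute-force counting is only for illustration: a genome-scale version would presumably need an index such as a suffix array.

```python
def count_occurrences(text, pattern):
    """Count possibly-overlapping occurrences of pattern in text."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def adaptive_seeds(query, target, max_count):
    """For each query start, find the shortest substring that occurs in
    target at least once and at most max_count times.

    Returns a list of (query_start, seed_length, occurrences).
    Starts with no qualifying substring are omitted.
    """
    seeds = []
    for start in range(len(query)):
        for end in range(start + 1, len(query) + 1):
            n = count_occurrences(target, query[start:end])
            if n == 0:
                break  # extending further cannot create a match
            if n <= max_count:
                seeds.append((start, end - start, n))
                break
    return seeds


if __name__ == "__main__":
    target = "ACGTACGTTTGA"
    query = "ACGTT"
    for seed in adaptive_seeds(query, target, max_count=1):
        print(seed)
```

In this toy case the repeated "ACGT" forces longer seeds at the start of the query, while lowering the multiplicity threshold trades sensitivity for fewer hits to extend.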
Protein Identification
There is an opinion that this is mostly solved: once a genome is
sequenced, its complement of encoded proteins is basically known. I
believe this is false.
- Find proteins by synonymous vs nonsynonymous substitutions
- Identify proteins from an excess of synonymous over nonsynonymous
substitutions. (Genome-wide application of the idea in Frith
et al. 2006.)
- Phylogenetic synonymous vs nonsynonymous method
- Extend the synonymous versus nonsynonymous method from pairwise to
multiple, closely-related sequences. This should increase the
statistical power to detect arbitrarily short and fast-evolving
proteins.
- Non-pseudogenes
- Identify proteins whose reading frames are damaged in the
reference genome sequence, but not in alternate haplotypes.
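To make the synonymous versus nonsynonymous idea concrete, here is a rough sketch in Python. It is not the method of Frith et al. 2006, only a crude illustration: given two aligned DNA sequences, it counts differing codons in each of the three forward frames and classifies each as synonymous or nonsynonymous under the standard genetic code. A frame with many synonymous and few nonsynonymous differences is a candidate coding frame. The function name and the per-codon (rather than per-site) counting are simplifying assumptions, and no statistical test is included.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def count_syn_nonsyn(seq1, seq2, frame):
    """Count differing codons between two aligned sequences in one frame.

    Returns (synonymous, nonsynonymous). Codons containing gaps or
    ambiguous letters are skipped.
    """
    syn = nonsyn = 0
    n = min(len(seq1), len(seq2))
    for i in range(frame, n - 2, 3):
        c1 = seq1[i:i + 3].upper()
        c2 = seq2[i:i + 3].upper()
        if c1 == c2 or c1 not in CODON_TABLE or c2 not in CODON_TABLE:
            continue
        if CODON_TABLE[c1] == CODON_TABLE[c2]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn


if __name__ == "__main__":
    a = "ATGGCTAAAGGT"
    b = "ATGGCCAAGGGA"
    for frame in range(3):
        print(frame, count_syn_nonsyn(a, b, frame))
```

In the toy pair, all differences fall at third codon positions of frame 0, so that frame shows only synonymous changes while the other two frames show only nonsynonymous ones. A genome-wide scan would also need the reverse strand and a proper significance measure.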
Motifs
- Large-scale motif discovery
- There are many algorithms to identify subtle, recurring motifs in
small sequence sets (e.g. promoters): can we develop one for
less-subtle motifs in large sequences (e.g. bacterial genomes)? This
might help identify RNA genes, for instance.
- Diagnostic motif discovery
- Systematically apply motif discovery algorithms (rather than
making yet another one) to find motifs diagnostic of various
categories of protein, RNA, and maybe DNA sequences.

Last modified 2007-05-11