Project Ideas

These are indications of my current interests, rather than things I will definitely work on. I have not thoroughly checked previous research in all of these areas, so some ideas may already be moot.

Genome codes

Infer codes from substitutions
Can we re-discover the genetic code from substitution patterns between homologous sequences? Can we discover any new codes?

Functional Non-Protein-Coding Sequence

Nucleotide interactions
Identify trans-interacting nucleotide sequences from interaction-preserving substitutions between species.

Genome Evolution

Freezing
Evolution of greater complexity seems to be accompanied by sequential "freezing" of characters (e.g. nucleosomes, body plan, number of limbs, brain elaboration). Can we identify corresponding freezes in genome sequence evolution?
Gene decay
Catalog cases of gene decay and try to correlate them with changes in phenotype / environment / lifestyle. One example is the loss of the vitamin C synthesis enzyme (L-gulonolactone oxidase) in primates and guinea pigs.
"Interesting" protein evolution
Assess the extent of non-standard protein evolution, meaning anything other than duplication and small insertions, deletions, and substitutions. For instance: changes in reading frame, fusion, and fission. Which human proteins have interesting histories?
Tandem repeat TFBSs
Genomes are full of fast-evolving tandem repeats, and in some cases the repeat unit happens to match a transcription factor binding motif. Repeated binding sites are known to affect transcription strongly. So how have tandem repeats contributed to evolution of gene expression?
CpG island evolution
How have CpG islands evolved (e.g. appearance and disappearance), and how has this altered gene expression?
Transposable element effects on transcription
Have transposable element insertions near promoter regions altered transcription?
Differing modes of evolution
An alignment of insect genomes that are 70% identical looks different from an alignment of vertebrate genomes that are 70% identical. Can we quantify and explain this?

Sequence Alignment

An old topic, but a fundamental one, and I believe there is room for improvement. We also need to cope with ever more data from new sequencing technologies such as 454.

Adaptive seeds
Genome-scale sequence alignment remains difficult. Seed-and-extend methods could be improved by choosing seeds based on multiplicity (how often a seed occurs in the target) rather than on a fixed size (e.g. use longer seeds for repetitive sequence).
Better extension
The extension phase of seed-and-extend methods like BLAST receives little attention, and it tends to be kludgy. Can we develop an extension algorithm with useful theoretical guarantees? For example, it could guarantee to return a subset of the alignments defined by some reasonable optimality criterion.
Orthologs vs outparalogs
Genome-scale alignment tends to produce too many results. Can we filter outparalog alignments accurately?
DNA multiple alignment
Anecdotally, multiple alignment algorithms such as Clustal give poor results with DNA sequences (e.g. promoters), perhaps because they were designed more for proteins. Does any existing method work well? If not, can we do better?
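
To make the "Adaptive seeds" idea above concrete, here is a minimal sketch. It assumes one possible definition: at each query position, take the shortest substring that occurs at most max_hits times in the reference. The function names, the max_hits parameter, and the naive substring counting are all illustrative assumptions; a genome-scale version would need an index such as a suffix array.

```python
def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def adaptive_seeds(query, reference, max_hits=1):
    """For each query position, find the shortest seed starting there
    that occurs at most max_hits times in the reference.

    Returns a list of (query_start, seed_length, hit_count) tuples.
    Positions whose seeds stop matching before becoming rare enough
    are skipped. This is a naive illustration, not an efficient method.
    """
    seeds = []
    for i in range(len(query)):
        for length in range(1, len(query) - i + 1):
            hits = count_occurrences(reference, query[i:i + length])
            if hits == 0:
                break  # a longer seed cannot match either
            if hits <= max_hits:
                seeds.append((i, length, hits))
                break
    return seeds


if __name__ == "__main__":
    # The repetitive "ACAC..." prefix forces a longer seed at position 0.
    for start, length, hits in adaptive_seeds("ACGT", "ACACACGT"):
        print(start, length, hits)
```

In this toy case the seed starting in the repetitive part of the reference has to grow to length 3 before it is unique, whereas seeds in the non-repetitive part are shorter, which is the behaviour the idea asks for.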

Protein Identification

There is an opinion that this is mostly solved: once a genome is sequenced, its complement of encoded proteins is basically known. I believe this is false.

Find proteins by synonymous vs nonsynonymous substitutions
Identify proteins from an excess of synonymous over nonsynonymous substitutions. (Genome-wide application of the idea in Frith et al. 2006.)
Phylogenetic synonymous vs nonsynonymous method
Extend the synonymous versus nonsynonymous method from pairwise to multiple, closely-related sequences. This should increase the statistical power to detect arbitrarily short and fast-evolving proteins.
Non-pseudogenes
Identify proteins whose reading frames are damaged in the reference genome sequence, but not in alternate haplotypes.
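
The synonymous versus nonsynonymous signal mentioned above can be illustrated with a crude sketch. It assumes two gap-free, already-aligned DNA sequences and the standard genetic code, and simply counts differing codons in a given reading frame as synonymous or nonsynonymous. This is not the method of Frith et al. 2006, nor a proper rate estimate (it does not normalise by the number of synonymous and nonsynonymous sites); the function name and the frame parameter are illustrative assumptions.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_syn_nonsyn(seq1, seq2, frame=0):
    """Count differing codons between two gap-free aligned sequences.

    Returns (synonymous, nonsynonymous) counts for the given reading
    frame (0, 1, or 2). Codons containing anything other than A, C, G, T
    are skipped, as are trailing partial codons.
    """
    seq1, seq2 = seq1.upper(), seq2.upper()
    syn = nonsyn = 0
    end = min(len(seq1), len(seq2))
    for i in range(frame, end - 2, 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        if c1 == c2 or c1 not in CODON_TABLE or c2 not in CODON_TABLE:
            continue
        if CODON_TABLE[c1] == CODON_TABLE[c2]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn


if __name__ == "__main__":
    a = "ATGGCTAAAGGT"  # Met Ala Lys Gly
    b = "ATGGCCAAGGGA"  # Met Ala Lys Gly, third-position changes only
    for frame in range(3):
        print(frame, count_syn_nonsyn(a, b, frame))
```

In this toy pair every difference is synonymous in frame 0 but nonsynonymous in frames 1 and 2, so frame 0 looks protein-coding. A real application would need a statistical test for the excess, and handling of gaps and of multiple substitutions per codon.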

Motifs

Large-scale motif discovery
There are many algorithms to identify subtle, recurring motifs in small sequence sets (e.g. promoters): can we develop one for less-subtle motifs in large sequences (e.g. bacterial genomes)? This might help identify RNA genes, for instance.
Diagnostic motif discovery
Systematically apply motif discovery algorithms (rather than making yet another one) to find motifs diagnostic of various categories of protein, RNA, and maybe DNA sequences.

Last modified 2007-05-11