The project to identify GPCR genes in the human genome began around the time of the CBRC's foundation. However, it wasn't a smooth beginning. The project was unprecedented at the time, so the participants first had to work out how to begin and how to proceed.
Naturally, there was some general idea of how the project should proceed. Through joint research with Dr. Akiyama, an expert in parallel computing environments, and Dr. Asai, an expert in mathematical modeling, Suwa thought they would be able to conduct powerful analyses by applying sophisticated mathematical methods. "From the start, we had a clear idea on how to proceed. We discussed this idea, talking through concepts one by one until we came up with a concrete way of converting these ideas into a tangible form."
Their idea was to identify GPCR genes by combining multiple analysis tools in a pipeline configuration. They would bring together various existing programs for tasks such as gene discovery, sequence searches, motif identification, and transmembrane prediction, determine an appropriate threshold value for each, pass data from one program to the next, and produce the final analysis results.
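The pipeline idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SEVENS implementation: the `Stage` class, the two toy scoring functions, the threshold values, and the example sequences are all hypothetical stand-ins for the real gene-finding, search, motif, and transmembrane-prediction programs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    """One program in the pipeline (hypothetical structure, for illustration)."""
    name: str
    score: Callable[[str], float]  # scores a candidate sequence
    threshold: float               # candidates scoring below this are dropped


def run_pipeline(candidates: List[str], stages: List[Stage]) -> List[str]:
    """Pass candidates through each stage in order, keeping those that meet its threshold."""
    survivors = list(candidates)
    for stage in stages:
        survivors = [seq for seq in survivors if stage.score(seq) >= stage.threshold]
    return survivors


# Toy stand-ins for real analysis programs (assumptions, not the tools SEVENS used).
HYDROPHOBIC = set("AILMFVW")


def hydrophobic_fraction(seq: str) -> float:
    """Crude proxy for transmembrane prediction: share of hydrophobic residues."""
    return sum(1 for aa in seq if aa in HYDROPHOBIC) / len(seq) if seq else 0.0


def motif_score(seq: str) -> float:
    """Crude proxy for motif identification: 1.0 if the toy motif 'DRY' is present."""
    return 1.0 if "DRY" in seq else 0.0


stages = [
    Stage("transmembrane-like filter", hydrophobic_fraction, threshold=0.5),
    Stage("motif filter", motif_score, threshold=1.0),
]

candidates = ["AILMFVWDRYAILV", "GGSSDRYGGSS", "AILMFVWAILV"]
print(run_pipeline(candidates, stages))
```

In this toy run only the first sequence passes both stages. The order of the stages and the value of each threshold are exactly the kind of design parameters the project members had to tune.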
From the start of the project, the members worked hard on constructing the pipeline. More than a year of basic research was needed before a single pipeline could efficiently detect GPCR genes with high sensitivity and specificity. The project members didn't want gene prediction to be something akin to weather forecasting; if they could not conduct analysis with extremely high levels of sensitivity and specificity, the project would have no meaning. Many days were spent improving and reforming the system, repeatedly delving into questions such as: What threshold value should be used when analyzed data is passed from one program to another? What are the sensitivity and specificity of each program? In what order should the programs be combined? At the same time, extra consideration was given to the overall structure of the pipeline. How could they use the capabilities of the computers to increase precision while also conducting automated analysis efficiently? Suwa was able to show a 3 cm thick file filled with revised plans documenting each improvement to the pipeline.
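To make the threshold question concrete, the following sketch computes sensitivity and specificity for a scored set of candidates at several thresholds. The scores and labels are invented toy data, and the function name is an assumption for illustration; it is not taken from the project's own tooling.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    scores: Sequence[float], labels: Sequence[bool], threshold: float
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) when scores >= threshold are called GPCRs."""
    tp = fn = tn = fp = 0
    for score, is_gpcr in zip(scores, labels):
        predicted = score >= threshold
        if is_gpcr and predicted:
            tp += 1          # true GPCR, correctly detected
        elif is_gpcr:
            fn += 1          # true GPCR, missed
        elif predicted:
            fp += 1          # non-GPCR, wrongly detected
        else:
            tn += 1          # non-GPCR, correctly rejected
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Invented toy data: four known GPCRs followed by four known non-GPCRs.
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
labels = [True, True, True, True, False, False, False, False]

for threshold in (0.35, 0.5, 0.65):
    print(threshold, sensitivity_specificity(scores, labels, threshold))
```

On this toy data a low threshold of 0.35 catches every GPCR but lets one non-GPCR through, while a high threshold of 0.65 rejects every non-GPCR but misses one GPCR. Balancing this trade-off at every hand-off between programs is what made tuning the pipeline so laborious.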
The project progressed through trial and error. In the process of building the database, hundreds of novel GPCR genes were identified. Although this did not immediately benefit the pharmaceutical field, it did have an impact through patent applications and disclosures resulting from the newly discovered genes.
The project, which had begun in 2001, was preliminarily completed in 2003. Given the name "SEVENS", the GPCR database was fully formed and released on the Internet (http://sevens.cbrc.jp). The sensitivity and specificity achieved in identifying GPCR genes through the aforementioned pipeline were high, at 99.4% and 96.6% respectively. Naturally, this did not mean the project members' work was finished. Even after the release in 2003, further improvements were made to the system, with the aim of building a database enriched with as much functional information as possible. The database continues to be improved to make it easier to use, taking into account feedback from researchers who access SEVENS.
Development of "GRIFFIN", a tool for predicting which type of G-protein a GPCR selectively couples with, was tackled in parallel with the project. Given ligand data and a GPCR sequence as input, GRIFFIN predicts which G-protein the GPCR will couple with. During development of the tool, a method was sought for extracting the regions essential to G-protein coupling selectivity from the GPCR sequence and for functionally categorizing GPCRs on that basis. In other words, by extracting n features that affect G-protein selectivity from the physicochemical properties of ligands, GPCRs and G-proteins, each GPCR is expressed as an n-dimensional vector. Prediction then amounts to determining which region of the n-dimensional space the vector falls in. Using a support vector machine (SVM), a machine learning method, G-protein coupling can be predicted with close to 90% sensitivity and specificity.
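The idea of representing each GPCR as a feature vector and letting an SVM separate regions of that space can be illustrated with a very small linear SVM. This is a sketch under strong assumptions, not GRIFFIN's method: the two-dimensional feature vectors and labels are invented, only two classes are used, and the model is a plain linear SVM trained by subgradient descent on the regularized hinge loss. The function names, learning rate, and regularization strength are all hypothetical choices.

```python
from typing import List, Sequence, Tuple


def train_linear_svm(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    lr: float = 0.1,
    lam: float = 0.01,
    epochs: int = 100,
) -> Tuple[List[float], float]:
    """Fit weights w and bias b by subgradient descent on the regularized hinge loss.

    Labels must be +1 or -1.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Point is inside the margin: move the boundary away from it.
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Point is safely classified: apply only the regularization shrinkage.
                w = [wj - lr * lam * wj for wj in w]
    return w, b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return +1 or -1 depending on which side of the hyperplane x lies."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Invented 2-D feature vectors; +1 and -1 stand for two hypothetical coupling classes.
X = [(2.0, 2.0), (2.5, 1.5), (3.0, 2.5), (-2.0, -2.0), (-1.5, -2.5), (-3.0, -1.0)]
y = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])
print(predict(w, b, (4.0, 4.0)), predict(w, b, (-4.0, -4.0)))
```

A real predictor of this kind would use many more features, would typically distinguish more than two classes, and might use a nonlinear kernel. The sketch only shows how a trained boundary assigns a new vector to one region of the feature space.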