Tamotsu NOGUCHI, Kentaro ONIZUKA, Yutaka AKIYAMA, and Minoru SAITO
Parallel Application Laboratory, Tsukuba Research Center
Real World Computing Partnership
DATABASE OF REPRESENTATIVE PROTEIN CHAINS IN PDB (PROTEIN DATA BANK)
Version 9.0 (PDB Rel. #86) Jan 1999
( This document was updated on 1 Mar 1999 )
Noguchi T., Onizuka K., Akiyama Y., Saito M. (1997).
"PDB-REPRDB: A Database of Representative Protein Chains in PDB (Protein Data
Bank)."
In Proceedings of the Fifth International Conference on Intelligent Systems
for Molecular Biology, AAAI Press, Menlo Park, CA.
The PDB-REPRDB consists of a list of representative protein chains. The
criteria for selecting the representatives are:
- quality of the atomic coordinate data,
- sequence uniqueness, and
- conformation uniqueness, particularly local conformation uniqueness.
The first version of PDB-REPRDB consisted of 763 representative chains from
Release 70 (Oct. 1994) at Brookhaven National Laboratory and was released on
the GenomeNet WWW server (http://www.genome.ad.jp/htbin/show_pdbreprdb) in
The second version (PDB-REPRDB Ver. 2.0) used data from the PDB Release 78
(Oct. 1996) and was released in April 1997 on our server
The third version (PDB-REPRDB Ver. 3.0) used data from the PDB Release 80
"PDB-REPRDB" has been available on the server
The fourth version (PDB-REPRDB Ver. 4.0) uses data from PDB Release 81 (Jul.
1997) and includes
Although the selection criteria remain essentially the same as in
the first version, from the fifth version the selection procedure has been
almost completely automated by the use of a parallelized algorithm for quick
selection of representative chains.
2. COPYRIGHT NOTICE
Copyright 1998,1999 by Real World Computing Partnership (RWCP). All rights
reserved.
For further information regarding permission for use or reproduction, please
contact Tamotsu Noguchi, Real World Computing Partnership, Parallel Application
Laboratory, Tsukuba Research Center, Tsukuba Mitsui Bldg., 1-6-1 Takezono,
The representative protein chains are selected as follows.
( Since R-factor and resolution values are written in unformatted form in
the PDB files, the PDB file reading program may in some cases find an
incorrect value, or fail to find a value, for R-factor and/or resolution. )
- Exclude the following entries from the selection
- DNA and RNA data
- theoretically modeled data
- short chains (l < 40 residues)
- data with incomplete backbone coordinates
- data with incomplete side chain coordinates
- data without refinement (by X-PLOR, TNT, etc.)
- Data from NMR spectroscopy are included in version 4.0
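The exclusion rules above can be sketched as a simple predicate. This is only an illustration, not the program used to build the database: the `Chain` record and all of its field names are hypothetical, and how "incomplete" coordinates or "without refinement" are actually detected in a PDB file is not specified here, so they are represented as plain boolean flags.

```python
from dataclasses import dataclass

MIN_LENGTH = 40  # chains shorter than 40 residues are excluded


@dataclass
class Chain:
    """Hypothetical per-chain summary; the field names are illustrative only."""
    name: str
    length: int
    is_nucleic_acid: bool = False       # DNA or RNA data
    is_theoretical_model: bool = False  # theoretically modeled data
    backbone_complete: bool = True
    side_chains_complete: bool = True
    is_refined: bool = True             # e.g. refined by X-PLOR, TNT, etc.


def is_candidate(chain: Chain) -> bool:
    """Return True if the chain survives every exclusion rule in the list."""
    return (
        not chain.is_nucleic_acid
        and not chain.is_theoretical_model
        and chain.length >= MIN_LENGTH
        and chain.backbone_complete
        and chain.side_chains_complete
        and chain.is_refined
    )
```

A list of chains could then be filtered with `[c for c in chains if is_candidate(c)]` before the quality sort.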
The selected chains in the entries are sorted according to their data
quality as follows:
First, the selected chains are classified into three classes.
Class A chains are those with both good resolution (<= 3.0 angstrom)
and good R-Factor (<= 0.3). Class B chains are those with poorer resolution
(> 3.0 angstrom) or poorer R-Factor (> 0.3). The chains derived by NMR
spectroscopy are classified into class C.
Second, we sort the chains by the resolution of structure determination
within each of classes A and B, and then append the class C chains.
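As a rough illustration of the three-way classification, the sketch below assigns a class from the experimental method, resolution and R-Factor. Two points are assumptions rather than statements of the original procedure: the method label "NMR" is a made-up convention, and class B is treated as every non-NMR chain that does not qualify for class A, so that each chain falls into exactly one class.

```python
RESOLUTION_LIMIT = 3.0  # angstrom
R_FACTOR_LIMIT = 0.3


def quality_class(method: str, resolution: float, r_factor: float) -> str:
    """Return "A", "B" or "C" for one chain."""
    if method == "NMR":  # assumed label for chains derived by NMR spectroscopy
        return "C"
    if resolution <= RESOLUTION_LIMIT and r_factor <= R_FACTOR_LIMIT:
        return "A"
    # Assumption: every remaining non-NMR chain belongs to class B.
    return "B"
```

For example, a crystal structure at 2.0 angstrom with an R-Factor of 0.35 is placed in class B under this reading, because only one of the two class A conditions holds.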
Chains with the same resolution are further sorted by R-Factor value.
When several chains have the same resolution and R-Factor, they are further
sorted by:
- the number of chain breaks (the fewer the better)
- the number of non-standard amino acid residues (the fewer the better)
- the number of residues without backbone coordinates (the fewer the better)
- the number of residues without side chain coordinates (the fewer the better)
- whether mutant or wild type (the wild type has priority)
- whether complex or not (the non-complex has priority)
- alphabetical order of the entry name
(e.g. 1MCD < 1MCE, 5AT1A < 5AT1C)
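The quality ordering can be expressed compactly as a tuple sort key. The sketch below is an illustration under several assumptions: the `ChainQuality` record and its field names are hypothetical, smaller resolution and R-Factor values are taken to come first, the tie-breakers are applied in the order in which they are listed, and class C chains (which have no resolution or R-Factor) are ordered among themselves by the tie-breakers only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChainQuality:
    """Hypothetical per-chain quality record; field names are illustrative."""
    name: str               # entry name plus chain ID, e.g. "5AT1A"
    quality_class: str      # "A", "B" or "C"
    resolution: float = 0.0
    r_factor: float = 0.0
    chain_breaks: int = 0
    nonstandard_residues: int = 0
    missing_backbone: int = 0
    missing_side_chain: int = 0
    is_mutant: bool = False
    is_complex: bool = False


def sort_key(c: ChainQuality) -> tuple:
    """Smaller tuples sort first, i.e. indicate better quality."""
    is_nmr = c.quality_class == "C"
    return (
        c.quality_class,                      # A before B before C
        0.0 if is_nmr else c.resolution,      # ignored for class C (assumption)
        0.0 if is_nmr else c.r_factor,
        c.chain_breaks,
        c.nonstandard_residues,
        c.missing_backbone,
        c.missing_side_chain,
        c.is_mutant,                          # False (wild type) sorts first
        c.is_complex,                         # False (non-complex) sorts first
        c.name,                               # alphabetical order
    )


def sort_by_quality(chains: List[ChainQuality]) -> List[ChainQuality]:
    """Return a new list ordered from best to worst quality."""
    return sorted(chains, key=sort_key)
```

Python compares tuples element by element, so a later criterion is consulted only when all earlier ones are equal, which mirrors the "further sorted" wording.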
Similar chains are eliminated from the sorted list as follows:
The first chain in the sorted list has the best quality and is the first
representative. Then, suppose we have already selected N representatives
from the sorted list.
Now, the sorted list does not contain the chains
- that have already been selected and taken out as representatives, or
- that have already been eliminated through the selection procedure
of the N representatives.
Thus, the first chain remaining in the sorted list has the highest priority to
be the next representative. We check the "similarity" between the first
chain of the list and each of the other chains in the sorted list.
After that, the first chain becomes the (N+1)-st representative. If
some chain is similar to the first chain, that chain is eliminated
from the sorted list, and the next remaining chain then moves to the head of
the sorted list. This procedure is repeated until the sorted list is empty.
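One way to read the selection procedure is as a greedy loop over the quality-sorted list: the head of the list becomes a representative, every remaining chain similar to it is dropped, and the loop continues with whatever is left. The sketch below follows that reading; it is a serial illustration only, not the parallelized implementation mentioned earlier, and the function name and the `is_similar` callback are hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def select_representatives(
    sorted_chains: Sequence[T],
    is_similar: Callable[[T, T], bool],
) -> List[T]:
    """Greedily pick representatives from a list sorted from best to worst."""
    remaining = list(sorted_chains)
    representatives: List[T] = []
    while remaining:
        # The best remaining chain is not similar to any earlier
        # representative, so it becomes the next representative.
        head = remaining.pop(0)
        representatives.append(head)
        # Eliminate every remaining chain that is similar to the new one.
        remaining = [c for c in remaining if not is_similar(head, c)]
    return representatives
```

As a toy usage example, with integers standing in for chains and "similar" meaning "differs by less than 3", `select_representatives([1, 2, 3, 5, 6, 9], lambda a, b: abs(a - b) < 3)` keeps 1, 5 and 9.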
We consider two chains NOT similar if either
- the sequence identity is less than a certain threshold value (<=
95%, 85%, 75%, 65%, 55%, 45%, 35% or 25%), where the sequence
identity is measured after a pairwise sequence alignment,
or
- the maximum distance between the "CA" atoms when the two
structures are superimposed is greater than a certain threshold (>= 10, 20,
30, 40, 50 angstrom or infinity).
Before superimposing the two structures, we align the two sequences by
pairwise sequence alignment to check their sequence identity. The
residues in the alignment are superimposed by least-squares fitting.
Finally, all the chains (in class A, B and C) are classified into
protein-chain groups, where each chain is assigned to the group
whose representative chain is nearest to it in sequence.
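The similarity test and the final grouping step can be sketched as follows. This is a minimal illustration that assumes the sequence identity (in percent) and the maximum CA-CA distance after superposition have already been computed elsewhere; the alignment and least-squares fitting themselves are not shown. The function and parameter names are hypothetical, the strict comparisons at the threshold boundary are an assumption, and "nearest in sequence" is taken to mean "highest sequence identity".

```python
import math
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

# Threshold choices listed in the text.
ID_THRESHOLDS = (95.0, 85.0, 75.0, 65.0, 55.0, 45.0, 35.0, 25.0)  # percent
DMAX_THRESHOLDS = (10.0, 20.0, 30.0, 40.0, 50.0, math.inf)        # angstrom


def is_similar(
    identity_percent: float,
    max_ca_distance: float,
    id_threshold: float,
    dmax_threshold: float,
) -> bool:
    """Two chains are NOT similar if either criterion holds."""
    not_similar = (
        identity_percent < id_threshold      # sequences differ enough
        or max_ca_distance > dmax_threshold  # structures differ enough
    )
    return not not_similar


def nearest_representative(
    chain: T,
    representatives: Sequence[T],
    identity: Callable[[T, T], float],
) -> T:
    """Return the representative with the highest sequence identity to chain."""
    return max(representatives, key=lambda rep: identity(chain, rep))
```

With the distance threshold set to infinity, the structural criterion can never be met, so the decision then rests on sequence identity alone.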
4. FORMAT OF THE DATABASE
The database of representative protein chains in PDB, which are
selected by the above method, consists of a Table of
PDB-REPRDB Version 3.0, which shows the number of selected chains at several
thresholds of sequence identity (ID%) and of maximum distance between pairs of
CA atoms, one from each of the two structures (Dmax).
The values in this table are hyperlinked to the list of PDB-REPRDB at
the corresponding threshold sequence identity (ID%) and structure similarity
The list contains PDB entry IDs, chain IDs, "*", number of amino acids
resolution (Res), R-factor (Rfac), experimental method (Methd), the number of
residues with side chain coordinates (n_sid), the number of residues with
backbone coordinates (n_bck), the number of residues with CA coordinates
the number of non-standard amino acid residues (n_naa),
the number of chain breaks, EC number and COMPND
The IDs of protein chains (PDB entry IDs and chain IDs) in the list are
sorted alphabetically and hyperlinked to the list of similar proteins that
were not selected as the representative chain. If the "*" is clicked, the
protein 3D view is displayed by "Rasmol".
In addition, the EC number in the list is hyperlinked to the corresponding
We thank Dr. Susumu Goto and Prof. Minoru Kanehisa at the Institute
for Chemical Research, Kyoto University for useful discussions and