Databases of protein domains and functional sites have become vital resources for the prediction of protein functions. During the last decade, several signature-recognition methods have evolved to address different sequence analysis problems, resulting in rather different and, for the most part, independent databases. Diagnostically, these resources have different areas of optimum application owing to the different strengths and weaknesses of their underlying analysis methods. Thus, for best results, search strategies should ideally combine all of them. InterPro ([1]) is a collaborative project aimed at providing an integrated layer on top of the most commonly used signature databases by creating a unique, non-redundant characterisation of a given protein family, domain or functional site. The InterPro database integrates the PROSITE, PRINTS, Pfam, ProDom, SMART, TIGRFAMs, PIR superfamily, SUPERFAMILY, Gene3D and PANTHER databases. The InterPro project home page is available at http://www.ebi.ac.uk/interpro.

PROSITE patterns
Some biologically significant amino acid patterns can be summarised in the form of regular expressions. ScanRegExp (by Wolfgang.Fleischmann@ebi.ac.uk)
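To make the link between patterns and regular expressions concrete, here is a minimal Python sketch that translates a pattern written in PROSITE syntax into a Python regular expression. It is not ScanRegExp and does not reproduce its behaviour; the function name prosite_to_regex is hypothetical, and only the common syntax elements are handled (x for any residue, [..] for alternatives, {..} for exclusions, (n) or (n,m) for repeats, < and > for terminal anchors).

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Hypothetical helper for illustration; it covers only the common syntax.
    """
    pattern = pattern.strip().rstrip(".")  # patterns conventionally end with '.'
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split the element into its core and an optional repeat: x(2) or x(2,4)
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":  # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):  # excluded residues
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


# The N-glycosylation motif written in PROSITE syntax.
regex = prosite_to_regex("N-{P}-[ST]-{P}.")
hit = re.search(regex, "AANGSAA")
print(regex, hit.start() if hit else None)
```

Short patterns such as this one match many unrelated sequences by chance, which is one reason the profile and fingerprint methods described in the following sections exist.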
PROSITE profiles
A number of protein families, as well as functional or structural domains, cannot be detected using patterns because of their extreme sequence divergence. Techniques based on weight matrices (also known as profiles) allow the detection of such proteins or domains. A profile is a table of position-specific amino acid weights and gap costs. The profile structure used in PROSITE is similar to, but slightly more general than (Bucher P. et al., 1996 [2]), the one introduced by M. Gribskov and co-workers. pfscan from the Pftools package (by Philipp.Bucher@isrec.unil.ch).
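The idea of position-specific weights can be illustrated with a deliberately simplified Python sketch: an ungapped weight matrix slid along a sequence. This is not the generalized profile format used by pfscan, it omits gap costs entirely, and the toy weights, the default penalty and the function name best_window are assumptions made up for illustration.

```python
from typing import Dict, List, Tuple

# One dictionary of residue weights per profile position (toy values).
Profile = List[Dict[str, float]]


def best_window(profile: Profile, sequence: str,
                default: float = -1.0) -> Tuple[float, int]:
    """Return (best score, start index) of the best ungapped placement.

    Residues missing from a column receive the `default` weight.
    Returns (-inf, -1) if the sequence is shorter than the profile.
    """
    width = len(profile)
    best_score, best_start = float("-inf"), -1
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column.get(residue, default)
                    for column, residue in zip(profile, window))
        if score > best_score:
            best_score, best_start = score, start
    return best_score, best_start


toy_profile: Profile = [
    {"G": 2.0},
    {"K": 1.5, "R": 1.0},
    {"S": 2.0, "T": 2.0},
]
print(best_window(toy_profile, "AAGKSAA"))
```

Unlike a pattern, the matrix gives every residue at every position a graded score, so a divergent family member can still score well overall even when it mismatches at some positions.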
PRINTS
The PRINTS database houses a collection of protein family fingerprints. These are groups of motifs that together are diagnostically more powerful than single motifs because they make use of the biological context inherent in a multiple-motif method. The fingerprinting method arose from the need for a reliable technique for detecting members of large, highly divergent protein super-families. FingerPRINTScan (Scordis P. et al., 1999 [3]).
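The multiple-motif idea can be sketched in Python as a greedy left-to-right scan that requires every motif of a fingerprint to occur in order and without overlap. This is only an illustration of why several motifs in context are more selective than one; the scoring used by FingerPRINTScan is not reproduced here, the motifs are written as regular expressions purely for brevity, and the function name match_fingerprint and the example motifs are hypothetical.

```python
import re
from typing import List, Optional


def match_fingerprint(motifs: List[str], sequence: str) -> Optional[List[int]]:
    """Greedily locate each motif after the end of the previous hit.

    Returns the start position of every motif, or None as soon as one
    motif cannot be found downstream of the previous one.
    """
    positions: List[int] = []
    cursor = 0
    for motif in motifs:
        hit = re.compile(motif).search(sequence, cursor)
        if hit is None:
            return None
        positions.append(hit.start())
        cursor = hit.end()  # the next motif must start after this hit
    return positions


toy_fingerprint = ["GKS", "DE[AG]", "W.C"]  # made-up motifs
print(match_fingerprint(toy_fingerprint, "MGKSAADEGTTWQC"))
print(match_fingerprint(toy_fingerprint, "MDEGAAGKS"))
```

A sequence that contains only one of the motifs, or contains them in the wrong order, is rejected, which is the sense in which the motif group carries more diagnostic weight than any single motif.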
PFAM
Pfam is a database of protein domain families. Pfam contains curated multiple sequence alignments for each family and corresponding hidden Markov models (HMMs) (Eddy S.R., 1998 [4]). Profile hidden Markov models are statistical models of the primary structure consensus of a sequence family. The construction and use of Pfam is tightly tied to the HMMER software package. hmmpfam from the HMMER2.3.2 package (by Sean Eddy, eddy@genetics.wustl.edu, http://hmmer.wustl.edu).
PRODOM
ProDom is a database of protein domain families obtained by automated analysis of the SWISS-PROT and TrEMBL protein sequences. It is useful for analysing the domain arrangements of complex protein families and the homology relationships in modular proteins. ProDom families are built by an automated process based on a recursive use of PSI-BLAST homology searches. ProDomBlast3i.pl (by Emmanuel Courcelle, emmanuel.courcelle@toulouse.inra.fr, and Yoann Beausse, beausse@toulouse.inra.fr), a wrapper on top of the Blast package (Altschul S.F. et al., 1997 [5]).
SMART
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. SMART alignments are optimised manually, and corresponding hidden Markov models (HMMs) are then constructed from them. hmmpfam from the HMMER2.3.2 package (by Sean Eddy, eddy@genetics.wustl.edu, http://hmmer.wustl.edu).
TIGRFAMs
TIGRFAMs are a collection of protein families featuring curated multiple sequence alignments, Hidden Markov Models (HMMs) and associated information designed to support the automated functional identification of proteins by sequence homology. Classification by equivalog family (see below), where achievable, complements classification by orthologs, superfamily, domain or motif. It provides the information best suited for automatic assignment of specific functions to proteins from large scale genome sequencing projects. hmmpfam from the HMMER2.3.2 package (by Sean Eddy, eddy@genetics.wustl.edu, http://hmmer.wustl.edu).
PIR SuperFamily
PIR SuperFamily (PIRSF) is a classification system based on evolutionary relationship of whole proteins. hmmpfam from the HMMER2.3.2 package (by Sean Eddy, eddy@genetics.wustl.edu, http://hmmer.wustl.edu).
SUPERFAMILY
SUPERFAMILY is a library of profile hidden Markov models that represent all proteins of known structure, based on SCOP. hmmpfam/hmmsearch from the HMMER2.3.2 package (by Sean Eddy, eddy@genetics.wustl.edu, http://hmmer.wustl.edu). Optionally, predictions for coiled-coil, signal peptide cleavage sites (SignalP v3) and TM helices (TMHMM v2) are supported (See the FAQs file for details).
GENE3D
Gene3D is supplementary to the CATH database. This protein sequence database contains proteins from complete genomes which have been clustered into protein families and annotated with CATH domains, Pfam domains and functional information from KEGG, GO, COG, Affymetrix and STRINGS. hmmpfam from the HMMER2.3.2 package (by Sean Eddy, eddy@genetics.wustl.edu, http://hmmer.wustl.edu).
PANTHER
The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System was designed to classify proteins (and their genes) in order to facilitate high-throughput analysis. hmmsearch from the HMMER2.3.2 package (by Sean Eddy, eddy@genetics.wustl.edu, http://hmmer.wustl.edu) and blastall from the Blast package (Altschul S.F. et al., 1997 [5]).

References

  1. The InterPro Consortium (R.Apweiler, T.K.Attwood, A.Bairoch, A.Bateman, E.Birney, M.Biswas, P.Bucher, L.Cerutti, F.Corpet, M.D.R.Croning, R.Durbin, L.Falquet, W.Fleischmann, J.Gouzy, H.Hermjakob, N.Hulo, I.Jonassen, D.Kahn, A.Kanapin, Y.Karavidopoulou, R.Lopez, B.Marx, N.J.Mulder, T.M.Oinn, M.Pagni, F.Servant, C.J.A.Sigrist, E.M.Zdobnov). "The InterPro database, an integrated documentation resource for protein families, domains and functional sites." Nucleic Acids Research, 2001, 29(1): 37-40.

  2. Bucher P., Karplus K., Moeri N., and Hofmann K. "A Flexible Motif Search Technique Based on Generalized Profiles." Comput Chem, 1996, 20(1): 3-23.

  3. Scordis P., Flower D.R., and Attwood T.K. "Fingerprintscan: Intelligent Searching of the Prints Motif Database." Bioinformatics, 1999, 15(10): 799-806.

  4. Eddy S.R. "Profile Hidden Markov Models." Bioinformatics, 1998, 14(9): 755-763.

  5. Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z., Miller W., and Lipman D.J. "Gapped Blast and Psi-Blast: A New Generation of Protein Database Search Programs." Nucleic Acids Res, 1997, 25(17): 3389-3402.