Automated function prediction in protein structures

Methods  

WebFEATURE is a web-based implementation of a machine learning algorithm for functional site recognition in protein structures. A few of the models provided are built from hand-curated training sets, but the majority come from the SeqFEATURE library of models (Wu et al., to be published). SeqFEATURE is a method for automatically selecting training sets and building structural models within FEATURE, an existing framework for modeling functional sites. SeqFEATURE extends this framework by using 1-D sequence motifs as seeds for generating training sets of structural examples (Liang MP et al., 2003).

The FEATURE system
FEATURE is a method for building statistical 3-D models of the local environment around a functional site given training sets of positive and negative examples. These models can then be used to evaluate test sites and predict whether these sites have particular functions. Briefly, FEATURE calculates a number of physicochemical properties at varying radial distances from the site center and creates a feature vector containing the values of each property in each radial volume. Both atomic and residue level properties are examined, allowing the functional site to be described at multiple levels. The structural model is constructed by comparing the statistical distribution of feature vectors between positive and negative sites. Using the Wilcoxon rank sum test, each property in each radial volume is then described as significantly more or less abundant in the positive sites than in the negative sites.
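As a rough illustration of the feature vector described above, the sketch below counts named properties in concentric shells around a site center. It is not the FEATURE implementation: the function name, the input layout, and the default shell count and shell width are assumptions made for this example, and real properties need not be simple counts.

```python
import math


def shell_feature_vector(center, atoms, properties, n_shells=6, shell_width=1.25):
    """Count each property in concentric shells around `center`.

    center     -- (x, y, z) of the site center
    atoms      -- iterable of (x, y, z, property_names) tuples, where
                  property_names is a set of strings (hypothetical layout)
    properties -- ordered list of property names to count
    n_shells, shell_width -- illustrative values, not taken from this page

    Returns a flat list ordered shell by shell, then property by property.
    """
    vector = [0] * (n_shells * len(properties))
    for x, y, z, property_names in atoms:
        distance = math.dist(center, (x, y, z))
        shell = int(distance // shell_width)
        if shell >= n_shells:
            continue  # atom lies outside the outermost shell
        for j, name in enumerate(properties):
            if name in property_names:
                vector[shell * len(properties) + j] += 1
    return vector
```

Keeping one slot per (shell, property) pair means the same property can be informative at one distance from the site center and uninformative at another.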

Using a naive Bayes scoring function, FEATURE can then evaluate the likelihood that a test site contains the function described by a particular model. A feature vector is created for the test site in the same way as for the training sites, and scored assuming independence of each individual feature vi:

Score = Sum(i) log [P(Site | vi)/P(Site)]

A score cutoff for classifying a site can be chosen for each model according to the user's desired sensitivity and specificity requirements; the default is 99% specificity. The FEATURE system is described in more detail elsewhere (see publications).
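The scoring function above can be sketched in a few lines. This example assumes that the per-feature posteriors P(Site | vi) have already been looked up from a trained model; how FEATURE derives them is not described on this page, so the function name and inputs here are hypothetical.

```python
import math


def naive_bayes_score(posteriors, prior):
    """Return Sum(i) log[P(Site | vi) / P(Site)].

    posteriors -- list of P(Site | vi), one value per feature (assumed given)
    prior      -- P(Site), the prior probability of a functional site
    """
    if not 0.0 < prior <= 1.0:
        raise ValueError("prior must be in (0, 1]")
    score = 0.0
    for posterior in posteriors:
        if posterior <= 0.0:
            raise ValueError("posteriors must be positive to take a logarithm")
        score += math.log(posterior / prior)
    return score
```

A feature whose posterior equals the prior contributes zero, so only features that shift the probability of a site up or down change the score.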

Training set selection
For building the SeqFEATURE models, we used PROSITE patterns not labeled as having a high probability of occurrence, and extracted structural examples of these patterns from the PDB. Only patterns with 5 or more unique structural examples were used to build models. PROSITE patterns are regular expressions in which functional residues are usually designated. We defined possible functional site centers to be the functional atom(s) of annotated functional residues (e.g. the gamma oxygen of serine, or SER.OG) in each pattern; for patterns with multiple functional residues or multiple functional atoms, this resulted in multiple models for the same PROSITE pattern. For example, the PROSITE pattern EGF_1 has functional cysteine residues at positions 1, 3, and 7 of the motif, so we built models EGF_1.1.CYS.SG, EGF_1.3.CYS.SG, and EGF_1.7.CYS.SG.
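The naming scheme in the EGF_1 example can be written out as a small helper. The function and its input layout are hypothetical and only mirror the pattern.position.RESIDUE.ATOM convention shown above.

```python
def model_names(pattern_id, functional_sites):
    """Build one model name per functional atom of a pattern.

    pattern_id       -- PROSITE pattern identifier, e.g. "EGF_1"
    functional_sites -- list of (motif_position, residue_name, atom_names)
                        tuples (hypothetical layout for this sketch)
    """
    names = []
    for position, residue, atoms in functional_sites:
        for atom in atoms:
            names.append(f"{pattern_id}.{position}.{residue}.{atom}")
    return names
```

Called with "EGF_1" and cysteine SG atoms at positions 1, 3, and 7, this returns the three model names listed above.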

Positive training sets consist of PDB coordinates of functional atoms like the ones described, extracted from structures containing a particular pattern. Negative training sets were selected randomly and automatically from the rest of the PDB to be of identical residue and atom makeup and similar atom density to positive sites. We used a thousand times as many negative sites as positive sites for each model, when possible.
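One way to picture the negative selection is sketched below. The matching rule for "similar atom density" is not specified here, so the relative tolerance around the mean positive density is an assumption, as are the function name and the dictionary keys.

```python
import random


def select_negatives(candidates, positives, ratio=1000, density_tolerance=0.2, seed=0):
    """Randomly pick negative sites matching the positives' residue and atom.

    candidates, positives -- lists of dicts with keys "residue", "atom",
                             and "density" (hypothetical layout)
    ratio                 -- negatives wanted per positive (1000 in the text)
    density_tolerance     -- assumed relative window around the mean positive
                             density; the actual criterion is not given
    """
    target = (positives[0]["residue"], positives[0]["atom"])
    mean_density = sum(p["density"] for p in positives) / len(positives)
    low = mean_density * (1.0 - density_tolerance)
    high = mean_density * (1.0 + density_tolerance)
    pool = [
        c for c in candidates
        if (c["residue"], c["atom"]) == target and low <= c["density"] <= high
    ]
    # "When possible": fall back to the whole pool if it is too small.
    n_wanted = min(ratio * len(positives), len(pool))
    return random.Random(seed).sample(pool, n_wanted)
```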

Model cross-validation and evaluation
We evaluated each model using 5-fold cross-validation. The positive and negative training sets were partitioned randomly into five blocks; four were used for building the model, which was then tested on the left-out block. This process was performed five times, so that each block was left out once. To compare results across runs, scores were normalized to z-scores according to the mean and standard deviation of the score distribution. Performance is measured using a receiver operating characteristic (ROC) curve, in which the z-score cutoff is varied and the true positive rate (sensitivity) is plotted against the false positive rate (1-specificity). The area under the curve (AUC) estimates the probability that a random positive site will be scored higher than a random negative site, and provides a summary measure of the performance of the model. We also provide plots of positive predictive value vs. sensitivity. The final models were built using all of the training examples, which should result in performances at least as good as those obtained during cross-validation.
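The evaluation steps above can be sketched with three small helpers: a random partition into blocks, z-score normalization, and an AUC computed directly from its probabilistic definition. All names are hypothetical. The use of the population standard deviation and the half-credit rule for tied scores are assumptions; the text does not say which conventions were used.

```python
import random
import statistics


def k_fold_blocks(items, k=5, seed=0):
    """Shuffle `items` and deal them into k blocks of near-equal size."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def z_scores(scores):
    """Normalize scores by the mean and (population) standard deviation."""
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)
    if sd == 0:
        raise ValueError("all scores are identical; z-scores are undefined")
    return [(s - mean) / sd for s in scores]


def auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs in which the positive wins.

    Ties count as half a win. This equals the area under the ROC curve.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))
```

In each round, one block from `k_fold_blocks` would serve as the test set and the remaining four as the training set. The pairwise AUC is quadratic in the number of sites, which is fine for a sketch but slow for very large negative sets.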

Protein Data Bank scan
Any PDB structure can be scanned with any FEATURE model to generate a list of predictions. We scanned every structure in the March 2006 release of the PDB using every model in the SeqFEATURE library. The scans were performed on an Apple cluster consisting of six nodes (14 processors) running the Yellow Dog Linux operating system. For our analysis, hits were defined as sites that scored above the 100% specificity cutoff for a given model, and these hits were evaluated further.
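To make the specificity cutoff concrete, the sketch below derives a cutoff from a set of negative-site scores and keeps only the sites that score above it. It assumes the cutoff is read off the empirical distribution of negative scores, which is one plausible reading rather than a documented procedure; the function names are hypothetical.

```python
import math


def specificity_cutoff(negative_scores, specificity=1.0):
    """Smallest score with at least `specificity` of negatives at or below it.

    With specificity=1.0 this is simply the highest negative score.
    """
    ordered = sorted(negative_scores)
    # round() guards against floating-point noise before taking the ceiling
    n_at_or_below = max(math.ceil(round(specificity * len(ordered), 9)), 1)
    return ordered[n_at_or_below - 1]


def hits(site_scores, cutoff):
    """Return the sites whose score is strictly greater than the cutoff.

    site_scores -- list of (site_label, score) pairs
    """
    return [site for site, score in site_scores if score > cutoff]
```

At 100% specificity no negative site used to set the cutoff would be reported as a hit, which trades sensitivity for a very low false positive rate.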

TargetDB and structural genomics
We focused part of our analysis on structures listed in TargetDB, the database for targets from structural genomics centers worldwide. Using the headers of released PDB files, we filtered for those that lacked functional annotation; for example, "STRUCTURAL GENOMICS", "UNKNOWN FUNCTION", "HYPOTHETICAL PROTEIN", and combinations of these phrases were often the only functional descriptions in the PDB headers. After scanning these structures with the SeqFEATURE models, we extracted all sites that scored highly for well-performing models. The list of extracted sites is currently available for download as a flat file.
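A header filter of this kind could look like the sketch below. It treats a header as unannotated when nothing remains after the generic phrases and separators are removed. The phrase list contains only the three examples quoted above, and the function name and separator set are assumptions; the actual filter may have used more phrases or different rules.

```python
import re

# Only the phrases quoted in the text; a real filter may need more.
GENERIC_PHRASES = ("STRUCTURAL GENOMICS", "UNKNOWN FUNCTION", "HYPOTHETICAL PROTEIN")


def lacks_functional_annotation(header_text):
    """True if the header holds nothing beyond the generic phrases."""
    remaining = header_text.upper()
    for phrase in GENERIC_PHRASES:
        remaining = remaining.replace(phrase, " ")
    # Whatever survives besides whitespace and separators is real annotation.
    return re.sub(r"[\s,;/-]+", "", remaining) == ""
```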

The WebFEATURE server
All of the models and all of the scan data are available online through this interactive web server. The data is currently available for download as flat files. Users may also perform real-time scans of structures using any of the models in the SeqFEATURE library. The only input needed is the PDB structure and the model (i.e. function) for which to scan. The list of sites to evaluate is determined by the type of residue and atom that the model uses. Feature vectors are then constructed for each site in the list, and these are individually scored against the model feature distribution. A list of the sites with their scores is returned to the user.
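The site-selection step can be illustrated by reading ATOM records from a PDB file and keeping the atoms that match a model's residue and atom type. This is a sketch, not the server's code: the function name and the returned dictionary layout are hypothetical, and it relies only on the fixed column positions of the PDB ATOM record.

```python
def candidate_sites(pdb_lines, residue_name, atom_name):
    """Collect atoms matching a model's residue and atom type.

    pdb_lines    -- iterable of lines from a PDB file
    residue_name -- three-letter residue code, e.g. "SER"
    atom_name    -- atom name, e.g. "OG"
    """
    sites = []
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue  # this sketch skips HETATM and all other record types
        # Fixed-width fields: atom name in columns 13-16, residue in 18-20.
        if line[12:16].strip() != atom_name:
            continue
        if line[17:20].strip() != residue_name:
            continue
        sites.append({
            "chain": line[21],
            "residue_number": int(line[22:26]),
            "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        })
    return sites
```

For a model such as EGF_1.1.CYS.SG, calling this with "CYS" and "SG" would return every cysteine gamma sulfur in the structure as a candidate site to be scored.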