Automated function prediction in protein structures

WebFEATURE is a web-based implementation of a machine learning algorithm for functional site recognition in protein structures. A few of the models provided are built from hand-curated training sets, but the majority come from the SeqFEATURE library of models [Wu et al. 2008]. SeqFEATURE is a method for automatically selecting training sets and building structural models within FEATURE, an existing framework for modeling functional sites. SeqFEATURE extends this framework by using 1-D sequence motifs as seeds for generating training sets of structural examples [Liang et al. 2003].

The FEATURE system

FEATURE is a method for building statistical 3-D models of the local environment around a functional site, given training sets of positive and negative examples. These models can then be used to evaluate test sites and predict whether they have a particular function. Briefly, FEATURE calculates a number of physicochemical properties at varying radial distances from the site center and creates a feature vector containing the value of each property in each radial volume. Both atomic and residue-level properties are examined, allowing the functional site to be described at multiple levels. The structural model is constructed by comparing the statistical distributions of feature vectors between positive and negative sites. Each property is then described as significantly enriched or significantly depleted in the positive sites compared to the negative sites, using the Wilcoxon rank sum test.
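The radial feature vector described above can be sketched roughly as follows. This is a minimal illustration, not the FEATURE implementation: the shell width, the number of shells, the property names, and the helper names `shell_index` and `feature_vector` are all assumptions made for the example.

```python
import math
from collections import Counter

# Assumed shell layout for illustration only: six concentric shells,
# each 1.25 Angstrom thick, centered on the site.
SHELL_WIDTH = 1.25
NUM_SHELLS = 6


def shell_index(center, point):
    """Return the index of the shell containing point, or None if it lies outside."""
    distance = math.dist(center, point)
    index = int(distance // SHELL_WIDTH)
    return index if index < NUM_SHELLS else None


def feature_vector(center, atoms, properties):
    """Count how often each property occurs in each radial shell.

    atoms:      list of (xyz, set_of_property_names) pairs (hypothetical input format)
    properties: ordered list of property names
    Returns a flat list of length NUM_SHELLS * len(properties),
    ordered shell by shell.
    """
    counts = Counter()
    for xyz, props in atoms:
        shell = shell_index(center, xyz)
        if shell is None:
            continue  # atom is beyond the outermost shell
        for prop in props:
            counts[(shell, prop)] += 1
    return [counts[(s, p)] for s in range(NUM_SHELLS) for p in properties]
```

Feature vectors built this way for the positive and negative training sites could then be compared property by property, for example with a rank sum test, to decide which entries differ significantly.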

WebFEATURE 4.0 uses both naive Bayes [Liang et al. 2003] and support vector machine (SVM) [Buturović et al. 2014] classifiers to label candidate sites as functional sites or non-sites across an ensemble of protein functional classes. With the naive Bayes scoring function, WebFEATURE evaluates the likelihood that a test site has the function described by a particular model. A feature vector is created for the test site in the same way as for the training sites and is scored assuming that the individual features vi are independent:

Score = Sum(i) log [P(Site | vi)/P(Site)]

A score cutoff for classifying a site can be chosen for each model according to the user's desired precision requirements; the default is 99% precision. WebFEATURE also applies libSVM [Chang et al. 2011] to provide a probabilistic predictive model. The scores from SVM prediction are probabilities for classification and the cutoffs are user-controllable, defaulting to 99% precision. The FEATURE system is described in more detail elsewhere (see publications).
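As a rough sketch of the naive Bayes score given above, the function below sums the log ratios of posterior to prior over the features of one test site. It assumes the per-feature posteriors P(Site | vi) have already been estimated from the training data; the names `naive_bayes_score` and `is_site` are hypothetical, and how WebFEATURE estimates the posteriors is not shown here.

```python
import math


def naive_bayes_score(posteriors, prior):
    """Return sum_i log[P(Site | v_i) / P(Site)] for one test site.

    posteriors: iterable of P(Site | v_i), one per feature, each > 0
    prior:      P(Site), strictly between 0 and 1
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    return sum(math.log(p / prior) for p in posteriors)


def is_site(score, cutoff):
    """Classify a test site by comparing its score to a chosen cutoff."""
    return score >= cutoff
```

A feature whose posterior equals the prior contributes zero to the score, so only features that shift the probability of being a site move the total up or down.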

Training set selection

For building the SeqFEATURE models, we used PROSITE patterns not labeled as having a high probability of occurrence, and extracted structural examples of these patterns from the PDB. Only patterns with 5 or more unique structural examples were used to build models. PROSITE patterns are regular expressions in which functional residues are usually designated. We defined possible functional site centers to be the functional atom(s) of annotated functional residues (e.g. the gamma oxygen of serine, or SER.OG) in each pattern; for patterns with multiple functional residues or multiple functional atoms, this resulted in multiple models for the same PROSITE pattern. For example, the PROSITE pattern EGF_1 has functional cysteine residues at positions 1, 3, and 7 of the motif, so we built models EGF_1.1.CYS.SG, EGF_1.3.CYS.SG, and EGF_1.7.CYS.SG.
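To illustrate how a PROSITE pattern can be matched against sequences and how the model names above are composed, here is a small sketch. The converter covers only the common PROSITE syntax elements (x, [..], {..}, repeat counts, and the < and > anchors outside brackets), the example pattern is made up for illustration and is not a real PROSITE entry, and the function names are hypothetical.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern string into a Python regular expression."""
    pattern = pattern.rstrip(".")  # PROSITE patterns end with a period
    parts = []
    for element in pattern.split("-"):
        # {ABC} means "any residue except A, B, or C"
        element = element.replace("{", "[^").replace("}", "]")
        # (n) and (n,m) are repeat counts
        element = element.replace("(", "{").replace(")", "}")
        # x means "any residue"
        element = element.replace("x", ".")
        # < and > anchor the pattern to the sequence termini
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


def model_name(pattern_id, position, residue, atom):
    """Compose a model name such as EGF_1.1.CYS.SG."""
    return f"{pattern_id}.{position}.{residue}.{atom}"


# Illustrative pattern only; not taken from PROSITE.
example_regex = prosite_to_regex("C-x(2)-[ST]-{P}-x(2,4)-H.")
match = re.search(example_regex, "AACGGSAKKH")
```

Here `match.start()` gives the offset of the first motif residue in the sequence, which is what would be needed to locate the annotated functional residue in a structure.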

Positive training sets consist of PDB coordinates of functional atoms like the ones described, extracted from structures containing a particular pattern. Negative training sets were selected randomly and automatically from the rest of the PDB to be of identical residue and atom makeup and similar atom density to positive sites. We sampled around 50,000 negative sites for each model, when possible.

Model cross-validation and evaluation

We evaluated each model using stratified 5-fold cross-validation. The positive and negative training sets were each randomly partitioned into five equal blocks; four blocks were used to train the model, which was then tested on the remaining block. This process was repeated five times so that each block was left out once. To compare results across runs, naive Bayes (NB) scores were normalized to z-scores using the mean and standard deviation of the score distribution. Performance was measured by plotting precision against recall and comparing recall at 99% precision. The final models were built using all of the training examples, which should yield performance at least as good as that obtained during cross-validation.
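The evaluation steps above can be sketched with the standard library alone. This is an illustrative outline rather than the code used to evaluate the models: the function names are hypothetical, labels are assumed to be 1 for positive and 0 for negative sites, and the z-score uses the population standard deviation, which is an assumption.

```python
import random
import statistics


def stratified_folds(labels, k=5, seed=0):
    """Assign each example to one of k folds, splitting each class evenly."""
    rng = random.Random(seed)
    folds = [None] * len(labels)
    for cls in set(labels):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        for rank, i in enumerate(indices):
            folds[i] = rank % k
    return folds


def z_scores(scores):
    """Normalize scores by their mean and (population) standard deviation."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)  # must be nonzero
    return [(s - mean) / sd for s in scores]


def recall_at_precision(scores, labels, target=0.99):
    """Return the highest recall over all score cutoffs with precision >= target."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_positives = sum(labels)
    tp = fp = 0
    best = 0.0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Evaluate only where the score changes, so ties share one cutoff.
        if i + 1 < len(ranked) and ranked[i + 1][0] == score:
            continue
        if tp / (tp + fp) >= target:
            best = max(best, tp / total_positives)
    return best
```

In a cross-validation loop, the examples with fold number f would be held out for testing while the remaining four folds train the model, for f from 0 to 4.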

The WebFEATURE server

All of the models are available online through this interactive web server. The data is currently available for download as flat files. Users may also perform real-time scans of structures using any of the models in the SeqFEATURE library. The only inputs needed are the PDB structure and the model (i.e. function) for which to scan. The list of sites to evaluate is determined by the type of residue and atom that the model uses. Feature vectors are then constructed for each site in the list, and these are individually scored against the model feature distribution. A list of the sites with their scores is returned to the user.
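The scan described above can be outlined as follows. This sketch is not the server's code: it reads candidate sites from the fixed-column ATOM records of a PDB file, and `score_fn` is a hypothetical stand-in for building a feature vector at each site and scoring it against a model.

```python
def candidate_sites(pdb_lines, residue, atom):
    """Collect (chain, residue_number, xyz) for every atom matching the model's type."""
    sites = []
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        # PDB fixed columns: atom name 13-16, residue name 18-20,
        # chain 22, residue number 23-26, x/y/z 31-54.
        if line[12:16].strip() != atom or line[17:20].strip() != residue:
            continue
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        sites.append((line[21], int(line[22:26]), xyz))
    return sites


def scan(pdb_lines, residue, atom, score_fn):
    """Score every candidate site and return (chain, residue_number, score), best first."""
    results = [
        (chain, number, score_fn(xyz))
        for chain, number, xyz in candidate_sites(pdb_lines, residue, atom)
    ]
    return sorted(results, key=lambda item: item[2], reverse=True)
```

For a model such as EGF_1.1.CYS.SG, the call would use residue "CYS" and atom "SG", so only cysteine gamma sulfur atoms in the structure are evaluated.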