SCMHBP: Prediction and analysis of heme binding proteins using propensity scores of dipeptides (#19)
Background: Heme binding proteins (HBPs) are metalloproteins that contain a heme ligand (an iron-porphyrin complex) as the prosthetic group. Several computational methods have been proposed to predict heme binding residues in order to understand the interactions between heme and its host proteins. However, few in silico methods have been reported for identifying HBPs themselves.
Results: This study proposes SCMHBP, a method based on the scoring card method (SCM), for predicting and analyzing HBPs from sequences. First, a balanced dataset of 747 HBPs (selected using the Gene Ontology term GO:0020037) and 747 non-HBPs (selected from 92,309 putative non-HBPs) with a sequence identity of 25% was established. Next, a set of propensity scores of amino acids and dipeptides to be HBPs is estimated using SCM by maximizing the prediction accuracy of SCMHBP. Finally, informative physicochemical properties are identified by utilizing the estimated propensity scores for categorizing HBPs. The training accuracy and the mean test accuracy of SCMHBP on three independent test datasets are 85.90% and 83.76%, respectively. SCMHBP performs well compared with methods such as support vector machine (SVM), the J48 decision tree, and Bayes classifiers. Putative non-HBPs with high sequence propensity scores are potential HBPs that can be further validated. The propensity scores of individual amino acids and dipeptides are examined to recognize the interactions between heme and its host proteins. Moreover, the following characteristics of HBPs are derived from the propensity scores: 1) aromatic side chains are important to the performance of specific HBP functions; 2) a hydrophobic environment plays an important role in the interaction between heme and binding sites; and 3) the HBP as a whole has low flexibility, whereas the heme binding residues are relatively flexible.
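To illustrate how dipeptide propensity scores can be turned into a sequence-level prediction, the following is a minimal sketch and not the SCMHBP source code. It assumes that a sequence score is the composition-weighted sum of the 400 dipeptide propensity scores and that a protein is called an HBP when this score exceeds a threshold. The function names, the example scores, and the threshold are all hypothetical; the optimization step that estimates the scores is not shown.

```python
from itertools import product

# The 20 standard amino acids and the 400 dipeptides built from them.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return the relative frequency of each dipeptide in the sequence.

    Overlapping pairs are counted; pairs containing a non-standard
    residue are skipped (an assumption made for this sketch).
    """
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return {d: 0.0 for d in DIPEPTIDES}
    return {d: c / total for d, c in counts.items()}


def scm_score(sequence, propensity):
    """Composition-weighted sum of dipeptide propensity scores.

    `propensity` maps a dipeptide to its score; dipeptides missing
    from the mapping contribute 0 (hypothetical default).
    """
    composition = dipeptide_composition(sequence)
    return sum(composition[d] * propensity.get(d, 0.0) for d in DIPEPTIDES)


def predict_hbp(sequence, propensity, threshold):
    """Call the sequence an HBP when its score exceeds the threshold."""
    return scm_score(sequence, propensity) > threshold


if __name__ == "__main__":
    # Made-up scores for illustration only; real scores would be estimated
    # from training data.
    toy_propensity = {"AC": 600.0, "CA": 300.0}
    print(scm_score("ACAC", toy_propensity))
    print(predict_hbp("ACAC", toy_propensity, threshold=450.0))
```

Because the score is a plain weighted sum, each dipeptide's contribution can be read off directly, which is what makes the propensity scores usable for analysis as well as prediction.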
Conclusions: SCMHBP aims at discovering knowledge for further understanding of HBPs rather than only pursuing high prediction accuracy. The datasets and source code of SCMHBP are available at http://iclab.life.nctu.edu.tw/SCMHBP/.