An empirical study of different approaches for protein classification


Many domains would benefit from reliable and efficient systems for automatic protein classification. Recent studies in this area have focused on new methods for extracting features from a protein that work well for specific problems. These methods, however, do not generalize and have proven useful only in a few domains. Our aim in this paper is to evaluate several feature extraction approaches for representing proteins and to test them across multiple datasets. Four types of protein representation are evaluated: those starting from the position-specific scoring matrix (PSSM) of the protein, those derived from the amino-acid sequence, two matrix representations of the protein, and features taken from the 3D tertiary structure of the protein. Some new variants of protein descriptors are also tested in this work. Our goal is to develop, experimentally, a system that performs well across a number of benchmark datasets by comparing and combining different descriptors taken from these protein representations. Each descriptor is used to train a separate support vector machine (SVM), and the results are combined by sum rule. Some stand-alone descriptors work well on some datasets but not on others. Fusing the descriptors, however, yields performance that holds up across all the tested datasets, in some cases exceeding the state of the art. The MATLAB code used in this paper will be available at
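The sum-rule fusion mentioned above can be illustrated with a minimal Python sketch. This is not the authors' MATLAB code: the function name `sum_rule`, the descriptor labels, and the score values are hypothetical, and the per-class scores are assumed to be on a comparable scale across classifiers (for example, already normalized SVM outputs).

```python
from typing import Dict, List


def sum_rule(per_descriptor_scores: Dict[str, List[List[float]]]) -> List[int]:
    """Fuse several classifiers by summing their per-class scores.

    per_descriptor_scores maps a descriptor name to a score matrix of shape
    (n_samples, n_classes), one matrix per separately trained classifier.
    Returns the index of the class with the highest summed score per sample.
    """
    matrices = list(per_descriptor_scores.values())
    n_samples = len(matrices[0])
    n_classes = len(matrices[0][0])

    predictions = []
    for i in range(n_samples):
        # Add up the scores that every classifier assigns to each class.
        fused = [sum(m[i][c] for m in matrices) for c in range(n_classes)]
        # Pick the class whose summed score is largest.
        predictions.append(max(range(n_classes), key=lambda c: fused[c]))
    return predictions


# Hypothetical scores for two proteins and two classes (illustrative only).
example_scores = {
    "pssm": [[0.9, 0.1], [0.4, 0.6]],
    "sequence": [[0.2, 0.8], [0.3, 0.7]],
    "structure": [[0.7, 0.3], [0.6, 0.4]],
}

fused_predictions = sum_rule(example_scores)
```

In this toy input the sequence-based scores alone would assign the first protein to the second class, while the summed scores favor the first class, which is the kind of disagreement that fusion is meant to smooth out.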
