A set of descriptors for identifying the protein-drug interaction in cellular networking


The study of protein-drug interactions is a significant issue for drug development. Unfortunately, it is both expensive and time-consuming to perform physical experiments to determine whether a drug and a protein are interacting with each other. Some previous attempts to design an automated system to perform this task were based on the knowledge of the 3D structure of a protein, which is not always available in practice. With the availability of protein sequences generated in the post-genomic age, however, a sequence-based solution to deal with this problem is necessary. Following other works in this area, we propose a new machine learning system based on several protein descriptors extracted from several protein representations, such as, variants of the position specific scoring matrix (PSSM) of proteins, the amino-acid sequence, and a matrix representation of a protein. The prediction engine is operated by an ensemble of support vector machines (SVMs), with each SVM trained on a specific descriptor and the results of each SVM combined by sum rule. The overall success rate achieved by our final ensemble is notably higher than previous results obtained on the same datasets using the same testing protocols reported in the literature. MATLAB code and the datasets used in our experiments are freely available for future comparison at http://www.dei.unipd.it/node/2357.

Keywords drug-target interaction; amino-acid sequence; position specific scoring matrix; machine learning; ensemble of classifiers, support vector machines.

[full paper]