Combining visual and acoustic features for bird species classification


This paper presents a novel and effective approach to automated audio classification based on the fusion of different sets of features, both visual and acoustic. A number of acoustic and visual features of sounds are evaluated and compared, then fused in an ensemble that achieves higher classification accuracy than other state-of-the-art approaches. The visual features are derived from the audio file: they are extracted from images built from different spectrograms, a gammatonegram, and a rhythm image. These images are divided into subwindows, from which a set of texture descriptors is extracted. For each feature descriptor a separate Support Vector Machine (SVM) is trained, and the outputs of the SVMs are summed to reach the final decision. The proposed ensemble is evaluated on three well-known databases for music genre classification (the Latin Music Database, the ISMIR 2004 database, and the GTZAN genre collection) and on a dataset for bird vocalization recognition. The superior performance of the proposed system is obtained without any ad hoc parameter optimization (i.e., the same ensemble of classifiers and the same parameter settings are used on all four datasets). The MATLAB code for the ensemble of classifiers and for feature extraction will be made publicly available to other researchers for future comparisons.
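
The two structural steps described above, dividing an image into subwindows and fusing the classifiers by summing their outputs, can be illustrated with a minimal Python sketch. This is not the authors' MATLAB code. The function names, the grid size, and the example scores are hypothetical, and the sketch assumes that every classifier returns one score per class on a comparable scale; texture-descriptor extraction and SVM training are omitted.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def split_into_subwindows(image: Matrix, n_rows: int, n_cols: int) -> List[Matrix]:
    """Split a 2-D image (a list of rows) into an n_rows x n_cols grid.

    Window edges are placed by integer division, so window sizes differ
    by at most one row or column. The grid size is an assumption of this
    sketch; the abstract does not state how many subwindows are used.
    """
    height, width = len(image), len(image[0])
    row_edges = [i * height // n_rows for i in range(n_rows)] + [height]
    col_edges = [j * width // n_cols for j in range(n_cols)] + [width]
    windows = []
    for i in range(n_rows):
        for j in range(n_cols):
            rows = image[row_edges[i]:row_edges[i + 1]]
            windows.append([row[col_edges[j]:col_edges[j + 1]] for row in rows])
    return windows


def sum_rule_fusion(score_sets: Sequence[Sequence[float]]) -> int:
    """Sum the per-class scores of all classifiers and return the index
    of the class with the largest total (the sum rule)."""
    n_classes = len(score_sets[0])
    totals = [sum(scores[c] for scores in score_sets) for c in range(n_classes)]
    return max(range(n_classes), key=totals.__getitem__)


# Toy 4x6 "image" whose pixel values are simply their running index.
image = [[r * 6 + c for c in range(6)] for r in range(4)]
windows = split_into_subwindows(image, n_rows=2, n_cols=3)

# Hypothetical per-class scores from three classifiers, one per descriptor.
scores = [
    [0.20, 0.50, 0.30],
    [0.60, 0.30, 0.10],
    [0.10, 0.45, 0.45],
]
predicted_class = sum_rule_fusion(scores)
```

In this toy example the second classifier alone would choose class 0, but the summed scores favor class 1, which shows how the sum rule lets several descriptors outvote a single one.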

Keywords: audio classification, texture, image processing, acoustic features, ensemble of classifiers, pattern recognition
