Abstract

Over the past two decades, CNN architectures have produced compelling models of sound perception and cognition, learning hierarchical organizations of features. Analogous to successes in computer vision, audio feature classification can be optimized for a particular task of interest over a wide variety of datasets and labels. In fact, similar architectures designed for image understanding have proven effective for acoustic scene analysis. Here we propose applying Transformer-based architectures without convolutional layers to raw audio signals. On the standard Free Sound 50K dataset, comprising 200 categories, our model outperforms convolutional models to produce state-of-the-art results. This is significant because, unlike in natural language processing and computer vision, we do not rely on unsupervised pre-training to outperform convolutional architectures. On the same training set, we show a significant improvement on mean average precision benchmarks. We further improve the performance of Transformer architectures by using techniques such as pooling, inspired by convolutional networks designed in the past few years. In addition, we show how multi-rate signal processing ideas inspired by wavelets can be applied to the Transformer embeddings to improve the results. We also show that our model learns a non-linear, non-constant-bandwidth filter bank, which provides an adaptable time-frequency front-end representation for the task of audio understanding, different from that of other tasks such as pitch estimation.
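To make the pooling idea concrete, the following is a minimal sketch of two of the steps the abstract mentions: cutting raw audio into fixed-length frames and average-pooling a sequence of vectors along time. It is not the paper's implementation. The function names `frame_signal` and `average_pool_time`, the frame length, and the pooling stride are all hypothetical choices made for illustration, and the raw frames stand in for the learned embeddings that a real model would compute.

```python
def frame_signal(signal, frame_len):
    """Split a list of samples into non-overlapping frames.

    Trailing samples that do not fill a whole frame are dropped.
    (Hypothetical helper; the frame length is an assumption.)
    """
    n_frames = len(signal) // frame_len
    return [signal[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def average_pool_time(embeddings, stride=2):
    """Average each group of `stride` consecutive vectors along time.

    This shortens the sequence by a factor of `stride`, similar in spirit
    to pooling layers in convolutional networks. (Hypothetical helper.)
    """
    pooled = []
    for start in range(0, len(embeddings) - stride + 1, stride):
        group = embeddings[start:start + stride]
        dim = len(group[0])
        pooled.append([sum(vec[d] for vec in group) / stride for d in range(dim)])
    return pooled


# Toy "waveform" of 16 samples; real input would be raw audio.
signal = [float(i) for i in range(16)]
frames = frame_signal(signal, frame_len=4)      # 4 frames of 4 samples each
pooled = average_pool_time(frames, stride=2)    # sequence length halved to 2
```

In a full model, each frame would presumably be projected to an embedding and passed through Transformer layers before any pooling. The sketch only shows how pooling reduces the temporal resolution of the sequence.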
