Acoustic scene classification is the task of assigning a scene label to an audio recording based on the environment in which it was recorded. Although deep learning models perform well in this field, they usually rely on hardware platforms with high computing power, which limits their adoption in practical applications. To address this issue, we propose a lightweight High-Resolution network (Lite-HRNet) with only 76.224K trainable parameters. The structure is based on Lite-Net, a lightweight model built from inverted residual modules. On top of Lite-Net, we introduce the High-Resolution (HR) structure, which maintains high resolution along the frequency axis and fuses high-resolution and low-resolution features in parallel while keeping complexity low. In addition, a coordinate attention (CA) mechanism is introduced to direct the network's focus towards critical information. Experimental results show that Lite-HRNet improves the classification accuracy across the scenes of the TAU Urban Acoustic Scenes 2022 Mobile Development dataset, achieving an average accuracy of 52.4%, which is 9.5 percentage points higher than the DCASE baseline system.
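
To illustrate why inverted residual modules suit a small parameter budget, the following minimal sketch counts the convolution weights of one such block and of a standard convolution. It assumes a MobileNetV2-style layout (1x1 expansion, depthwise convolution, 1x1 projection), ignores biases and normalization layers, and uses hypothetical channel counts and a hypothetical expansion factor; none of these values are taken from Lite-Net or Lite-HRNet.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias and normalization ignored)."""
    return c_in * c_out * k * k


def inverted_residual_params(c_in, c_out, expansion, k=3):
    """Weights in a MobileNetV2-style inverted residual block (assumed layout)."""
    hidden = c_in * expansion
    expand = c_in * hidden        # 1x1 pointwise expansion
    depthwise = hidden * k * k    # k x k depthwise conv, one filter per channel
    project = hidden * c_out      # 1x1 pointwise projection
    return expand + depthwise + project


# Hypothetical layer sizes, chosen only for illustration.
standard = conv_params(32, 32, 3)
inverted = inverted_residual_params(32, 32, expansion=2)
print(standard, inverted)  # 9216 4672
```

The saving depends on the expansion factor: with these channel counts, an expansion factor of 6 would give 14016 weights, more than the standard convolution, so a lightweight design has to keep the expansion small.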