Paper
4 March 2024 Environmental sound classification using vision transformer
Haoran Hong, Junfeng Li, Xingxing Li
Proceedings Volume 12981, Ninth International Symposium on Sensors, Mechatronics, and Automation System (ISSMAS 2023); 129811Z (2024) https://doi.org/10.1117/12.3014876
Event: 9th International Symposium on Sensors, Mechatronics, and Automation (ISSMAS 2023), 2023, Nanjing, China
Abstract
Classification of environmental sounds plays a key role in applications such as surveillance systems and crime detection, since sounds recorded in a real environment carry significant information. Deep learning models, such as convolutional neural networks, have proven very useful for environmental sound classification (ESC). Recent work has shown that Vision Transformer (ViT) models can achieve comparable or even superior performance to convolutional networks on image classification tasks. In this paper, an ESC method based on the Vision Transformer is proposed. We represent sound files as images, namely log Mel spectrogram images, and train a Vision Transformer model on these image representations. The method obtains an average classification accuracy of 94.6633%, which indicates that the proposed approach performs well on ESC.
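The abstract's front end, turning a sound file into a log Mel spectrogram image that a ViT can consume, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the sample rate, FFT size, hop length, number of Mel bands, and the 16x16 patch size are all assumptions chosen for illustration, and the helper names (`mel_filterbank`, `log_mel_spectrogram`, `to_patches`) are hypothetical. The transformer itself is not shown; only the image representation and the patch splitting that a ViT typically applies to its input.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style Mel scale (one common convention; an assumption here).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters whose centres are evenly spaced on the Mel scale.
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, ctr, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (fft_freqs - lo) / (ctr - lo)
        falling = (hi - fft_freqs) / (hi - ctr)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(y, sr, n_fft=1024, hop=512, n_mels=64):
    # Short-time power spectrum with a Hann window, projected onto Mel bands,
    # then converted to decibels. Output shape: (n_mels, n_frames).
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop: i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    return 10.0 * np.log10(mel + 1e-10).T

def to_patches(img, patch=16):
    # Crop to a multiple of the patch size, then flatten each non-overlapping
    # patch into one row, as a ViT does before its linear embedding.
    h, w = img.shape
    h, w = h - h % patch, w - w % patch
    p = img[:h, :w].reshape(h // patch, patch, w // patch, patch)
    return p.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

# Usage: a synthetic one-second 440 Hz tone stands in for a real recording.
sr = 22050
t = np.arange(sr) / sr
spec = log_mel_spectrogram(np.sin(2 * np.pi * 440.0 * t), sr)
patches = to_patches(spec)
print(spec.shape, patches.shape)
```

In practice a library such as librosa or torchaudio would usually compute the spectrogram, and the resulting image would be resized to the input resolution the chosen ViT expects; the paper's abstract does not state which settings were used.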
(2024) Published by SPIE. Downloading of the abstract is permitted for personal use only.
Haoran Hong, Junfeng Li, and Xingxing Li "Environmental sound classification using vision transformer", Proc. SPIE 12981, Ninth International Symposium on Sensors, Mechatronics, and Automation System (ISSMAS 2023), 129811Z (4 March 2024); https://doi.org/10.1117/12.3014876