2 May 2023 Speaker diarization based on multi-timescale feature fusion
Futian Wang, Lailong Chen
Proceedings Volume 12642, Second International Conference on Electronic Information Engineering, Big Data, and Computer Technology (EIBDCT 2023); 126422K (2023) https://doi.org/10.1117/12.2674733
Event: Second International Conference on Electronic Information Engineering, Big Data and Computer Technology (EIBDCT 2023), 2023, Xishuangbanna, China
Abstract
Deep and convolutional neural networks capture speaker characteristics well, and the ECAPA-TDNN model has shown outstanding performance in both speaker verification and speaker diarization. In this paper, at the speech segmentation stage, we uniformly re-segment the speech at multiple time scales, starting from oracle voice activity detection. We also modify the ECAPA-TDNN architecture by adding a RepVGG module to extract richer features, and then aggregate all of the outputs. Finally, we use DOVER-Lap to combine the clustering results of the multiple schemes into the final temporal labeling. Our best result is a diarization error rate of 1.91% on the classic AMI meeting corpus.
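The multi-timescale segmentation step can be illustrated with a minimal sketch. It assumes the oracle voice activity detection is available as a list of (start, end) speech regions in seconds, and that each time scale is a (window, shift) pair. The function names and the example window and shift values are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def uniform_segments(region: Segment, window: float, shift: float) -> List[Segment]:
    """Slide a fixed-length window over one speech region.

    The last segment is truncated at the region boundary, so a region
    shorter than the window yields a single short segment.
    """
    start, end = region
    segments: List[Segment] = []
    i = 0
    while True:
        # Compute the start from the index to avoid accumulating float error.
        t = start + i * shift
        if t >= end:
            break
        seg_end = min(t + window, end)
        segments.append((t, seg_end))
        if seg_end >= end:
            break
        i += 1
    return segments


def multi_scale_segments(
    vad_regions: List[Segment],
    scales: List[Tuple[float, float]],
) -> Dict[float, List[Segment]]:
    """Re-segment every VAD region once per (window, shift) scale.

    Returns a mapping from window length to the segments at that scale.
    """
    result: Dict[float, List[Segment]] = {}
    for window, shift in scales:
        segs: List[Segment] = []
        for region in vad_regions:
            segs.extend(uniform_segments(region, window, shift))
        result[window] = segs
    return result


# Hypothetical example: one 2-second speech region, two assumed scales.
example = multi_scale_segments([(0.0, 2.0)], [(1.0, 0.5), (2.0, 1.0)])
```

In a pipeline like the one the abstract describes, each scale's segments would then be passed to the embedding extractor and clustered separately before the per-scale results are combined.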
© (2023) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Futian Wang and Lailong Chen "Speaker diarization based on multi-timescale feature fusion", Proc. SPIE 12642, Second International Conference on Electronic Information Engineering, Big Data, and Computer Technology (EIBDCT 2023), 126422K (2 May 2023); https://doi.org/10.1117/12.2674733
KEYWORDS
Feature fusion
Data modeling
Matrices
Education and training
Feature extraction
Interference (communication)
Video
