Abstract:
Speech emotion recognition has made significant progress in recent years, with growing attention paid to feature representation learning, yet extracting discriminative emotional features remains unresolved. In this paper, we propose MDSCM, a Multi-attention based Depthwise Separable Convolutional Model for speech emotional feature extraction, which reduces feature redundancy by separating spatial-wise and channel-wise convolution. MDSCM also enhances feature discriminability through a multi-attention module that focuses on learning features carrying more emotional information. In addition, we propose an Audio-Visual Domain Adaptation Learning paradigm (AVDAL) to learn an audio-visual emotion-identity space. A shared audio-visual representation encoder is built to transfer emotional knowledge learned from the visual domain, complementing and enhancing the emotional features extracted from speech alone. A domain classifier and an emotion classifier are used to train the encoder, reducing the mismatch between domain features and enhancing the discriminability of the features for emotion recognition. Experimental results on the IEMOCAP dataset demonstrate that our proposed method outperforms other state-of-the-art speech emotion recognition systems, achieving 72.43% weighted accuracy and 73.22% unweighted accuracy. The code is available at https://github.com/Janie1996/AV4SER. Copyright © 2022 ISCA.
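To illustrate the general technique named in the abstract, the following is a minimal PyTorch sketch of a depthwise separable convolution block gated by a simple channel-attention module. It is not the authors' MDSCM (their exact layer configuration and multi-attention design are given in the paper and the repository above); the class name, layer sizes, and the squeeze-and-excitation-style attention used here are assumptions for illustration only.

    import torch
    import torch.nn as nn

    class DepthwiseSeparableAttnBlock(nn.Module):
        # Illustrative block: depthwise (spatial-wise) conv + pointwise (channel-wise)
        # conv, followed by a channel-attention gate. Splitting the convolution this
        # way uses far fewer parameters than a standard conv, which is the source of
        # the redundancy reduction mentioned in the abstract.
        def __init__(self, in_ch, out_ch, kernel_size=3, reduction=4):
            super().__init__()
            # Depthwise convolution: one filter per input channel (groups=in_ch).
            self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                       padding=kernel_size // 2, groups=in_ch)
            # Pointwise 1x1 convolution mixes information across channels.
            self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)
            # Simple channel attention: global pooling -> bottleneck -> sigmoid gate.
            self.attn = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(out_ch, out_ch // reduction, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(out_ch // reduction, out_ch, 1),
                nn.Sigmoid(),
            )

        def forward(self, x):  # x: (batch, channels, freq, time) spectrogram features
            x = self.pointwise(self.depthwise(x))
            return x * self.attn(x)  # re-weight channels by their attention scores

    # Example usage on a dummy batch of log-Mel feature maps.
    x = torch.randn(8, 32, 64, 100)
    y = DepthwiseSeparableAttnBlock(32, 64)(x)
    print(y.shape)  # torch.Size([8, 64, 64, 100])

The AVDAL part of the system would sit on top of such a speech encoder, training a shared audio-visual encoder jointly with a domain classifier and an emotion classifier as described in the abstract; that training loop is not sketched here.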
ISSN: 2308-457X
Year: 2022
Volume: 2022-September
Page: 1988-1992
Language: English
SCOPUS Cited Count: 7