Speech Emotion Recognition Based on Self-Attention Weight Correction for Acoustic and Text Features

Jennifer Santoso*, Takeshi Yamada, Kenkichi Ishizuka, Taiichi Hashimoto, Shoji Makino

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

6 Citations (Scopus)

Abstract

Speech emotion recognition (SER) is essential for understanding a speaker's intention. Recently, several groups have attempted to improve SER performance by using a bidirectional long short-term memory (BLSTM) to extract features from speech sequences and a self-attention mechanism to focus on the important parts of those sequences. SER also benefits from combining the information in speech with text, which can be obtained automatically with an automatic speech recognizer (ASR), further improving SER performance. However, ASR performance deteriorates in the presence of emotion in speech. Although there is a method to improve ASR performance on emotional speech, it requires fine-tuning the ASR, which has a high computational cost and leads to the loss of cues that indicate the presence of emotion in speech segments and can therefore be helpful for SER. To solve these problems, we propose a BLSTM- and self-attention-based SER method using self-attention weight correction (SAWC) with confidence measures. SAWC is applied to the acoustic and text feature extractors in SER to adjust the importance weights of speech segments and words that have a high possibility of containing ASR errors. Our proposed SAWC reduces the importance of words containing speech recognition errors in the text features while emphasizing the importance of the speech segments containing these words in the acoustic features. Our experimental results on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset reveal that the proposed method achieves a weighted average accuracy of 76.6%, outperforming other state-of-the-art methods. Furthermore, we investigated the behavior of the proposed SAWC in each of the feature extractors.
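The core idea described in the abstract — down-weighting likely ASR errors in the text branch while up-weighting the corresponding speech segments in the acoustic branch — can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name, the specific correction formula, and the assumption of per-word ASR confidence scores in [0, 1] are all hypothetical.

```python
import numpy as np

def correct_attention_weights(text_weights, acoustic_weights, confidences):
    """Adjust self-attention importance weights using ASR word confidences.

    Words with low ASR confidence (likely recognition errors) are
    down-weighted in the text branch, and the speech segments aligned
    with those words are up-weighted in the acoustic branch. The exact
    correction formula here is illustrative only.
    """
    # Down-weight unreliable words in the text attention.
    corrected_text = text_weights * confidences
    # Up-weight the matching speech segments in the acoustic attention:
    # confidence 1.0 leaves the weight unchanged; confidence 0.0 doubles it.
    corrected_acoustic = acoustic_weights * (2.0 - confidences)
    # Renormalize so each set of importance weights sums to 1 again.
    return (corrected_text / corrected_text.sum(),
            corrected_acoustic / corrected_acoustic.sum())

# Example: four aligned word/segment pairs; the word at index 2 has a
# low ASR confidence, i.e. it is probably a recognition error.
text_w = np.array([0.25, 0.25, 0.25, 0.25])
acoustic_w = np.array([0.25, 0.25, 0.25, 0.25])
conf = np.array([0.95, 0.90, 0.30, 0.92])

t, a = correct_attention_weights(text_w, acoustic_w, conf)
```

After correction, the low-confidence word receives less text attention than its reliable neighbors, while the speech segment containing it receives more acoustic attention, reflecting the intuition that acoustic evidence matters most exactly where the transcript is least trustworthy.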

Original language: English
Pages (from-to): 115732-115743
Number of pages: 12
Journal: IEEE Access
Volume: 10
Publication status: Published - 2022

Keywords

  • Speech emotion recognition
  • automatic speech recognition
  • confidence measure
  • self-attention mechanism

ASJC Scopus subject areas

  • General Engineering
  • General Computer Science
  • General Materials Science

