In this talk, we propose a new approach to multimodal emotion recognition that combines cross-modal attention with convolutional neural networks operating on raw audio waveforms. Our approach uses both audio and text to predict the emotion label: an audio encoder extracts high-level features directly from the raw audio waveform, and a text encoder extracts high-level semantic information from the transcript. Cross-modal attention lets the features from the audio encoder attend to the features from the text encoder and vice versa, modeling the interaction between the speech and text sequences so that the most relevant features for emotion recognition are extracted. Our experiments show that the proposed approach achieves state-of-the-art results on the IEMOCAP dataset [1], with a 1.9% absolute improvement in accuracy over the previous state-of-the-art method [2]. Because our approach processes the raw waveform with a 1D convolutional neural network instead of spectrogram features, we also compare the two front ends: the raw-waveform model gives a 0.54% improvement over the spectrogram-based model.
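The bidirectional cross-modal attention described above can be sketched with plain scaled dot-product attention: each modality's feature sequence forms the queries, and the other modality's sequence supplies the keys and values. The sketch below is a minimal NumPy illustration of that idea, not the talk's actual implementation; the sequence lengths, feature dimension, and function name are assumptions for the example.

```python
import numpy as np

def cross_modal_attention(queries, keys_values):
    """Scaled dot-product attention in which one modality's features
    (queries) attend to the other modality's features (keys_values).
    Shapes: queries (T_q, d), keys_values (T_kv, d) -> output (T_q, d)."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)            # (T_q, T_kv)
    # Numerically stable row-wise softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values                             # (T_q, d)

# Hypothetical encoder outputs: 50 audio frames and 12 text tokens,
# both projected to a shared 64-dimensional feature space.
rng = np.random.default_rng(0)
audio_feats = rng.standard_normal((50, 64))
text_feats = rng.standard_normal((12, 64))

# Each direction of the bidirectional scheme: audio attends to text,
# and text attends to audio.
audio_attended = cross_modal_attention(audio_feats, text_feats)
text_attended = cross_modal_attention(text_feats, audio_feats)
```

In a full model, the attended features from both directions would typically be fused (e.g., concatenated with the original features) before the emotion classification layer.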
