In this project, we performed multimodal emotion recognition on the publicly available MELD dataset, sourced from Kaggle. The dataset was distributed together with pre-computed encodings (text, audio, and bimodal fusion). Because the raw audio files were not included, we used the pre-computed audio encodings. We attempted to generate video encodings as well, but were unsuccessful due to hardware limitations.
MELD (Multimodal EmotionLines Dataset) is a publicly available benchmark for multimodal emotion recognition. The reported benchmark results for this dataset are:
- Accuracy: 68.7%
- F1 Score: 69.9
Our fusion model achieved:
- Accuracy: 54.75%
- F1 Score: 47.7
Additionally, we built a custom dataset, called MELD Embeddings, in which we generated and stored the required encodings. The training set consists of 9989 examples, categorized into 7 emotion classes.
- Encodings were generated by passing each example through a BERT model in a one-shot manner (see the first sketch after this list).
- Each encoding was flattened into a 1D array of size 98304.
- Using a custom autoencoder, each encoding was then reduced to a size of 1611 (see the autoencoder sketch after this list).
- The autoencoder's encoder and decoder layers utilized:
  - Deep neural networks
  - ReLU activation (except the final layer)
  - Dropout for regularization
- The pre-computed audio encodings from the MELD dataset were used (size: 1611 per encoding), as the raw audio files were not available.
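As a point of reference, the snippet below is a minimal sketch of how a one-shot BERT encoding of the size described above could be produced. The `bert-base-uncased` checkpoint and the 128-token maximum length are assumptions, chosen because 128 tokens × 768 hidden units flatten to exactly 98304 values; the project's actual preprocessing may differ.

```python
# Sketch only: one-shot BERT encoding of a single utterance.
# Assumption: 128 tokens x 768 hidden units -> 98304 values when flattened.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

def encode_utterance(text: str) -> torch.Tensor:
    """Return a flattened BERT encoding of shape (98304,) for one utterance."""
    inputs = tokenizer(
        text,
        padding="max_length",
        truncation=True,
        max_length=128,       # assumed sequence length
        return_tensors="pt",
    )
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state   # shape: (1, 128, 768)
    return hidden.flatten()                          # shape: (98304,)

print(encode_utterance("I can't believe you did that!").shape)  # torch.Size([98304])
```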
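The autoencoder itself could look something like the PyTorch sketch below: fully connected encoder and decoder layers with ReLU and dropout on every layer except the last, compressing the 98304-dimensional text encoding down to 1611. Only the input and bottleneck sizes come from the description above; the hidden-layer widths and dropout rate are placeholders.

```python
# Sketch of the dense autoencoder described above; hidden widths are assumptions.
import torch.nn as nn

class DenseAutoencoder(nn.Module):
    """Fully connected autoencoder with ReLU + dropout on all but the
    final layer of both the encoder and the decoder."""

    def __init__(self, input_dim, hidden_dims, bottleneck_dim, dropout=0.2):
        super().__init__()
        enc_dims = [input_dim, *hidden_dims, bottleneck_dim]
        self.encoder = self._build(enc_dims, dropout)
        self.decoder = self._build(list(reversed(enc_dims)), dropout)

    @staticmethod
    def _build(dims, dropout):
        layers = []
        for i in range(len(dims) - 1):
            layers.append(nn.Linear(dims[i], dims[i + 1]))
            if i < len(dims) - 2:            # no activation/dropout on the final layer
                layers += [nn.ReLU(), nn.Dropout(dropout)]
        return nn.Sequential(*layers)

    def forward(self, x):
        z = self.encoder(x)                  # compressed representation
        return self.decoder(z), z            # reconstruction + bottleneck code

# Text autoencoder: 98304 -> 1611 (hidden widths are illustrative).
text_ae = DenseAutoencoder(input_dim=98304, hidden_dims=[8192, 4096], bottleneck_dim=1611)
```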
The audio and text encodings were concatenated, resulting in encoding tensors of size 3222 for each example.
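Concretely, the fusion step amounts to concatenating the two modality vectors for each example; a minimal NumPy sketch is shown below (file names and variable names are illustrative, not the project's actual files).

```python
# Sketch of the bimodal fusion step: text code (1611) + audio encoding (1611) -> 3222.
import numpy as np

text_codes = np.load("text_encodings.npy")    # assumed file; shape (9989, 1611)
audio_codes = np.load("audio_encodings.npy")  # assumed file; shape (9989, 1611)

fused = np.concatenate([text_codes, audio_codes], axis=1)
print(fused.shape)                            # (9989, 3222)
```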
A Random Forest Classifier (sketched below) was applied with:
- 100 estimators
- Otherwise default hyperparameters

It achieved:
- Accuracy: 55.41%
- F1 Score: 46.61
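A minimal scikit-learn sketch of this step follows. The random split, the random seed, and the weighted F1 average are assumptions for illustration; the placeholder arrays stand in for the real fused encodings and emotion labels, and MELD's own dev/test splits could be used instead.

```python
# Sketch of the Random Forest baseline on the fused encodings.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Placeholder data with the shapes described above: 3222-dim fused encodings,
# 7-class emotion labels.
X = np.random.randn(9989, 3222).astype(np.float32)
y = np.random.randint(0, 7, size=9989)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)  # otherwise default hyperparameters
clf.fit(X_train, y_train)

preds = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, preds))
print("F1:", f1_score(y_test, preds, average="weighted"))
```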
The fused encodings were further reduced to a size of 64 using another custom autoencoder (see the sketch after the following list).
- Encoder and decoder layers used:
  - Deep neural networks
  - ReLU activation (except the final layer)
  - Dropout for regularization
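Reusing the `DenseAutoencoder` sketch shown earlier, the fusion reduction and a plain MSE reconstruction training loop might look like the following. Hidden widths, batch size, learning rate, and epoch count are all assumptions, and the random tensor stands in for the real fused encodings.

```python
# Sketch: reduce the 3222-dim fused encodings to 64 dims with the same
# autoencoder pattern, trained on an MSE reconstruction objective.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

fused_tensor = torch.randn(9989, 3222)           # placeholder for the real fused encodings
fusion_ae = DenseAutoencoder(input_dim=3222, hidden_dims=[1024, 256], bottleneck_dim=64)

optimizer = torch.optim.Adam(fusion_ae.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
loader = DataLoader(TensorDataset(fused_tensor), batch_size=64, shuffle=True)

for epoch in range(20):
    for (batch,) in loader:
        reconstruction, _ = fusion_ae(batch)
        loss = loss_fn(reconstruction, batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Keep only the encoder output as the 64-dim feature for classification.
fusion_ae.eval()
with torch.no_grad():
    _, reduced = fusion_ae(fused_tensor)
print(reduced.shape)                             # torch.Size([9989, 64])
```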
After encoding reduction, the Random Forest Classifier achieved:
- Accuracy: 54.75%
- F1 Score: 47.7
Other classifiers, such as XGBoost, AdaBoost, and Gradient Boosting, were also tested, but the Random Forest Classifier provided the best results.
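A sketch of how such a comparison could be run is shown below; the classifier settings are library defaults and the data arrays are placeholders for the reduced 64-dimensional encodings and the 7-class labels.

```python
# Sketch: comparing several classifiers on the reduced 64-dim encodings.
import numpy as np
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X = np.random.randn(9989, 64).astype(np.float32)   # placeholder reduced encodings
y = np.random.randint(0, 7, size=9989)              # placeholder 7-class labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "XGBoost": XGBClassifier(eval_metric="mlogloss"),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(f"{name}: acc={accuracy_score(y_test, preds):.4f}, "
          f"f1={f1_score(y_test, preds, average='weighted'):.4f}")
```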
Access to more capable hardware, and with it the ability to generate video encodings, would likely improve the accuracy and F1 scores. Further exploration of model architectures and more extensive hyperparameter tuning could also enhance performance.
This project has potential future applications, such as analyzing video and audio data to determine a person's emotions or sentiments. This could help assess whether someone is lying or has malicious intentions, or reveal other psychological cues.