Abstract:
Emotion recognition from speech has become a significant research area due to its potential applications in diverse fields such as mental health, human-robot interaction, and virtual
assistants. This work presents an approach to classifying emotions by gender into 14 different classes by concatenating four emotional speech datasets: RAVDESS, SAVEE, TESS,
and CREMA-D. Multiple features, including MFCCs, spectral contrast, pitch, spectral centroid, spectral rolloff, onset flux, entropy, and zero-crossing rate (ZCR), are extracted from the audio files, and these features are
utilized as inputs to a Temporal Convolutional Network (TCN) for emotion classification.
The TCN is trained to learn high-level features from the extracted features of the four datasets,
and these high-level features are then used to classify emotions. In this approach,
different combinations of feature sets and the TCN classifier are evaluated to identify the
optimal combination that achieves the highest training and validation accuracies. The combination of MFCCs and entropy achieved the highest accuracy rates, with 99.4% and 99.01%
for training and validation, respectively. The other combinations of features also achieved
high accuracy rates, with some variations in performance. The proposed methodology is
effective in accurately classifying emotions from speech, and the use of a TCN in conjunction
with a variety of feature sets, together with the extraction of high-level features, significantly
improved the performance of the emotion classification system.
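
The abstract does not name a specific toolkit, so the following is only an illustrative sketch of how the listed features could be extracted with librosa. The function name extract_features, the sampling rate, the YIN pitch range, and the manual spectral-entropy computation are assumptions for illustration, not details taken from the paper.

```python
import numpy as np
import librosa

def extract_features(path, sr=22050, n_mfcc=40):
    """Extract frame-level acoustic features from one audio file (illustrative sketch)."""
    y, _ = librosa.load(path, sr=sr)

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)            # (n_mfcc, T)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)          # (7, T)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)          # (1, T)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)            # (1, T)
    zcr = librosa.feature.zero_crossing_rate(y)                       # (1, T)
    onset = librosa.onset.onset_strength(y=y, sr=sr)[np.newaxis, :]   # onset/flux envelope, (1, T)

    # Pitch via the YIN estimator (one of several possible pitch trackers).
    f0 = librosa.yin(y, fmin=65, fmax=2093, sr=sr)[np.newaxis, :]     # (1, T)

    # Spectral entropy per frame, computed from the normalized magnitude spectrum
    # (librosa has no built-in entropy feature, so this part is a manual approximation).
    S = np.abs(librosa.stft(y))
    p = S / (S.sum(axis=0, keepdims=True) + 1e-10)
    entropy = (-(p * np.log2(p + 1e-10)).sum(axis=0))[np.newaxis, :]  # (1, T)

    # Align frame counts (features may differ by a frame or two) and stack.
    mats = (mfcc, contrast, centroid, rolloff, zcr, onset, f0, entropy)
    T = min(m.shape[-1] for m in mats)
    feats = np.vstack([m[..., :T] for m in mats])
    return feats.T  # (T, feature_dim) -- time-major input suitable for a TCN
```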
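Likewise, the abstract does not specify the TCN's architecture or framework. A minimal sketch of a dilated causal Conv1D classifier in Keras, assuming a 14-way softmax over the gender-emotion classes, might look like the following; all layer counts, filter sizes, and hyperparameters are placeholders rather than the authors' settings.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_tcn(n_timesteps, n_features, n_classes=14,
              n_filters=64, kernel_size=3, dilations=(1, 2, 4, 8)):
    """Minimal TCN: a stack of dilated causal Conv1D blocks with residual connections."""
    inputs = layers.Input(shape=(n_timesteps, n_features))
    x = inputs
    for d in dilations:
        y = layers.Conv1D(n_filters, kernel_size, padding="causal",
                          dilation_rate=d, activation="relu")(x)
        y = layers.Dropout(0.2)(y)
        # Match channel dimensions before the residual addition.
        if x.shape[-1] != n_filters:
            x = layers.Conv1D(n_filters, 1, padding="same")(x)
        x = layers.Add()([x, y])
    x = layers.GlobalAveragePooling1D()(x)  # collapse the time axis
    outputs = layers.Dense(n_classes, activation="softmax")(x)

    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Example: sequences of 300 frames with 60-dimensional feature vectors (illustrative sizes).
model = build_tcn(n_timesteps=300, n_features=60)
model.summary()
```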