Multimodal Large Language Model-Action Unit Approach for Mixed Emotion Descriptors

Authors
Kim, Cheolhee; Ji, Seungyeon; Han, Kyungreem
Issue Date
2025-10-08
Publisher
IEEE Systems, Man and Cybernetics Society
Citation
IEEE International Conference on Systems, Man, and Cybernetics (SMC)
Abstract
Facial Expression Recognition (FER) allows computers to identify emotional expressions on a human face. While recent vision-language models have demonstrated remarkable performance across various single-emotion FER tasks, often exceeding human-level performance, they usually fail on mixed-emotion cases. This study describes a Multimodal Large Language Model (MLLM) approach, the Emo-AU (Action Unit) LLM, for mixed-emotion detection using the FERPlus dataset. The Emo-AU LLM uses a cross-modal attention mechanism that relates AU and visual features to circumvent the limitations of current visual encoders, which rely on coarse facial features. Our model achieved an accuracy of 98.30% for single-emotion detection, and accuracies of 89.76% for the major emotion (the label assigned by more of the 10 annotators) and 82.42% for the minor emotion (the label assigned by fewer annotators) in two-emotion cases. The model also explains the reasoning behind its predictions, a benefit of the underlying large language model. This study lays the foundation for understanding facial expressions in context, considering the surrounding situation and social interaction rather than relying solely on the face itself, in both real and generated images and videos.
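
The abstract mentions a cross-modal attention mechanism that relates AU and visual features but does not specify its implementation. The following is a minimal sketch of one plausible realization, assuming AU tokens serve as queries over visual patch features from a vision encoder; all module names, dimensions, and shapes (au_dim, vis_dim, AUVisualCrossAttention, etc.) are hypothetical illustrations, not the authors' actual architecture.

import torch
import torch.nn as nn

class AUVisualCrossAttention(nn.Module):
    """Hypothetical cross-modal block: AU tokens attend over visual patch features."""
    def __init__(self, au_dim=64, vis_dim=768, embed_dim=256, num_heads=4):
        super().__init__()
        # Project both modalities into a shared embedding space (assumed dims).
        self.au_proj = nn.Linear(au_dim, embed_dim)
        self.vis_proj = nn.Linear(vis_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, au_feats, vis_feats):
        # au_feats:  (B, N_au, au_dim)      per-AU embeddings
        # vis_feats: (B, N_patches, vis_dim) visual-encoder patch features
        q = self.au_proj(au_feats)        # queries from the AU branch
        kv = self.vis_proj(vis_feats)     # keys/values from the visual branch
        fused, _ = self.attn(q, kv, kv)   # each AU token attends over visual patches
        return self.norm(fused + q)       # residual connection + layer norm

if __name__ == "__main__":
    au = torch.randn(2, 17, 64)     # e.g., 17 AU tokens per face (assumed)
    vis = torch.randn(2, 197, 768)  # e.g., ViT patch tokens (assumed)
    block = AUVisualCrossAttention()
    print(block(au, vis).shape)     # torch.Size([2, 17, 256])

In such a design, the fused AU-visual tokens could then be passed to the language model so that emotion predictions are grounded in fine-grained facial muscle activity rather than coarse visual features alone; the paper's actual fusion and prompting strategy may differ.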
Appears in Collections:
KIST Conference Paper > 2025