Researchers from the University of Cambridge, Hunan University, and other institutions have taken a significant step in AI-based emotion understanding, unveiling the first large-scale evaluation of how effectively large multimodal models, which process text, audio, and video, recognize emotional expressions in real-world contexts. The work, led by Jing Han and colleagues, moves beyond classifying a small, fixed set of emotions to the harder task of recognizing a broad spectrum of emotional cues, setting crucial benchmarks for the evolving field of emotional AI.
In their study, the team systematically tested 19 leading models, finding that the integration of audio, video, and text produced the most accurate results, with video being especially critical. The findings reveal that open-source models can compete closely with their closed-source counterparts, offering insights crucial for developing more sophisticated emotion recognition technologies.
Recent advancements in Large Language Models (LLMs) underscore the rapid evolution of multimodal AI, with significant contributions from major labs such as Google, Alibaba, and DeepSeek. Google’s Gemini, a family of models that process text, images, audio, and video, is designed to function as an agentic AI. Alibaba’s Qwen series, which includes dedicated audio-language models, has highlighted performance gains in Qwen2.5, while DeepSeek’s models have applied reinforcement learning to improve reasoning capabilities.
A key focus of current research is the development of prompting techniques aimed at enhancing LLM performance. Strategies such as chain-of-thought prompting, self-consistency, and least-to-most prompting are being explored to refine how these models reason. Direct Preference Optimization, an alternative to reinforcement learning from human feedback for aligning models with human preferences, is also being applied to improve the quality of model responses. The scope of this research extends beyond text into multimodal understanding, particularly video and audio processing. Models like LLaVA-Video and Tarsier2 are making strides in video comprehension, while Qwen-Audio aims for unified audio-language processing. Researchers are also investigating methods to improve temporal understanding in video LLMs and to scale the performance of open-source multimodal models.
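To make these prompting techniques concrete, the sketch below shows how a direct prompt, a chain-of-thought prompt, and a simple self-consistency vote might look for emotion recognition. It is an illustration only: the prompt wording, the `query_model` placeholder, and the label parsing are assumptions for demonstration, not the prompts or code used in the study.

```python
from collections import Counter

def query_model(prompt: str) -> str:
    """Placeholder: send `prompt` to a chat model and return its text reply."""
    raise NotImplementedError("Wire this up to your model provider of choice.")

def direct_prompt(transcript: str) -> str:
    # Baseline: ask for the labels directly, with no intermediate reasoning.
    return query_model(
        f"Transcript: {transcript}\n"
        "List the emotions expressed by the speaker."
    )

def chain_of_thought_prompt(transcript: str) -> str:
    # Chain-of-thought: elicit step-by-step reasoning about emotional cues
    # before asking for the final label list.
    return query_model(
        f"Transcript: {transcript}\n"
        "Step 1: Describe the emotional cues you notice (word choice, "
        "punctuation, context).\n"
        "Step 2: Explain what feelings those cues suggest.\n"
        "Step 3: Finally, list the emotions expressed by the speaker, "
        "separated by commas."
    )

def self_consistency(transcript: str, n_samples: int = 5) -> list[str]:
    # Self-consistency: sample several chain-of-thought answers (assumes the
    # model is queried with nonzero temperature so replies differ) and keep
    # labels that appear in a majority of the samples.
    all_labels: list[str] = []
    for _ in range(n_samples):
        answer = chain_of_thought_prompt(transcript).strip()
        last_line = answer.splitlines()[-1] if answer else ""
        all_labels.extend(
            w.strip().lower() for w in last_line.split(",") if w.strip()
        )
    counts = Counter(all_labels)
    return [label for label, count in counts.items() if count > n_samples // 2]
```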
A comprehensive evaluation framework has been constructed using the OV-MERD dataset to assess the reasoning, fusion strategies, and prompt design of 19 mainstream multimodal large language models (MLLMs) in open-vocabulary emotion recognition. This extensive benchmarking reveals both the capabilities and limitations of current MLLMs in understanding nuanced emotional expressions. The study builds upon previous methodologies that used emotional clues, extending them with innovative architectures for enhanced performance.
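For context on what open-vocabulary evaluation involves: predictions are free-form label sets rather than choices from a fixed list, so scoring typically means matching predicted labels against reference labels after grouping synonyms. The snippet below is a minimal sketch of such set-level scoring; the synonym table and the precision/recall formulation are assumptions for demonstration and may differ from the metric used with OV-MERD.

```python
# Hypothetical mapping from surface labels to canonical emotion groups.
SYNONYMS = {
    "joyful": "happy", "cheerful": "happy", "happy": "happy",
    "annoyed": "angry", "furious": "angry", "angry": "angry",
    "anxious": "worried", "nervous": "worried", "worried": "worried",
}

def normalize(labels: list[str]) -> set[str]:
    # Map each label to its canonical group; unknown labels are kept as-is.
    return {SYNONYMS.get(l.strip().lower(), l.strip().lower()) for l in labels}

def set_precision_recall(pred: list[str], ref: list[str]) -> tuple[float, float]:
    # Set-level precision/recall over normalized label groups.
    p, r = normalize(pred), normalize(ref)
    overlap = len(p & r)
    precision = overlap / len(p) if p else 0.0
    recall = overlap / len(r) if r else 0.0
    return precision, recall

# Example: a model predicts ["joyful", "nervous"], annotators gave ["happy"].
print(set_precision_recall(["joyful", "nervous"], ["happy"]))  # (0.5, 1.0)
```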
Through experimentation, researchers determined that a two-stage trimodal fusion—which integrates audio, video, and text—achieves optimal performance in emotion recognition. Video was identified as the most critical modality, significantly enhancing accuracy compared to audio or text alone. In-depth analysis of prompt engineering indicated a surprisingly narrow performance gap between open-source and closed-source LLMs. This research has established essential benchmarks and offers actionable guidelines for advancing fine-grained affective computing.
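As a rough illustration of what a two-stage trimodal pipeline can look like, the sketch below first extracts emotion-relevant descriptions from each modality separately, then fuses those descriptions into a single prompt for the final open-vocabulary labels. The staging, helper functions, and prompts here are assumptions, not the study's implementation.

```python
def describe_modality(modality: str, clip_path: str) -> str:
    """Placeholder: ask a multimodal model to describe emotional cues in one
    modality (e.g. facial expressions for video, prosody for audio)."""
    raise NotImplementedError("Call your multimodal model here.")

def transcribe(clip_path: str) -> str:
    """Placeholder: obtain the spoken transcript for the clip."""
    raise NotImplementedError("Call your ASR system or use provided subtitles.")

def query_text_model(prompt: str) -> str:
    """Placeholder: send the fused prompt to a text-only LLM."""
    raise NotImplementedError("Call your LLM here.")

def two_stage_fusion(clip_path: str) -> str:
    # Stage 1: gather modality-specific emotional clues independently.
    clues = {
        "video": describe_modality("video", clip_path),  # expressions, gestures
        "audio": describe_modality("audio", clip_path),  # tone, prosody
        "text": transcribe(clip_path),                   # spoken content
    }
    # Stage 2: fuse all three clue sources in one text prompt and ask for the
    # final open-vocabulary emotion labels.
    fusion_prompt = (
        "Visual clues: {video}\n"
        "Acoustic clues: {audio}\n"
        "Transcript: {text}\n"
        "Considering all three sources together, list the emotions expressed."
    ).format(**clues)
    return query_text_model(fusion_prompt)
```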
Insights from this research suggest that while complex reasoning models have their advantages, they do not necessarily outperform simpler models at direct emotion identification. The study highlights the need for future work on more comprehensive datasets, multilingual evaluations, and more sophisticated multimodal fusion techniques, which could further refine AI’s ability to understand and interpret emotions. As the field evolves, applications in areas such as medical education promise meaningful advances in emotion-aware human-computer interaction.
👉 More information
🗞 Pioneering Multimodal Emotion Recognition in the Era of Large Models: From Closed Sets to Open Vocabularies
🧠 ArXiv: https://arxiv.org/abs/2512.20938