Researchers Benchmark 19 Multimodal Models for Open-Vocabulary Emotion Recognition

Researchers from Cambridge and Hunan University benchmarked 19 multimodal models, finding that a two-stage fusion of audio, video, and text achieves the best emotion recognition accuracy, with video the most informative modality.

Researchers from the University of Cambridge, Hunan University, and other institutions have conducted the first large-scale evaluation of how effectively large multimodal models, which process text, audio, and video, recognize emotional expressions in real-world contexts. The work, led by Jing Han and colleagues, moves beyond classifying a fixed set of emotion categories to the harder task of recognizing a broad, open vocabulary of emotional cues, establishing benchmarks for the evolving field of emotional AI.

In their study, the team systematically tested 19 leading models, finding that the integration of audio, video, and text produced the most accurate results, with video being especially critical. The findings reveal that open-source models can compete closely with their closed-source counterparts, offering insights crucial for developing more sophisticated emotion recognition technologies.

Recent advances in large language models (LLMs) underscore the rapid evolution of multimodal AI, with significant contributions from companies such as Google and Alibaba. Google's Gemini family of models processes text, images, audio, and video and is designed to function as an agentic AI. Alibaba's Qwen series, which includes dedicated audio and language models, reported performance gains with Qwen2.5, while DeepSeek, a separate lab, has applied reinforcement learning to improve its models' reasoning capabilities.

A key focus of current research is the development of prompting techniques aimed at enhancing LLM performance. Strategies such as chain-of-thought prompting, self-consistency, and least-to-most prompting are being explored to refine the reasoning processes of these models. Direct Preference Optimization, a preference-alignment alternative to reinforcement learning from human feedback, is also being applied to improve the quality of model responses. The scope of this research extends beyond text into multimodal understanding, particularly video and audio processing. Models like LLaVA-Video and Tarsier2 are making strides in video comprehension, while Qwen-Audio aims for unified audio-language processing. Researchers are also investigating methods to improve temporal understanding in video LLMs and scale the performance of open-source multimodal models.
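To make the prompting strategies above concrete, here is a minimal sketch of self-consistency: sample several chain-of-thought answers from a model and keep the majority vote. The `sample_fn` callable is a hypothetical stand-in for an actual LLM API call, which is not specified in the article.

```python
from collections import Counter

def self_consistency(prompt, sample_fn, n_samples=5):
    """Sample the model n_samples times on the same prompt and
    return the most frequent final answer (majority vote)."""
    answers = [sample_fn(prompt) for _ in range(n_samples)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner

# Canned responses simulate five stochastic chain-of-thought samples;
# a real sample_fn would call a model with temperature > 0.
canned = iter(["joy", "joy", "surprise", "joy", "sadness"])
result = self_consistency("Describe the speaker's emotion.",
                          lambda p: next(canned))
# result == "joy", the answer appearing in 3 of the 5 samples
```

The voting step is what distinguishes self-consistency from plain chain-of-thought: individual reasoning chains may err, but agreement across samples is a useful correctness signal.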

A comprehensive evaluation framework has been constructed using the OV-MERD dataset to assess the reasoning, fusion strategies, and prompt design of 19 mainstream multimodal large language models (MLLMs) in open-vocabulary emotion recognition. This benchmarking reveals both the capabilities and limitations of current MLLMs in understanding nuanced emotional expressions. The study builds on previous methodologies that used emotional cues, extending them with new architectures for improved performance.
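Open-vocabulary evaluation means predictions are free-form label sets rather than picks from a fixed taxonomy, so scoring reduces to comparing label sets. The sketch below uses exact lowercase matching; this is a simplified assumption, as the actual OV-MERD protocol may group synonymous labels before scoring.

```python
def set_overlap_scores(predicted, reference):
    """Precision/recall/F1 between a free-form predicted emotion
    label set and a multi-label ground-truth set, after normalizing
    case and whitespace. Exact string match is a simplification."""
    pred = {p.strip().lower() for p in predicted}
    ref = {r.strip().lower() for r in reference}
    if not pred or not ref:
        return 0.0, 0.0, 0.0
    hits = len(pred & ref)                      # labels both sets share
    precision = hits / len(pred)
    recall = hits / len(ref)
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

p, r, f1 = set_overlap_scores(["Angry", "frustrated", "tense"],
                              ["angry", "frustrated"])
# precision = 2/3, recall = 1.0, f1 = 0.8
```

Set-based scoring rewards models for covering the ground-truth emotions without penalizing them for using a different label order, which matters when models generate labels freely.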

Through experimentation, researchers determined that a two-stage trimodal fusion—which integrates audio, video, and text—achieves optimal performance in emotion recognition. Video was identified as the most critical modality, significantly enhancing accuracy compared to audio or text alone. In-depth analysis of prompt engineering indicated a surprisingly narrow performance gap between open-source and closed-source LLMs. This research has established essential benchmarks and offers actionable guidelines for advancing fine-grained affective computing.
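The two-stage trimodal fusion described above can be sketched at a high level: first combine the two perceptual streams (audio and video), then merge the result with text. The article does not specify the fusion mechanism, so plain concatenation stands in for whatever learned fusion the models actually use; the feature vectors and stage ordering here are illustrative assumptions.

```python
def concat(*vecs):
    """Concatenate feature vectors represented as plain lists."""
    out = []
    for v in vecs:
        out.extend(v)
    return out

def fuse_two_stage(audio, video, text):
    """Stage 1: fuse the audio and video streams.
    Stage 2: merge the audiovisual result with the text features.
    Concatenation is a stand-in for a learned fusion layer."""
    audiovisual = concat(audio, video)   # stage 1: perceptual fusion
    return concat(audiovisual, text)     # stage 2: add language

fused = fuse_two_stage([0.1, 0.2], [0.3, 0.4, 0.5], [0.6])
# fused == [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
```

Staging the fusion lets the model align the tightly coupled audio and video signals first, consistent with the study's finding that video carries the most emotional information.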

Insights from this research suggest that complex reasoning models, despite their advantages, do not necessarily outperform simpler models at direct emotion identification. The study highlights the need for future work on more comprehensive datasets, multilingual evaluations, and more sophisticated multimodal fusion techniques, which could further refine AI's ability to understand and interpret emotions. As the field evolves, applications in domains such as healthcare and education promise significant advances in human-computer interaction.

👉 More information
🗞 Pioneering Multimodal Emotion Recognition in the Era of Large Models: From Closed Sets to Open Vocabularies
🧠 ArXiv: https://arxiv.org/abs/2512.20938

Written By: AiPressa Staff

© 2025 AIPressa · Part of Buzzora Media · All rights reserved.