
Researchers Benchmark 19 Multimodal Models for Open-Vocabulary Emotion Recognition

Researchers from Cambridge and Hunan University benchmarked 19 multimodal models, finding that a two-stage fusion of audio, video, and text delivers the best emotion recognition accuracy, with video the most informative single modality.

Researchers from the University of Cambridge, Hunan University, and other institutions have unveiled the first large-scale evaluation of how effectively large multimodal models (those capable of processing text, audio, and video) recognize emotional expressions in real-world contexts. The work, led by Jing Han and colleagues, goes beyond identifying a restricted set of emotions to tackle the harder task of recognizing a broad spectrum of emotional cues, setting crucial benchmarks for the evolving field of emotional AI.

In their study, the team systematically tested 19 leading models, finding that the integration of audio, video, and text produced the most accurate results, with video being especially critical. The findings reveal that open-source models can compete closely with their closed-source counterparts, offering insights crucial for developing more sophisticated emotion recognition technologies.

Recent advancements in Large Language Models (LLMs) underscore the rapid evolution of multimodal AI, with significant contributions from labs such as Google, Alibaba, and DeepSeek. Google's Gemini, a family of models that process text, images, audio, and video, is designed to function as an agentic AI. Alibaba's Qwen series, which includes dedicated audio and language models, has highlighted performance gains in Qwen2.5, while DeepSeek has applied reinforcement learning to improve its models' reasoning capabilities.

A key focus of current research is the development of prompting techniques aimed at enhancing LLM performance. Strategies such as chain-of-thought prompting, self-consistency, and least-to-most prompting are being explored to refine the reasoning processes of these models, while Direct Preference Optimization, a preference-tuning alternative to reinforcement learning from human feedback, is being applied to improve the quality of model responses. The scope of this research extends beyond text into multimodal understanding, particularly video and audio processing: models like LLaVA-Video and Tarsier2 are making strides in video comprehension, while Qwen-Audio aims for unified audio-language processing. Researchers are also investigating methods to improve temporal understanding in video LLMs and to scale the performance of open-source multimodal models.
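
To make the prompting idea concrete, here is a minimal sketch of chain-of-thought prompting applied to emotion recognition. The `query_model` helper and the prompt wording are illustrative assumptions, not the setup used in the study.

```python
# A minimal sketch of chain-of-thought prompting for emotion recognition.
# `query_model` is a hypothetical stand-in for any chat-completion API;
# the prompt wording is illustrative, not the one used in the paper.

def query_model(prompt: str) -> str:
    # Placeholder: replace with a call to your model endpoint.
    return ("The frown and raised voice suggest frustration.\n"
            "angry, frustrated")

def emotion_with_cot(transcript: str, video_caption: str) -> str:
    prompt = (
        "You are an expert in affective computing.\n"
        f"Dialogue transcript: {transcript}\n"
        f"Video description: {video_caption}\n"
        "First, reason step by step about the speaker's expression, tone, "
        "and wording. Then output a comma-separated list of "
        "open-vocabulary emotion labels on the final line."
    )
    reply = query_model(prompt)
    return reply.splitlines()[-1]  # keep only the final label line

print(emotion_with_cot("I can't believe you did that!",
                       "Speaker frowns and crosses their arms."))
# -> angry, frustrated
```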

A comprehensive evaluation framework has been constructed using the OV-MERD dataset to assess the reasoning, fusion strategies, and prompt design of 19 mainstream multimodal large language models (MLLMs) in open-vocabulary emotion recognition. This extensive benchmarking reveals both the capabilities and limitations of current MLLMs in understanding nuanced emotional expressions. The study builds upon previous methodologies that used emotional clues, extending them with innovative architectures for enhanced performance.
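
Because open-vocabulary recognition cannot rely on a fixed label set, predictions are typically scored by how well the predicted label set overlaps the reference set. Below is a minimal sketch of such set-level scoring; the benchmark's actual protocol may differ, for example by grouping synonymous labels before matching.

```python
# A minimal sketch of set-level scoring for open-vocabulary emotion labels.
# The OV-MERD protocol itself may differ (e.g. it could merge synonymous
# labels before matching); this illustrates the idea with exact string
# matching after simple normalization.

def normalize(labels):
    return {label.strip().lower() for label in labels if label.strip()}

def set_scores(predicted, reference):
    pred, ref = normalize(predicted), normalize(reference)
    if not pred or not ref:
        return 0.0, 0.0, 0.0
    overlap = len(pred & ref)
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f1 = set_scores(["Angry", "frustrated"], ["angry", "upset"])
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# -> precision=0.50 recall=0.50 f1=0.50
```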

Through experimentation, the researchers determined that a two-stage trimodal fusion, integrating audio, video, and text, achieves optimal performance in emotion recognition. Video proved the most critical modality, improving accuracy significantly over audio or text alone. An in-depth analysis of prompt engineering revealed a surprisingly narrow performance gap between open-source and closed-source LLMs. The research establishes essential benchmarks and offers actionable guidelines for advancing fine-grained affective computing.
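
The paper's exact architecture is not detailed here, but one common reading of a two-stage trimodal pipeline is describe-then-fuse: first convert audio and video into textual emotional clues, then fuse them with the transcript in a single prompt. The helpers below are hypothetical placeholders illustrating that pattern, not APIs from the study.

```python
# A minimal sketch of a two-stage trimodal pipeline, assuming the common
# describe-then-fuse pattern; the staging in the paper may differ.
# All three helpers are hypothetical placeholders.

def describe_audio(audio_path: str) -> str:
    # Placeholder: an audio-language model would summarize tone here.
    return "Voice is loud and trembling, with a rising pitch."

def describe_video(video_path: str) -> str:
    # Placeholder: a video-language model would summarize expressions here.
    return "The speaker frowns, narrows their eyes, and clenches a fist."

def query_model(prompt: str) -> str:
    # Placeholder: replace with a call to your model endpoint.
    return "angry, agitated"

def two_stage_emotion(audio_path: str, video_path: str,
                      transcript: str) -> str:
    # Stage 1: turn the non-text modalities into textual emotional clues.
    audio_clue = describe_audio(audio_path)
    video_clue = describe_video(video_path)
    # Stage 2: fuse all three modalities in a single text prompt.
    prompt = (
        f"Transcript: {transcript}\n"
        f"Audio clues: {audio_clue}\n"
        f"Video clues: {video_clue}\n"
        "List the emotions conveyed, as open-vocabulary labels."
    )
    return query_model(prompt)

print(two_stage_emotion("a.wav", "a.mp4", "How could you do this to me?"))
# -> angry, agitated
```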

Insights from this research suggest that while complex reasoning models have their advantages, they do not necessarily outperform simpler models at direct emotion identification. The study highlights the need for future work on more comprehensive datasets, multilingual evaluations, and more sophisticated multimodal fusion techniques, which could further refine AI's ability to understand and interpret emotions. As the field evolves, applications in domains such as medicine and education promise significant advances in human-computer interaction.

👉 More information
🗞 Pioneering Multimodal Emotion Recognition in the Era of Large Models: From Closed Sets to Open Vocabularies
🧠 arXiv: https://arxiv.org/abs/2512.20938
