Researchers Benchmark 19 Multimodal Models for Open-Vocabulary Emotion Recognition

Researchers from Cambridge and Hunan University benchmarked 19 multimodal models, finding that a two-stage fusion of audio, video, and text delivers the most accurate emotion recognition, with video the most influential modality.

Researchers from the University of Cambridge, Hunan University, and other institutions have unveiled the first large-scale evaluation of how effectively large multimodal models, which process text, audio, and video, recognize emotional expressions in real-world contexts. The work, led by Jing Han and colleagues, goes beyond identifying a restricted set of emotion categories to tackle the harder task of recognizing a broad, open vocabulary of emotional cues, setting crucial benchmarks for the evolving field of emotional AI.

In their study, the team systematically tested 19 leading models, finding that the integration of audio, video, and text produced the most accurate results, with video being especially critical. The findings reveal that open-source models can compete closely with their closed-source counterparts, offering insights crucial for developing more sophisticated emotion recognition technologies.

Recent advances in Large Language Models (LLMs) underscore the rapid evolution of multimodal AI, with significant contributions from major labs such as Google and Alibaba. Google’s Gemini, a family of models that processes text, images, audio, and video, is designed to function as an agentic AI. Alibaba’s Qwen series, which spans audio and language models, has emphasized performance gains in Qwen2.5, while DeepSeek has applied reinforcement learning to strengthen its models’ reasoning capabilities.

A key focus of current research is the development of prompting techniques that enhance LLM performance. Strategies such as chain-of-thought prompting, self-consistency, and least-to-most prompting are being explored to refine how these models reason. Direct Preference Optimization, a preference-alignment technique that avoids a full reinforcement learning loop, is also being applied to improve the quality of model responses. The scope of this research extends beyond text into multimodal understanding, particularly video and audio processing. Models like LLaVA-Video and Tarsier2 are advancing video comprehension, while Qwen-Audio targets unified audio-language processing. Researchers are also investigating ways to improve temporal understanding in video LLMs and to scale the performance of open-source multimodal models.
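To make the self-consistency idea concrete, the sketch below samples several chain-of-thought responses and takes a majority vote over the final answers. It is illustrative only: `query_model` is a hypothetical placeholder for any chat-completion client, not an interface from the study.

```python
# Illustrative sketch of self-consistency prompting: sample several
# chain-of-thought responses and keep the most frequent final answer.
from collections import Counter

def query_model(prompt: str, temperature: float = 0.7) -> str:
    """Hypothetical LLM call; replace with a real API client."""
    raise NotImplementedError

def self_consistent_answer(question: str, n_samples: int = 5) -> str:
    prompt = (
        "Let's think step by step.\n"
        f"Question: {question}\n"
        "Give your reasoning, then end with 'Answer: <label>'."
    )
    answers = []
    for _ in range(n_samples):
        response = query_model(prompt, temperature=0.7)
        # Keep only the text after the final 'Answer:' marker.
        answers.append(response.rsplit("Answer:", 1)[-1].strip().lower())
    # Majority vote across the sampled reasoning paths.
    return Counter(answers).most_common(1)[0][0]
```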

A comprehensive evaluation framework built on the OV-MERD dataset assesses the reasoning, fusion strategies, and prompt design of 19 mainstream multimodal large language models (MLLMs) in open-vocabulary emotion recognition. This benchmarking reveals both the capabilities and the limitations of current MLLMs in understanding nuanced emotional expressions. The study builds on earlier methodologies based on emotional cues, extending them with new architectures for improved performance.
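Open-vocabulary emotion recognition is typically scored by comparing the set of emotion words a model produces against a reference set from annotators. The paper's exact protocol on OV-MERD may differ (for instance, it may group synonymous labels); the snippet below is a simplified, hypothetical stand-in using set-based precision, recall, and F1.

```python
# Illustrative set-overlap scoring for open-vocabulary emotion labels.
# The actual OV-MERD evaluation may use a different matching scheme;
# this is a simplified stand-in, not the paper's metric.

def set_f1(predicted: set[str], reference: set[str]) -> dict[str, float]:
    predicted = {p.lower() for p in predicted}
    reference = {r.lower() for r in reference}
    overlap = predicted & reference
    precision = len(overlap) / len(predicted) if predicted else 0.0
    recall = len(overlap) / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical example: the model predicts three labels, annotators listed two.
print(set_f1({"anxious", "worried", "tense"}, {"worried", "nervous"}))
```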

Through experimentation, researchers determined that a two-stage trimodal fusion—which integrates audio, video, and text—achieves optimal performance in emotion recognition. Video was identified as the most critical modality, significantly enhancing accuracy compared to audio or text alone. In-depth analysis of prompt engineering indicated a surprisingly narrow performance gap between open-source and closed-source LLMs. This research has established essential benchmarks and offers actionable guidelines for advancing fine-grained affective computing.
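One way to picture a two-stage trimodal fusion is at the prompt level: first turn the video and audio into intermediate descriptions, then fuse those descriptions with the transcript in a single prompt for the final emotion judgment. The sketch below is a hypothetical rendering under that assumption; `describe_video`, `describe_audio`, and `query_model` are placeholders, not components from the paper.

```python
# Hypothetical two-stage, prompt-level trimodal fusion sketch.
# Stage 1: derive intermediate descriptions from video and audio.
# Stage 2: fuse those descriptions with the transcript for the final prediction.

def describe_video(video_path: str) -> str:
    """Placeholder: a video-capable model would describe facial and bodily cues."""
    raise NotImplementedError

def describe_audio(audio_path: str) -> str:
    """Placeholder: an audio-capable model would describe prosody and tone."""
    raise NotImplementedError

def query_model(prompt: str) -> str:
    """Placeholder for a text LLM that produces the final emotion labels."""
    raise NotImplementedError

def two_stage_fusion(video_path: str, audio_path: str, transcript: str) -> str:
    visual_cues = describe_video(video_path)    # stage 1a: video description
    acoustic_cues = describe_audio(audio_path)  # stage 1b: audio description
    prompt = (                                  # stage 2: trimodal fusion
        "Given the following evidence, list the emotions the speaker expresses "
        "as a comma-separated set of words.\n"
        f"Visual cues: {visual_cues}\n"
        f"Acoustic cues: {acoustic_cues}\n"
        f"Transcript: {transcript}\n"
        "Emotions:"
    )
    return query_model(prompt)
```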

Insights from this research suggest that complex reasoning models, despite their advantages, do not necessarily outperform simpler models at direct emotion identification. The study highlights the need for future work on more comprehensive datasets, multilingual evaluations, and more sophisticated multimodal fusion techniques, which could further refine AI’s ability to understand and interpret emotions. As the field evolves, applications in domains such as healthcare and education promise significant advances in human-computer interaction.

👉 More information
🗞 Pioneering Multimodal Emotion Recognition in the Era of Large Models: From Closed Sets to Open Vocabularies
🧠 ArXiv: https://arxiv.org/abs/2512.20938
