In late 2025, Meituan, China’s leading super-app for food delivery, travel, and local services, made headlines in the AI community by open-sourcing LongCat-Image, a groundbreaking bilingual (Chinese-English) foundation model designed for text-to-image generation and editing. With just 6 billion parameters, the lightweight diffusion-based model delivers photorealistic outputs, exceptional text rendering, and production-ready performance that rivals or even exceeds that of much larger open-source and proprietary systems.
Developed by Meituan’s dedicated LongCat AI team, LongCat-Image emerges amidst fierce competition in multimodal AI from Chinese tech giants such as Alibaba, Tencent, and ByteDance. It distinguishes itself by prioritizing efficiency, real-world usability, and multilingual capability over sheer parameter count — a pivot away from the prevailing notion that “bigger is better” in the realm of AI image generation.
This model, built on a hybrid Multimodal Diffusion Transformer (MM-DiT) architecture, requires approximately 17–18 GB of VRAM for inference, enabling faster generation times and lower deployment costs than models boasting over 20 billion parameters. Benchmarks confirm its prowess: as of December 2025, LongCat-Image ranks second among all open-source models on T2I-CoreBench, trailing only the 32-billion-parameter Flux2.dev. It achieves state-of-the-art results on various image editing benchmarks and scores an impressive 90.7 on the ChineseWord benchmark, demonstrating high accuracy in rendering all 8,105 standard Chinese characters.
The success of LongCat-Image is attributed to meticulous data curation rather than brute-force scaling. The LongCat AI team employed rigorous filtering to exclude low-quality and AI-generated images, and applied a multi-stage refinement process spanning pre-training, mid-training, supervised fine-tuning, and reinforcement learning with reward models. A specialized character-level encoding strategy further enhances legibility and fidelity in both languages, particularly when the desired text is specified in quotation marks within the prompt.
LongCat-Image excels in several key areas. It demonstrates superior bilingual text rendering, embedding accurate and stable Chinese and English text into images, which has historically posed challenges for diffusion models. This capability is especially valuable for applications in e-commerce, advertising, and cross-border design workflows. The model also produces studio-grade visuals characterized by believable lighting, accurate textures, and physics-aware object placement, earning high scores in subjective mean opinion tests for realism and compositional quality.
The LongCat family includes several variants: LongCat-Image, the core text-to-image model; LongCat-Image-Edit, designed for instruction-based editing; and LongCat-Image-Dev, which provides mid-training checkpoints for fine-tuning and research. All variants support Diffusers integration, LoRA adapters, and ComfyUI workflows, and ship with a full training code release under the Apache 2.0 license, significantly lowering the barrier for developers to customize styles and build production pipelines.
Meituan’s focus on real-world applications is evident in the model’s design, which aims for fast, high-resolution output, batch generation for marketing assets, and reliable instruction following without layout drift. Early user feedback from platforms like Reddit highlights LongCat-Image’s editing consistency, which stands in contrast to competitors that often distort characters during refinements.
LongCat-Image is already powering features in Meituan’s own applications and web platforms, underscoring its immediate commercial viability. The release, accompanied by a technical report and resources on platforms like Hugging Face and GitHub, signals Meituan’s ambition to cultivate an open, collaborative AI ecosystem in China. In an industry where many models are closed-source and resource-intensive, LongCat-Image showcases how intelligent architecture and clean data can yield high-performance results at a fraction of the cost. For developers and businesses requiring reliable bilingual image tools, particularly in handling Chinese content, this 6 billion parameter model may well become a new standard in open-source offerings.
Explore it today via the Meituan LongCat-Image repository on GitHub or on Hugging Face. With all weights, pipelines, and documentation readily available, the era of accessible, high-fidelity bilingual image AI has received a significant upgrade.