Mistral AI has published a comprehensive investigation into a memory leak affecting its deployment of vLLM, the open-source inference engine, which came to light during pre-production testing of its Mistral Medium 3.1 model. The leak, a steady increase in memory consumption, emerged only under specific conditions: disaggregated serving with graph compilation enabled. Left unchecked, it would drive the server to an out-of-memory state after a few hours of operation. The investigation, detailed in Mistral’s new Engineering Deep Dive series, shows how hard the root cause of such an elusive issue can be to pin down across multiple layers of the stack.
The investigation began systematically, with Python memory profiling tools, before moving to more advanced methods, including kernel-level tracing. Initial attempts with tools such as Memray and Guppy 3 turned up nothing, prompting the team to engage with the vLLM community on GitHub, where other users confirmed they had hit similar issues.
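The article does not show the profiling commands themselves. As a minimal sketch of why Python-level heap profilers come up empty here, the standard-library tracemalloc module (a simpler cousin of Memray and Guppy 3) only sees allocations routed through Python’s allocator, so memory that a native library maps directly never appears in its reports:

```python
import tracemalloc

tracemalloc.start()

# Allocate through the Python allocator: tracemalloc sees this.
payload = [bytes(1024) for _ in range(1000)]

snapshot = tracemalloc.take_snapshot()
top = snapshot.statistics("lineno")[0]
print(f"top allocation site: {top.size / 1024:.0f} KiB")

# Memory mmap'ed directly by a native extension or library (as UCX does)
# never passes through Python's allocator, so it would not appear here --
# which is exactly why these profilers showed nothing for this leak.
```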
Digging deeper, the team employed Heaptrack, a profiler that records memory operation events. It showed that heap memory remained stable while the peak resident set size (RSS) kept growing, meaning the leak was happening outside the profiled heap. Subsequent monitoring with the pmap command revealed that only certain anonymous memory mappings were continuously growing, potentially linked to memory-resizing system calls such as mremap.
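The signal pmap provides can also be gathered programmatically. The snippet below is an illustration of the idea, not the team’s tooling: on Linux it parses /proc/&lt;pid&gt;/maps and totals anonymous (pathless) mappings, the category that kept growing; sampled periodically, a rising total reproduces what the team observed:

```python
def anonymous_mapping_total(pid="self"):
    """Sum the sizes of anonymous (pathless) mappings of a process, in bytes."""
    total = 0
    with open(f"/proc/{pid}/maps") as f:
        for line in f:
            fields = line.split()
            # Lines with no backing path (fewer than 6 fields) are
            # anonymous mappings, typically created via mmap/mremap.
            if len(fields) < 6:
                start, end = (int(x, 16) for x in fields[0].split("-"))
                total += end - start
    return total

print(f"anonymous mappings: {anonymous_mapping_total() / 2**20:.1f} MiB")
```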
To narrow down the source of the leak, the team used bpftrace, a tool for tracing system calls in real time. This confirmed that the leak stemmed from mmap calls rather than mremap, with each allocation traced back to the glibc syscall wrapper. The challenge remained, however, to pinpoint the exact call site behind the growing allocations.
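A probe along the following lines (an illustrative sketch, not the team’s actual script) counts mmap and mremap syscalls per user-space stack for a given PID, which is how growing mappings can be attributed to a call site:

```bpftrace
// trace_mmap.bt (hypothetical file name); run as: sudo bpftrace trace_mmap.bt <pid>
tracepoint:syscalls:sys_enter_mmap
/pid == $1/
{
    // Aggregate mmap calls by the user-space stack that issued them.
    @mmap_stacks[ustack] = count();
}

tracepoint:syscalls:sys_enter_mremap
/pid == $1/
{
    @mremap_calls = count();
}
```

On exit, bpftrace prints the maps, so the stacks responsible for the most mmap activity stand out; in this case, the stacks ended in the glibc wrapper, which is why the team still needed a debugger to see the full caller context.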
By scripting GDB to set conditional breakpoints on the syscall address, the team was able to analyze memory allocations in real time. This ultimately revealed that the leak was attributable to UCX (Unified Communication X), a high-performance communication library used to optimize data transfers. UCX’s broad interception of mmap calls, particularly for InfiniBand memory management, left memory regions improperly released, and these accumulated over time.
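Conceptually, the GDB automation can look like the script below. This is a generic illustration rather than the breakpoints Mistral actually used: the register convention assumes x86-64 (where mmap’s length argument arrives in %rsi), and the size threshold is invented:

```gdb
# Break on glibc's mmap wrapper, but only for large allocations.
break mmap if $rsi > 16*1024*1024
commands
  silent
  bt 20        # print the call chain that led to this allocation
  continue     # resume immediately so the server keeps serving
end
continue
```

Because the breakpoint is conditional and resumes automatically, the server keeps running while GDB logs a backtrace for every large mmap, which is what exposed UCX frames in the call chain.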
Working with the vLLM and UCX teams, Mistral AI identified a fix: disabling UCX’s memory hooking mechanism by setting the environment variable UCX_MEM_MMAP_HOOK_MODE=none. This adjustment mitigated the memory leak while preserving system performance. The teams also established that while UCX maintains a registration cache for InfiniBand operations, its cleanup mechanism was not being triggered under these conditions, leading to the accumulation of unreleased memory.
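Applying the mitigation is a one-line environment change made before the serving process starts; the variable name comes from the article, while the echo below is merely an illustrative sanity check (the actual launch command depends on the deployment):

```shell
# Disable UCX's interception of mmap/munmap before starting the server.
export UCX_MEM_MMAP_HOOK_MODE=none

# Sanity check: confirm the setting is visible to child processes.
echo "UCX_MEM_MMAP_HOOK_MODE=$UCX_MEM_MMAP_HOOK_MODE"
```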
This investigation illustrates the intricacies involved in diagnosing issues within modern software ecosystems, where multiple layers of dependencies can obscure the source of performance problems. Mistral AI’s experience underscores the importance of collaboration and transparency in addressing such challenges, highlighting the need for continuous refinement in dependency management practices.


















































