December 17, 2025, 10:17 pm IST — The Hao AI Lab at the University of California San Diego has acquired an NVIDIA DGX B200 system to accelerate its research on low-latency large language model (LLM) serving. The investment is expected to significantly speed up work on LLM inference, addressing the growing demand for fast, efficient generative AI responses.
The Hao AI Lab, known for pivotal contributions to the field, including the foundational DistServe work on disaggregated serving, plans to use the new hardware to refine its methods further. The acquisition aligns with ongoing efforts to make AI interactions faster and more efficient, a crucial concern as generative AI becomes part of everyday applications.
At the heart of the lab’s research is the concept of “goodput,” a metric that evaluates LLM serving performance beyond traditional throughput measures, which count tokens generated per second. Throughput captures raw system efficiency but overlooks the user experience: a server can post high token rates while individual requests still wait too long. Goodput instead counts only the work completed within user-defined latency targets, typically time to first token (TTFT) and time per output token (TPOT), ensuring high efficiency without compromising responsiveness.
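To make the distinction concrete, the sketch below contrasts raw throughput with goodput over a toy request trace. The SLO values, request statistics, and names such as RequestStats are hypothetical illustrations, not figures or code from the lab; the point is that DistServe-style goodput counts only requests meeting both the TTFT and TPOT targets.

```python
# Illustrative sketch (hypothetical numbers, not the lab's code): contrast
# raw throughput with goodput under per-request latency targets (SLOs).
from dataclasses import dataclass

@dataclass
class RequestStats:
    ttft_s: float        # time to first token, in seconds
    tpot_s: float        # average time per output token, in seconds
    output_tokens: int   # tokens generated for this request

# Hypothetical SLOs: first token within 200 ms, then <= 50 ms per token.
TTFT_SLO_S = 0.200
TPOT_SLO_S = 0.050

def throughput_tokens_per_s(requests, wall_clock_s):
    """Raw throughput: every generated token counts, regardless of latency."""
    return sum(r.output_tokens for r in requests) / wall_clock_s

def goodput_requests_per_s(requests, wall_clock_s):
    """Goodput: only requests that met both latency targets count."""
    ok = [r for r in requests
          if r.ttft_s <= TTFT_SLO_S and r.tpot_s <= TPOT_SLO_S]
    return len(ok) / wall_clock_s

# Toy trace: two requests meet both SLOs; one misses its TTFT target.
trace = [
    RequestStats(ttft_s=0.15, tpot_s=0.04, output_tokens=128),
    RequestStats(ttft_s=0.45, tpot_s=0.03, output_tokens=256),  # late first token
    RequestStats(ttft_s=0.18, tpot_s=0.05, output_tokens=64),
]
print(throughput_tokens_per_s(trace, wall_clock_s=10.0))  # 44.8: counts all tokens
print(goodput_requests_per_s(trace, wall_clock_s=10.0))   # 0.2: only 2 requests pass
```

A system tuned purely for raw throughput might batch aggressively and score well on the first number while failing the second; optimizing for goodput builds the latency targets into the objective itself.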
This redefinition of performance metrics recognizes the critical need for rapid and consistent responses, particularly in real-world applications such as chatbots, coding assistants, and creative tools. For users, the time to first token can be a decisive factor in how an interaction with an AI system feels. By focusing on goodput, developers can build AI experiences that are more responsive while remaining economically viable.
Technical Details
The Hao AI Lab’s approach hinges on a technique known as prefill/decode disaggregation. Traditionally, the prefill phase, which processes the user’s prompt to produce the first token, and the decode phase, which generates each subsequent token, share the same GPUs. Colocating them creates resource contention, because prefill is compute-bound while decode is bound by memory bandwidth, so each phase slows the other down. By decoupling the two phases onto separate GPUs, the lab’s researchers eliminate this interference and improve performance for both.
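A minimal sketch of the idea follows, assuming a queue-based handoff between the two phases: a prefill worker processes prompts and passes the resulting state to a decode worker running on separate hardware. In a real system such as DistServe, the handoff transfers KV-cache tensors over GPU interconnects rather than Python queues, and the model calls below are placeholders, not the lab's implementation.

```python
# Conceptual sketch of prefill/decode disaggregation (placeholders throughout).
import queue
import threading

prefill_queue = queue.Queue()  # incoming prompts for the prefill worker
decode_queue = queue.Queue()   # prefill -> decode handoff (stands in for a KV-cache transfer)

def prefill_worker():
    # Compute-bound phase: process the full prompt once, emit the first
    # token plus the state the decode phase needs to continue.
    while True:
        prompt = prefill_queue.get()
        if prompt is None:               # shutdown sentinel
            decode_queue.put(None)
            return
        kv_cache = {"prompt": prompt}    # placeholder for real KV-cache tensors
        first_token = "<tok0>"           # placeholder for the model's first token
        decode_queue.put({"kv": kv_cache, "first": first_token})

def decode_worker():
    # Memory-bandwidth-bound phase: generate the remaining tokens one at a
    # time from the transferred state, on hardware separate from prefill.
    while True:
        job = decode_queue.get()
        if job is None:                  # shutdown sentinel
            return
        tokens = [job["first"], "<tok1>", "<tok2>"]  # placeholder generation loop
        print("completed:", " ".join(tokens))

threads = [threading.Thread(target=prefill_worker),
           threading.Thread(target=decode_worker)]
for t in threads:
    t.start()
prefill_queue.put("Explain disaggregated serving")
prefill_queue.put(None)  # drain and shut down the pipeline
for t in threads:
    t.join()
```

Because the two roles never share a device, a heavy prefill burst cannot stall in-flight decoding, which is precisely the interference the disaggregated design removes.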
This disaggregated inference technique substantially improves the responsiveness of LLM serving. Because each phase runs on its own hardware, a deployment can scale prefill and decode capacity independently as workloads grow, without sacrificing low latency or the quality of model responses. NVIDIA Dynamo, an open-source inference framework, supports this disaggregated serving pattern, giving developers the tools to build efficient and responsive generative AI applications. With the DGX B200 system, the Hao AI Lab can refine these methods further and explore the next generation of real-time LLM capabilities.
The ramifications for the broader industry are substantial. As LLMs are integrated into mainstream applications, demand for instantaneous, seamless interactions will only intensify. Research at UC San Diego, backed by cutting-edge hardware, signals not just incremental progress but a potential redefinition of what users expect from AI systems. The goal is truly conversational AI, with the gap between input and response shrunk to near imperceptibility, unlocking new opportunities across sectors from healthcare to entertainment.