Large language models (LLMs) are transforming the financial trading sector by facilitating advanced analysis of unstructured data, which provides actionable insights for traders. These sophisticated AI systems evaluate a broad spectrum of information, including financial news, social media sentiment, and market data, to forecast stock price fluctuations and automate investment strategies with exceptional precision.
The Securities Technology Analysis Center (STAC) has spent over 15 years developing benchmarks tailored to the financial industry. Its recently introduced STAC-AI benchmark helps firms evaluate the end-to-end retrieval-augmented generation (RAG) and LLM inference pipeline, a critical component in deploying these technologies successfully.
In the context of this advancement, the STAC-AI LANG6 benchmark focuses on LLM inference performance, specifically examining the hardware and software stack using the Llama 3.1 8B Instruct and Llama 3.1 70B Instruct models. It incorporates custom datasets derived from EDGAR filings that model summarization tasks relevant to financial trading and investment advice.
The benchmark evaluates two inference scenarios: batch mode and interactive mode. In batch mode, all requests are submitted at once and throughput is measured; in interactive mode, requests arrive at random intervals, and metrics such as reaction time and words per second per user are assessed. One combination, the Llama 3.1 70B Instruct model with the EDGAR5 dataset, is excluded from interactive mode.
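Interactive-mode metrics like these can be computed directly from per-request streaming timestamps. The sketch below is illustrative only; the class and function names are hypothetical and not part of the STAC-AI LANG6 specification:

```python
from dataclasses import dataclass, field

@dataclass
class RequestTrace:
    """Timestamps (in seconds) recorded for one streamed response.
    Field names are illustrative, not taken from the benchmark spec."""
    sent_at: float
    word_times: list = field(default_factory=list)  # arrival time of each word

def reaction_time(trace: RequestTrace) -> float:
    """Delay from sending the request to receiving the first word."""
    return trace.word_times[0] - trace.sent_at

def words_per_second(trace: RequestTrace) -> float:
    """Per-user streaming rate: words delivered per second since the request was sent."""
    elapsed = trace.word_times[-1] - trace.sent_at
    return len(trace.word_times) / elapsed

# Example: 4 words streamed by t=2.0s, with a 0.5s wait for the first word
trace = RequestTrace(sent_at=0.0, word_times=[0.5, 1.0, 1.5, 2.0])
print(reaction_time(trace))     # 0.5
print(words_per_second(trace))  # 2.0
```

Averaging these per-user figures across many concurrent traces is what lets a benchmark trade aggregate throughput against individual responsiveness.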
Output quality also matters: each response is compared against baseline LLM-generated answers. In addition, the benchmark requires specific preprocessing steps that, in real-world deployments, might be better suited to server-side execution, placing additional demands on CPU resources.
Technical Details
The analysis compares two on-premises NVIDIA Hopper-based servers provided by HPE with a cloud-based NVIDIA Blackwell node. The benchmarking procedure required post-training quantization: the models were quantized with the NVIDIA TensorRT Model Optimizer, using different quantization formats on Hopper and Blackwell to optimize performance. The TensorRT-LLM inference framework was used for efficient model execution while maintaining a familiar PyTorch development environment.
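The article does not detail the Model Optimizer recipes used. As a rough illustration of what post-training quantization does, here is a minimal symmetric per-tensor scheme in plain Python; this is a conceptual sketch, not NVIDIA's implementation:

```python
def quantize(weights, num_bits=8):
    """Symmetric post-training quantization: map floats to signed integers
    using a single per-tensor scale derived from the max magnitude."""
    qmax = 2 ** (num_bits - 1) - 1          # e.g. 127 for 8-bit
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from integers and the stored scale."""
    return [x * scale for x in q]

weights = [0.02, -1.27, 0.63, 1.27]
q, scale = quantize(weights)
print(q)  # [2, -127, 63, 127]
print(dequantize(q, scale))  # close to the original weights
```

Real formats such as FP8 or NVFP4 quantize to low-precision floats with per-block scales rather than integers, but the principle is the same: store compact values plus scales, and accept a small reconstruction error in exchange for memory and bandwidth savings.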
Benchmarking results from both batch and interactive modes show significant advantages for the NVIDIA Blackwell architecture, which outperformed the Hopper servers across all batch-mode scenarios. For example, the Llama 3.1 8B model reached 37,480 words per second (WPS) with the EDGAR4 dataset on the Blackwell node, demonstrating efficient processing of financial data at scale.
Furthermore, single-GPU assessments showed a throughput advantage for the Blackwell architecture, with reported improvements of up to 3.2 times over the previous GPU generation. This uplift positions NVIDIA's newer GPUs as pivotal tools for high-volume financial data processing.
In interactive mode, the balance between token economics and user experience becomes paramount. The analysis illustrated how the Blackwell NVL72 managed to maintain a favorable trade-off between throughput and both reaction time and inter-word latency across various model and dataset configurations. This highlights not only the model’s raw computational power but also its ability to deliver a responsive user experience, a critical factor in financial applications where timing can significantly influence decision-making.
Even when throughput levels were matched, the Blackwell architecture consistently outperformed the Hopper servers in terms of both reaction time and inter-word latency, showcasing its superiority in maintaining performance under load.
As firms seek to implement these advanced technologies, understanding how to benchmark models against specific dataset characteristics remains crucial. A guide for benchmarking TensorRT-LLM with customized data has been made available, outlining steps for quantizing models and preparing datasets to suit particular use cases: launching containers with the necessary dependencies, quantizing models with the NVIDIA Model Optimizer, and generating synthetic datasets to simulate real-world conditions.
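As one illustrative step, a synthetic dataset can be produced by sampling prompt lengths from a target distribution so that the workload resembles real filings. The function and record schema below are hypothetical sketches, not the format expected by TensorRT-LLM's benchmarking tools:

```python
import json
import random

def synth_dataset(n, mean_len=2000, stdev=400, seed=0):
    """Generate n synthetic records whose prompt lengths (in words) follow
    a normal distribution, loosely mimicking the length profile of EDGAR
    filings. Schema and defaults are illustrative assumptions."""
    rng = random.Random(seed)
    records = []
    for i in range(n):
        length = max(1, int(rng.gauss(mean_len, stdev)))
        prompt = " ".join(["filing"] * length)   # placeholder text
        records.append({"id": i, "prompt": prompt, "input_words": length})
    return records

# Each record serializes to one JSON line for a benchmark harness to consume
records = synth_dataset(3)
for r in records:
    print(json.dumps({"id": r["id"], "input_words": r["input_words"]}))
```

Fixing the random seed keeps runs reproducible, which matters when comparing hardware stacks against the same simulated workload.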
Ultimately, the results of the NVIDIA GB200 NVL72 in the STAC-AI LANG6 benchmark signal a new horizon for LLM inference in the financial sector. Delivering up to 3.2 times the performance of older architectures, the new systems achieve both higher throughput and superior interactivity, making them valuable assets for financial institutions seeking to leverage AI-driven insights.
While NVIDIA’s Hopper architecture continues to perform well, even three years post-launch, the enhancements of the Blackwell generation affirm the ongoing evolution of LLM technologies. As firms increasingly adopt AI solutions for trading strategies, the insights gained from this benchmarking will guide future investments and innovations in LLM applications.