Nvidia splits AI workloads with new Rubin CPX chip

Nvidia's new inference-only GPU locks customers deeper into its ecosystem.

Photo Credit: Nvidia

At the AI Infrastructure Summit earlier today, Nvidia announced the Rubin CPX, a purpose-built GPU based on the next-gen Rubin architecture. How is this new category of GPU, designed specifically for AI inference, different, and what is Nvidia trying to do in the data centre?

Here's what you need to know about the new chip and its impact on the data centre.

Made for inferencing

The names sure get confusing. To recap, the upcoming Rubin architecture is the successor to the current-gen Blackwell (i.e. the B200s). For its part, the Rubin CPX is a completely new class of GPU designed specifically for inference, the work that happens after a model is trained. It features 4-bit math and a monolithic die to keep costs down, plus 128GB of more affordable on-package GDDR7 memory.

At its core, the Rubin CPX is an AI inference chip built to handle very large, "million-token" context windows. Nvidia touted its strength in "long-context" inference processing, which lets AI models reason over entire codebases for coding assistants, or tackle video search and high-quality generative video.
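
For a rough sense of what a million-token window means in practice, here is a back-of-the-envelope calculation; the file count and tokens-per-file figures are purely illustrative assumptions, not Nvidia's numbers.

```python
# Back-of-the-envelope: how a whole-codebase prompt reaches a million tokens.
# Both figures below are illustrative assumptions, not measured values.
files = 500                  # a mid-sized repository
avg_tokens_per_file = 2_000  # roughly a few hundred lines of code per file

total_context_tokens = files * avg_tokens_per_file
print(f"{total_context_tokens:,} tokens")  # 1,000,000 -> a "million-token" prompt
```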

To support that, the Rubin CPX packs 3x as much memory as the current multi-workload L40S GPU.

Photo Credit: Nvidia

Deploying the Rubin CPX

To be clear, the Rubin CPX does not handle the generative phase of AI inference on its own. Instead, it plugs into a new "disaggregated inference architecture" alongside other Nvidia GPUs: the compute-heavy prefill (context) phase runs on Rubin CPX, while token-by-token generation runs on standard Rubin GPUs. This effectively means users will get locked further into the Nvidia ecosystem.
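
To make the split concrete, here is a minimal Python sketch of how a disaggregated serving loop might route the two phases to different workers. The class names, the direct hand-off and the toy decode rule are all illustrative assumptions, not Nvidia's actual software stack; the underlying idea is that prefill is compute-bound while decode is memory-bandwidth-bound, so each phase can land on hardware tuned for it.

```python
# Illustrative sketch of disaggregated inference: a "context" worker
# (standing in for Rubin CPX) runs the prefill pass over the full prompt
# and hands the resulting KV cache to a "generation" worker (standing in
# for a Rubin GPU), which decodes tokens one at a time.
# All class and function names here are hypothetical.

from dataclasses import dataclass


@dataclass
class KVCache:
    """Placeholder for the attention key/value state built during prefill."""
    prompt_tokens: list


class ContextWorker:
    """Prefill stage: one compute-heavy pass over the entire (long) context."""
    def prefill(self, prompt_tokens: list) -> KVCache:
        # In a real system, this is where a million-token context would be
        # processed in parallel on the inference-only chip.
        return KVCache(prompt_tokens=prompt_tokens)


class GenerationWorker:
    """Decode stage: memory-bandwidth-heavy, token-by-token generation."""
    def decode(self, cache: KVCache, max_new_tokens: int) -> list:
        output = []
        for step in range(max_new_tokens):
            # Placeholder next-token rule; a real decoder samples from the
            # model's logits using the KV cache built during prefill.
            output.append(len(cache.prompt_tokens) + step)
        return output


def serve_request(prompt_tokens: list) -> list:
    # Route the two phases to different workers instead of one GPU.
    cache = ContextWorker().prefill(prompt_tokens)
    return GenerationWorker().decode(cache, max_new_tokens=8)


if __name__ == "__main__":
    print(serve_request(list(range(16))))
```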

How much power does the Rubin CPX require? A full NVL144 CPX rack will fit into the same power envelope as the Blackwell NVL72, which means data centre operators can easily add CPX capacity without retrofitting.

The Rubin CPX can be installed as peripheral cards in existing servers or deployed in specialised solutions. It will debut by the end of 2026, which in my mind means 2027 for broader availability.

The future of AI lock-in

What is Nvidia trying to do here? The Rubin CPX deepens reliance on the company's end-to-end AI platform and its place in data centres, but there's more to it than just lock-in.

Nvidia is moving fast to defend the inference portion of its AI factory strategy by lowering cost and increasing inference output. The disaggregated architecture makes sense from an efficiency standpoint - why use expensive training GPUs for inference when a purpose-built, lower-cost chip can handle that workload?

This approach could actually reduce total cost of ownership for AI deployments. Companies can scale inference capacity independently from training capacity, optimising their infrastructure spending based on actual workloads.

Of course, there's a catch. The Rubin CPX works best with other Nvidia GPUs in that disaggregated architecture. Once you've invested in CPX for inference, switching to competitors becomes more complex. With AMD and Intel pushing hard into AI inference, the timing isn't coincidental.

What do you think of the Rubin CPX and Nvidia's latest move? Is disaggregated inference a genuine innovation that benefits data centres, or primarily a strategic play to maintain market dominance?