Do you really need liquid cooling? Why you should think twice

The hidden complexities that vendors don't tell you about liquid cooling.

Photo Credit: Paul Mah

Liquid cooling is all the rage. But before you rush to retrofit your enterprise data centre for high-density workloads, have you properly considered the risks associated with liquid cooling?

But is it necessary?

Yes, AI is in every other headline these days. And if one were to take what certain brands are insinuating, every data centre rack will be running at 100kW by next year. I don't believe this will be the case.

As I've previously written, data centres are diverging into AI data centres and traditional data centres. I do believe rack densities will continue to increase for both, but at quite different rates.

Large-scale AI deployments need systems in close proximity for reasons relating to scale-up capability, manageability, and AI-training performance. And they run hot by default, so they need high-density cooling. When you're packing thousands of GPUs together for training runs, liquid cooling becomes essential.

The same isn't true for enterprise workloads. Enterprise HPC systems typically top out at around 25kW per rack, while the average rack density, according to the Uptime Institute, was just 8kW in 2024. That's a significant gap between what's being marketed and what's actually deployed. Liquid cooling simply isn't mandatory at 8kW, 10kW, or even 25kW.
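To see why density, not hype, is the deciding factor, consider the airflow needed to carry heat out of a rack. A minimal sketch using standard air properties and an assumed 11°C supply-to-exhaust temperature rise (the figures are illustrative, not a design calculation):

```python
# Airflow required to remove rack heat with air cooling alone.
# Q = m_dot * cp * dT  =>  volumetric flow = Q / (rho * cp * dT)
AIR_DENSITY = 1.2   # kg/m^3 at roughly 20 C
AIR_CP = 1005.0     # J/(kg*K), specific heat of air
DELTA_T = 11.0      # K, assumed temperature rise across the rack

def airflow_m3_per_hour(rack_kw: float) -> float:
    """Volumetric airflow (m^3/h) needed to carry away rack_kw of heat."""
    watts = rack_kw * 1000.0
    m3_per_s = watts / (AIR_DENSITY * AIR_CP * DELTA_T)
    return m3_per_s * 3600.0

for kw in (8, 25, 100):
    print(f"{kw:>3} kW rack -> {airflow_m3_per_hour(kw):,.0f} m^3/h of air")
```

An 8kW rack needs a little over 2,000 m³/h of air, which conventional containment handles comfortably; a 100kW rack needs more than ten times that through the same footprint, which is where air cooling stops being practical.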

More complex than advertised

There are various downsides to liquid cooling that tend to be omitted from vendor discussions. After all, why mention them when AI deployments aren't possible without it? Enterprises need to know about them, though, to make an informed decision. Here are the key considerations.

Concurrent maintainability challenges

The hallmark of a Tier III data centre, concurrent maintainability allows any critical component in the data centre to be taken out of service for maintenance or replacement without impacting IT operations.

It's possible to design a concurrently maintainable liquid-cooling deployment. But expect complexity and costs to go up significantly. For example, you might need to double your CDU footprint to ensure proper redundancy. Every component needs a backup, and every backup needs proper isolation valves and bypass routes.
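The footprint impact of that redundancy is simple arithmetic. A rough illustration, with a hypothetical 200kW-per-CDU capacity (real CDU ratings and redundancy policies vary by vendor and design):

```python
import math

def cdus_required(total_kw: float, cdu_kw: float, scheme: str = "N") -> int:
    """Count CDUs for a cooling load under a given redundancy scheme.

    scheme: "N" (no redundancy), "N+1" (one spare), "2N" (fully duplicated).
    Capacities here are illustrative -- check vendor datasheets.
    """
    n = math.ceil(total_kw / cdu_kw)
    if scheme == "N":
        return n
    if scheme == "N+1":
        return n + 1
    if scheme == "2N":
        return 2 * n
    raise ValueError(f"unknown scheme: {scheme}")

# 600 kW of liquid-cooled load served by 200 kW CDUs:
for scheme in ("N", "N+1", "2N"):
    print(f"{scheme:>3}: {cdus_required(600, 200, scheme)} CDUs")
```

Going from bare-minimum N to a fully duplicated 2N scheme doubles the CDU count, and each extra unit brings its own plumbing, valves, and floor space.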

Single point of failure concerns

The next step up is ensuring that a data centre deployment isn't laid low by a catastrophic failure of any equipment. One approach here is to ensure no single point of failure exists.

Building in fault tolerance is possible with liquid cooling, but it's considerably trickier. You need careful plumbing design, continuous monitoring systems, proactive leak detection, and modularity of components. Each of these adds layers of complexity and cost that don't exist with traditional air cooling.

Additional risks

There are also risks that simply aren't present with air cooling. For one, the risk of leaks is non-zero, and a leak can cause major problems when a non-dielectric liquid is used. Even a small leak can cascade into significant downtime.

Moreover, the PG20 coolant commonly used in direct-to-chip cooling can break down and turn acidic over time as the glycol oxidises. This can accelerate corrosion and reduce heat transfer efficiency, requiring regular monitoring and maintenance of fluid chemistry.
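Fluid chemistry thus becomes another telemetry stream to watch. A hypothetical monitoring check along those lines (the thresholds are illustrative, not vendor or ASHRAE guidance):

```python
# Hypothetical coolant-health check for a direct-to-chip loop.
# Glycol coolants oxidise into organic acids over time, so a falling pH
# is an early warning sign. All thresholds below are illustrative only.
PH_MIN = 7.0                # below this, the mix is trending acidic
PH_MAX = 9.5                # above this, suspect inhibitor imbalance
CONDUCTIVITY_MAX = 5000.0   # uS/cm; a rising value suggests contamination

def coolant_alerts(ph: float, conductivity_us_cm: float) -> list[str]:
    """Return a list of alert strings for an out-of-range coolant sample."""
    alerts = []
    if ph < PH_MIN:
        alerts.append(f"pH {ph} below {PH_MIN}: possible glycol breakdown")
    elif ph > PH_MAX:
        alerts.append(f"pH {ph} above {PH_MAX}: check inhibitor package")
    if conductivity_us_cm > CONDUCTIVITY_MAX:
        alerts.append("conductivity high: check for contamination")
    return alerts

print(coolant_alerts(6.4, 1200.0))  # acidic sample triggers one alert
```

The point isn't this particular check but that air cooling has no equivalent of it: nobody samples the chemistry of the air in a cold aisle.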

The reality check

Liquid cooling is advancing rapidly, so the gap between its complexity and traditional cooling is narrowing. The technology is becoming more reliable and easier to manage. But it's still a significant step up from air cooling in terms of operational requirements.

Here's my advice: unless you're running actual AI training workloads or have racks consistently exceeding 30kW, you probably don't need liquid cooling. The added complexity, cost, and risk simply aren't justified for most enterprise workloads.

Don't let the AI hype drive infrastructure decisions that don't align with your actual workload requirements. The question isn't whether liquid cooling is the future – it's whether it's necessary for your specific present needs.