Liquid cooling leak damages millions of dollars in GPUs
Overhead pipe mishap in Southeast Asia floods data centre aisle, realising the biggest fear about liquid cooling.

When liquid cooling leaks, the only thing flowing is panic. Well, a leak at a data centre yesterday took out a GPU cluster. It's a disaster.
I received this tip from a trusted source. The incident perfectly illustrates the risks I've previously warned about with the rush to liquid cooling.
Liquid cooling
Just weeks ago, I wrote about how liquid cooling is all the rage but cautioned enterprises against rushing into it. My reasoning: most non-AI workloads don't need it.
In my previous post, I highlighted some downsides of liquid cooling, namely the potential for a single point of failure and the risks of water in the data hall. I also noted that the risk of leaks is non-zero, that liquid cooling is more complex than air cooling, and that the commonly used PG25 coolant can break down.
Of course, the latest GPU servers demand direct-to-chip liquid cooling. So organisations rolling out AI workloads might have no choice. That's the cruel irony - the workloads that cost the most to deploy are also the ones that require the riskiest cooling methods.
So, what happened?
Yesterday, a data centre in Southeast Asia experienced a leak that knocked a cluster of the latest GPU servers out of commission. In the video I saw, the aisle floor was covered with a thick layer of water, which employees were trying to squeegee away. An overhead pipe had burst, affecting a dozen racks.
I'm sure most racks were switched off as a precaution. But at a few million USD per rack, damage to even one would be an unmitigated disaster - the financial implications are staggering. I've kept details sparse because my aim is to encourage learning and discussion, not to name and shame.
What can we do?
Direct-to-chip liquid cooling is still relatively new in Southeast Asia. We have large-scale deployments, albeit in just a handful of data centres. One hypothesis for the leak is poor workmanship due to a shortage of skilled manpower. Given the layers of subcontractors involved, is this something that data centre operators can prevent?
A suggestion I've heard is to lay liquid cooling pipes under the server racks instead of above them, complete with water traps. This would mean a higher cost for reinforced raised flooring, though.
Of course, two-phase liquid cooling or immersion cooling, neither of which uses water, could be safer alternatives. On the other hand, they work quite differently and won't suit certain Nvidia products.
The incident highlights a fundamental tension in the industry. We need liquid cooling for AI workloads, but we're still learning how to deploy it. Yesterday's disaster might not be the last.