This startup says it can diagnose data centre faults in 38 seconds

A pilot with a Singapore operator will put that to the test.

Photo Credit: Paul Mah. Singapore Tropical Data Centre Testbed (STDCT).

When something fails in a data centre today, it can take up to an hour just to find the fault. Fran Zaide of Necora told me her startup cuts that to as little as 38 seconds.

Fran reached out after seeing a wefie of me and James Rix posted by Ross Bendix in February. Though I might not have been very responsive initially, I was glad we eventually met. Here's what I learned.

The trouble with troubleshooting

Modern data centres are heavily networked with advanced monitoring and control systems, from DCIM and BMS to SCADA. That should make finding problems easy. In reality, root cause analysis is often a convoluted, manual process. The cause might not even lie with the system that fired the alarm.

Let's say system temperature is up. Is it the chillers, CRACs, CDUs, or the GPUs themselves? Operators often have to work through the entire dependency chain to ascertain the root cause. Throw in complex topologies, manual SOPs, and the occasional false alarm, and it's no wonder that three out of four outages stem from a misdiagnosis or late diagnosis by human operators.
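
To make the dependency-chain idea concrete, here is a minimal Python sketch of the kind of upstream walk an operator performs by hand. It is not how PodIQ works, which I haven't seen. The topology, the component names, and the find_root_causes helper are all hypothetical, and a real facility would have far more components and far messier health signals.

```python
from typing import Dict, List, Set

# Hypothetical topology: each component maps to the upstream systems it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "gpu_rack": ["cdu", "crac"],
    "cdu": ["chiller"],
    "crac": ["chiller"],
    "chiller": [],
}


def find_root_causes(
    alarmed: str,
    depends_on: Dict[str, List[str]],
    unhealthy: Set[str],
) -> List[str]:
    """Walk upstream from the alarmed component and return the unhealthy
    components that have no unhealthy dependency of their own."""
    roots: List[str] = []
    seen: Set[str] = set()
    stack = [alarmed]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        bad_upstream = [d for d in depends_on.get(node, []) if d in unhealthy]
        # An unhealthy node whose dependencies are all healthy is a candidate root cause.
        if node in unhealthy and not bad_upstream:
            roots.append(node)
        # Keep following only the unhealthy branches of the chain.
        stack.extend(bad_upstream)
    return sorted(roots)


# The rack raised the alarm, but the chain points further upstream.
print(find_root_causes("gpu_rack", DEPENDS_ON, {"gpu_rack", "cdu", "chiller"}))
```

In this toy case the alarm fires on the GPU rack, yet the walk ends at the chiller. An empty result would mean nothing in the chain looks unhealthy, which is one way a false alarm might show up.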

There's a model for that

When Fran told me her team had created an AI system called PodIQ that addresses this head-on, I was sceptical. Truth be told, firms are slapping an "AI" label on everything these days.

However, the team counts experts in both AI and data centre operations among its supporters. So how does it work? PodIQ runs off a 1U server with an onboard GPU, housing a local GenAI physics-aware reasoning engine, a digital twin of the specific data centre, and 90 days of raw telemetry data.

The result is deterministic reasoning, with full traceability for operators to see how every conclusion is reached. This is an important distinction. Rather than a black box that simply outputs an answer, the system shows its working. Fran says that future iterations of PodIQ will leverage machine learning to offer predictive capabilities as well.

Going into pilot

How well does it work? I have no way of knowing. So far, the team has done early modelling using open-source data and is working with the Research Institutes of Sweden (RISE) on data sets.

Fran and her co-founder Ina Mae Leoro point to their atmospheric dynamics and climate modelling experience, which gives them a different lens on the problem. The physics of heat dissipation and airflow in a data centre is, after all, not entirely unlike modelling weather systems.

It's early days for PodIQ, and only real-world deployments will tell the full story. But there are two things going for them. Fran and Ina are currently in Singapore as part of Antler's extremely selective residency programme for founders. And the startup says it has already signed an MOU with a data centre operator in Singapore to deploy a pilot.