Reading between the lines of the Singtel outage
Singtel shared few details. But clues point to a cooling-related failure.
Earlier this week, an outage at Singtel affected some 600,000 customers. What happened exactly? Here's what we can infer.
600,000 customers hit
Singtel customers had it tough this week. On Monday morning, as many as 600,000 mobile customers were hit with an outage on their phones. That's a significant number for a market as small as Singapore.
Singtel says 4G services for affected customers were restored by around 1.30pm and 5G services were progressively restored from about 2.45pm the same day. "Most" customers could connect to 5G within two hours, though it is understood some had to restart their phones first and may not have known to do so.
An unrelated software bug on Tuesday caused further disruption, albeit to a much smaller group of around 2,000 customers. Two incidents in two days did little to inspire confidence.
"Mechanical fault"
According to CEO Ng Tian Chong, the initial problem stemmed from a "mechanical fault" at one of Singtel's network facilities. The choice of words is worth noting: a "mechanical fault" typically refers to physical infrastructure, and to a narrow set of systems within it.
"Though multiple redundancy measures were in place to support seamless service continuity, our situation required reconfiguration which took time to fully take effect," said Singtel.
To be clear, nobody I contacted knows what happened. However, I understand that IMDA is taking the outages extremely seriously.
An educated guess
If I were to make a guess, I would say this is cooling-related. Mobile networks run off core systems that manage things such as authentication, session management, and data routing. These are housed within highly resilient facilities. Call them what you like, but they won't look very different from a traditional data centre.
If cooling, which is a mechanical system, goes down and cannot be restored, the only option is to migrate to a backup system. Diverting data traffic this way takes time: network routes have to be reprogrammed, and the changes then have to propagate. That would be consistent with the hours-long recovery timeline that Singtel described.
It is worth noting that Singtel is also in the process of closing five older data centres in Singapore and moving to its new DC Tuas facility. The older sites are generally very small. Some, like Comcentre 3, are "just a handful of racks." While there is nothing to suggest the migration and the outage are related, system migrations do carry inherent risks. And the timing, at the very least, invites the question.