What really happened at SG1 over the weekend?

New questions.

Photo Credit: Equinix. Photo of a data hall in SG5 data centre, which I visited and wrote about in 2022.

It's been a couple of days since two data halls at Equinix's SG1 data centre experienced a power outage. What really happened, and was it preventable?

On Saturday, my post about SG1's outage and Jan's report on W.Media, published simultaneously, were the first to report on the incident.

A quick recap

Here's the official statement I shared on Saturday:

"Earlier today during scheduled maintenance at SG1 that required a reduction in power redundancy, we experienced equipment failure resulting in a power outage to two data halls in the data centre. The issue has been resolved." - Equinix.

In my initial post, I cited a source who told me a mistake was made. Shortly after, Equinix contacted me to assert that there was no human error.

For now, another contact tells me there's no root cause analysis (RCA) yet. An RCA looks at the underlying cause of a fault, and I'm sure one will be released when ready.

Connecting the dots

In the meantime, here are some additional pointers that came to mind after reading and thinking through the 70+ comments:

  • SG1 data centre is 25 years old, though upgraded over the years.
  • Power supply configuration is unusual.

Some new questions:

a. Why didn't backup power kick in immediately?

b. Why did critical equipment fail at the exact time of maintenance, as alluded to by the official statement?

c. Equinix did not dispute that the gantry system was briefly down. Was it because servers powering it were in affected data halls? This could be a noteworthy lesson for data centre operators.

When the power goes out

The Straits Times ran a report about ViewQwest customers experiencing disruptions first at 11.15am and then again at 4.30pm.

Assuming they were in SG1, why would that be so?

Short answer: Abrupt power loss has a nasty habit of killing systems.

  • Firmware corruption.
  • Damage to PSUs.
  • Hardware failure.

And wouldn't you believe it, the power for my block went out briefly today. When it came back on, my Wi-Fi controller would no longer boot, leaving my wireless APs offline.

We need to talk

I'm hopeful that an RCA will be released soon to offer some insights into not just what went wrong at Equinix SG1, but how others in the data centre space can learn from this incident.

Data centres are vital building blocks of modern societies. As an industry, we need to move beyond the secrecy of the past into having the tough but candid conversations we need.

Only then can we get better together.