Apr 15, 2025, 12:00 AM

Google Cloud faces outage after UPS failure during power loss

Highlights
  • On March 29th, a loss of utility power in the us-east5-c zone caused a significant interruption to Google Cloud services.
  • The outage was worsened when the uninterruptible power supplies (UPSes) suffered a critical battery failure and provided no backup power.
  • Google has committed to preventing future outages by hardening power failure recovery paths and auditing its backup systems.
Story

Google experienced a significant outage on March 29th in its us-east5-c zone, located in Columbus, Ohio. The incident lasted six hours and left more than 20 Google Cloud services degraded or entirely unavailable.

The outage was triggered by a loss of utility power. Uninterruptible power supplies (UPSes) are designed to cover such an event by providing immediate backup power, but Google's UPSes suffered a critical battery failure and left services without power. The failure also kept the diesel-powered generators from taking over, because engineers first had to bypass the UPS systems manually before power could be restored.

Engineers were alerted to the outage at 12:54 PM Pacific Time and had the generators fully operational by 2:49 PM, which allowed services to begin recovering. Most Google Cloud services were restored shortly afterward, while some needed manual intervention and took longer to recover fully.

Google's incident report acknowledged the critical failure of its power infrastructure and committed to preventing similar incidents. The company outlined several actions: hardening cluster power failure recovery paths so that services can be restored more quickly after a power interruption, auditing its systems to ensure automatic failover, and working with its UPS vendor to address the issues identified in the battery backup systems. Together, these steps aim to make its cloud services more resilient and reduce the risk of future outages.

The outage is a stark reminder that disaster recovery infrastructure and procedures need regular testing. Hyperscale cloud providers like Google promise high resilience and uptime, but incidents like this show that even the most robust systems can fail under unexpected circumstances. Businesses and users that rely on cloud services should recognize the potential impact of such outages and prepare for service interruptions.
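
To put the reported timeline in perspective, the gap between the 12:54 PM alert and the generators being fully operational at 2:49 PM can be worked out directly. The Python snippet below is only an illustrative sketch. The helper name `elapsed` is hypothetical, and the share calculation assumes the six-hour outage window is the right baseline, which the report summary above does not spell out.

```python
from datetime import datetime, timedelta


def elapsed(start: str, end: str) -> timedelta:
    """Return the time between two same-day clock times such as '12:54 PM'."""
    fmt = "%I:%M %p"  # 12-hour clock with AM/PM
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


# Times as reported in the story (Pacific Time, same day).
alert_to_generators = elapsed("12:54 PM", "2:49 PM")
print(alert_to_generators)  # 1:55:00

# Assumption: compare against the six-hour total outage duration.
share = alert_to_generators / timedelta(hours=6)
print(f"{share:.0%}")  # 32%
```

By this rough measure, restoring power took just under two hours, or about a third of the total outage, with the rest spent on service recovery.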
