OpenAI halts Sora sign-ups amid overwhelming launch day demand

OpenAI

American artificial intelligence research organization

Provocative

Highlights

OpenAI's services, including ChatGPT and Sora, faced significant disruptions due to a new telemetry service failure.
The outage began around 3 p.m. PT and took approximately three hours to resolve.
OpenAI will implement measures to prevent similar incidents in the future, including better monitoring and accessible Kubernetes API servers.

Story

On December 11, 2024, OpenAI experienced one of its longest outages affecting its services, including ChatGPT and Sora. The disruption, which began around 3 p.m. PT, resulted from a new telemetry service that was intended to collect metrics for Kubernetes, which is crucial for managing containers in their software infrastructure. The situation worsened as the telemetry service caused resource-intensive operations, overwhelming the Kubernetes API servers. After a few hours, OpenAI announced a pathway to recovery. This major service outage was unprecedented for the company and had significant repercussions for users worldwide. During the outage, users were unable to access ChatGPT and Sora, with services becoming progressively unavailable. OpenAI initiated an investigation shortly after identifying the issue and began work on a fix by 4:55 p.m. PT the same day. By 7 p.m. PT, the company provided updates on its status page indicating that they were slowly restoring services. Despite the company’s recognition of the public demand, exacerbated by a recent product launch, the technical issues stemming from the new telemetry service were distinct and caused the prolonged outage. OpenAI published a postmortem on December 12, where it clarified that the service interruption was not prompted by a recent product launch or a security incident. They explained that telemetry services have a broad operational footprint and that their configuration had unintended consequences on Kubernetes' performance. The company further elaborated that their use of DNS caching complicated the situation, causing delays in visibility during the rollout, further impeding their ability to troubleshoot effectively. Their efforts to identify and resolve the issue were hindered by locked-out access to essential servers. In response to this incident, OpenAI committed to implementing changes designed to enhance their monitoring and management processes. They plan to adopt improved phased rollouts and strengthen mechanisms enabling engineers to access critical API servers under any circumstances. The apology issued by OpenAI emphasized their accountability and acknowledgment of falling short of their customer’s expectations. The company now faces the challenge of regaining user trust and ensuring stability in service as they work through the implications of this outage and prepare for future operational resiliency.

Opinions

You've reached the end