Degraded APIs in US data region
Resolved
Jul 06 at 12:05am CEST
Enabling OpenTelemetry-based instrumentation on the API routes hosted on Vercel resulted in Gateway Timeouts (HTTP 504) for a portion of requests from 6:00 PM to 9:00 PM (UTC).
The share of 504 timeouts across different parts of the application was as follows:
- US overall: 2.77%
- US Public API (/api/public*): 2.87%
- US Tracing (/api/public/ingestion): 0.89%
- US Prompt Management (/api/public/prompts): 14.01%
The Langfuse SDKs behaved as follows while the API was partially unavailable:
- Tracing: The Langfuse SDKs retried each batch of trace events, which reduced data loss during this partial outage.
- Prompt Management: The Langfuse SDKs cached fetched prompts and served the stale cache if the Langfuse API was not available. In cases where there was no cached prompt version (e.g., after redeployment or new instances), this partial outage may have caused runtime exceptions.
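The two client-side behaviors described above can be illustrated with a minimal sketch. This is not the actual Langfuse SDK code. Every name here (`send_with_retry`, `StalePromptCache`, `PromptUnavailableError`, the `send` and `fetch` callables, the retry count, the delays, and the TTL) is a hypothetical stand-in chosen for illustration.

```python
import time
from typing import Callable, Dict, List, Tuple


def send_with_retry(
    send: Callable[[List[dict]], None],
    batch: List[dict],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to send a batch, backing off exponentially between attempts.

    Returns True if the batch was delivered, False if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            send(batch)
            return True
        except Exception:
            if attempt == max_attempts:
                return False  # give up: this batch is lost
            sleep(base_delay * 2 ** (attempt - 1))
    return False


class PromptUnavailableError(Exception):
    """Raised when a fetch fails and no cached copy exists (cold cache)."""


class StalePromptCache:
    """Serve fresh entries from cache; on fetch failure fall back to stale ones."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        # prompt name -> (prompt text, time it was fetched)
        self._entries: Dict[str, Tuple[str, float]] = {}

    def get(self, name: str) -> str:
        entry = self._entries.get(name)
        now = self._clock()
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]  # still fresh, no network call
        try:
            value = self._fetch(name)
        except Exception as exc:
            if entry is not None:
                return entry[0]  # API unavailable: serve the stale copy
            # Cold cache (e.g. a freshly started instance): nothing to serve.
            raise PromptUnavailableError(name) from exc
        self._entries[name] = (value, now)
        return value
```

In this sketch, the cold-cache branch corresponds to the case noted above: an instance that has never fetched a prompt successfully has nothing to fall back on, so the failure surfaces to the caller.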
Langfuse v3 will introduce a series of infrastructure changes to increase the overall robustness of the service and the observability stack. We are also currently migrating to EKS on AWS.
The change that led to this incident is part of this effort and was intended to add OpenTelemetry-based observability to all services of the core application. While this change worked flawlessly on the EKS-based services, it negatively affected the current production deployment of the APIs on Vercel.
The EU data region was completely unaffected by this issue.
Affected services
[US] Health
[US] Trace Ingestion
[US] Prompts API
Updated
Jul 05 at 11:01pm CEST
We have reverted a change to our observability stack that caused these issues. All APIs are now fully operational. We will post a post-mortem shortly.
Affected services
[US] Health
[US] Trace Ingestion
[US] Prompts API
Created
Jul 05 at 08:00pm CEST
Some APIs are timing out (HTTP 504 - Gateway Timeout) in the US data region. We are investigating this issue.
Affected services
[US] Health
[US] Trace Ingestion
[US] Prompts API