Write-up published
Resolved
Cloudflare reports that their services are fully available again and safe to use. We have not observed any further issues or problem reports over the past hours, so we consider the incident resolved.
We will publish a post-mortem in due time and are actively working on reducing our dependency on Cloudflare.
Monitoring
More information on the impact of the incident as Langfuse runs behind Cloudflare’s proxy:
During this incident, most traces did not reach our ingestion endpoints and are permanently lost. If tracing events reach the Langfuse API, we have systems in place to replay them in case of service disruptions. However, a replay of lost events is not possible since these events never reached our infrastructure. All events accepted with 2xx status codes are or will be processed, while all events that logged errors (5xx) are lost.
Long-running services retained access to prompts through SDK-level caching. Newly deployed or restarted applications may have experienced downtime because the Langfuse Prompt Management API was unavailable.
The Langfuse UI was mostly inaccessible during the incident.
We use Cloudflare as our registrar, DNS management tool, and proxy, which prevented us from making routing updates during the incident to restore services earlier.
As a follow-up to this incident, we are working on a post-mortem (which will be published here) and on removing Cloudflare as a dependency to mitigate the impact of future Cloudflare incidents.
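For readers who want to see why long-running services kept their prompts while fresh processes did not, here is a minimal sketch of a cache that serves a stale value when a refresh fails. It is an illustration only, not the Langfuse SDK's implementation: the `PromptCache` class, the `fetch` callable, and the `ttl_seconds` parameter are all hypothetical names.

```python
import time
from typing import Callable, Dict, Tuple


class PromptCache:
    """Hypothetical TTL cache that falls back to a stale prompt on fetch errors."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # assumed: calls the prompt API, may raise
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}  # name -> (value, fetched_at)

    def get(self, name: str) -> str:
        entry = self._entries.get(name)
        now = self._clock()
        # Fresh entry: serve it without touching the network.
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]
        try:
            value = self._fetch(name)
        except Exception:
            # API unreachable: a long-running process still has a stale copy.
            if entry is not None:
                return entry[0]
            # A freshly started process has nothing cached, so the error surfaces.
            raise
        self._entries[name] = (value, now)
        return value
```

A process that fetched a prompt before the outage keeps returning it from `get`, whereas a restarted process starts with an empty cache and fails on its first call. Shipping a hard-coded fallback prompt with the application is one way to cover that second case.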
Monitoring
Cloudflare confirmed that they have applied a fix and believe the incident to be resolved.
Monitoring
For the past 10 minutes, we have seen requests being processed successfully across all environments again.
Monitoring
As a workaround for this incident, customers may adjust their DNS resolution to bypass Cloudflare and connect directly to our load balancers.
For this to work, the following mapping applies:
cloud.langfuse.com: prod-eu-loadbalancer-1283300419.eu-west-1.elb.amazonaws.com;
us.cloud.langfuse.com: prod-us-loadbalancer-901843093.us-west-2.elb.amazonaws.com;
hipaa.cloud.langfuse.com: prod-hipaa-loadbalancer-906051818.us-west-2.elb.amazonaws.com.
You may observe elevated latencies when using these endpoints directly.
We strongly advise treating this as a temporary patch. We do not guarantee support for these endpoints beyond the duration of this incident and the following two days. Be prepared to remove this patch as soon as the incident is resolved.
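One possible way to apply the mapping above is a temporary hosts-file override. The update does not prescribe a mechanism, so the hosts-file approach and the helper names `resolve_ipv4` and `render_hosts_entries` are assumptions made for illustration. Load balancer IP addresses can change, which is one more reason to treat any such entry as short-lived.

```python
import socket
from typing import Callable, Dict, List

# Mapping taken from the update above: public hostname -> load balancer hostname.
DIRECT_ENDPOINTS: Dict[str, str] = {
    "cloud.langfuse.com": "prod-eu-loadbalancer-1283300419.eu-west-1.elb.amazonaws.com",
    "us.cloud.langfuse.com": "prod-us-loadbalancer-901843093.us-west-2.elb.amazonaws.com",
    "hipaa.cloud.langfuse.com": "prod-hipaa-loadbalancer-906051818.us-west-2.elb.amazonaws.com",
}


def resolve_ipv4(hostname: str) -> List[str]:
    """Return the sorted, de-duplicated IPv4 addresses for a hostname."""
    infos = socket.getaddrinfo(
        hostname, 443, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return sorted({info[4][0] for info in infos})


def render_hosts_entries(
    mapping: Dict[str, str],
    resolve: Callable[[str], List[str]] = resolve_ipv4,
) -> List[str]:
    """Build hosts-file lines that point each public name at one load balancer IP."""
    lines: List[str] = []
    for public_name, lb_name in mapping.items():
        ips = resolve(lb_name)
        if ips:  # skip names whose load balancer did not resolve
            lines.append(f"{ips[0]} {public_name}  # temporary: bypasses Cloudflare")
    return lines
```

Calling `render_hosts_entries(DIRECT_ENDPOINTS)` returns lines that could be appended to a hosts file by hand. Remove them once the incident is resolved so that traffic returns to the regular path.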
Monitoring
Cloudflare has identified the issue and is working towards full restoration of service access. We're seeing a small number of requests being processed successfully with the majority being blocked.
Monitoring
We received an update from Cloudflare that services are recovering. We're still seeing sporadic errors across the stack.
Monitoring
We see that our services are recovering and serving traffic again. We're actively monitoring the situation.
Monitoring
We're currently observing elevated error rates across all services due to an issue with our edge network provider (Cloudflare). This is affecting us and many of the tools we rely on. We're working on restoring access to the application.