We observe full recovery across all API and UI routes.
Latencies and error rates have recovered.
Maintenance has completed.
The underlying issue was resolved.
We resolved the underlying issue. A distributed cache rollout for ClickHouse caused the latency and error spikes. The rollout was reverted, and we now see normal API performance.
We caught up with our queues and have regular processing times again.
All data is processed in time.
We've reverted a bad patch and see recovery. All routes should work as expected.
The system returned to a stable state.
We've caught up again and improved our alerting to detect issues like this more proactively going forward.
We identified the issue, applied a patch, and worked through the backlog. All events should be processed as usual again.
The ingestion APIs are back to normal.
The issue has been resolved and we're back to the usual processing times.
There should be no more ingestion delay.
Query performance recovered. System operational.
The processing has been stable for several hours now.
We've completed the catch-up for experiment spans and resolved the problem.
Degraded database read replicas were re-created and are now serving the public API path. Public API latency remains stable.
The LLM-as-a-Judge delays in the HIPAA data region have now been resolved. We will continue to monitor the situation.
All data is processed in time now.
We resolved the issue by scaling our infrastructure.
The backlog was processed and ingestion delays are below 30 seconds again.
We've adjusted scaling in the system and currently see a delay of less than 5 minutes for LLM-as-a-Judge executions.
This issue has been resolved.
The system returned to a fully stable state.
The issue has been resolved.
We've mitigated the impact and event processing delays are below 60 seconds and falling continuously.
We scaled our systems and resolved the issue.
We scaled our systems and are able to process data in time. Single projects may still be delayed.
We scaled our systems. APIs should respond fast again.
The backlog is being processed and we're below 60s processing delay.
We scaled our systems and queries should respond in time.
We are now processing all LLM-as-a-Judge executions in time.