The system returned to a stable state.
We've caught up again and improved our alerting to detect issues like this more proactively going forward.
We identified the issue, applied a patch, and worked through the backlog. All events should be processed as usual again.
The ingestion API is back to normal.
The issue has been resolved and we're back to the usual processing times.
There should be no more ingestion delay.
Query performance recovered. System operational.
Processing has been stable for several hours now.
We've completed the catch-up for experiment spans and resolved the problem.
Degraded database read replicas were re-created and are now serving the public API path. Public API latency remains stable.
The LLM-as-a-Judge delays in the HIPAA data region have now been resolved. We will continue to monitor the situation.
All data is now being processed on time.
We resolved the issue by scaling our infrastructure.
The backlog was processed and ingestion delays are below 30 seconds again.
We've adjusted scaling in the system and currently see delays below 5 minutes for LLM-as-a-Judge executions.
This issue has been resolved.
The system returned to a fully stable state.
The issue has been resolved.
We've mitigated the impact and event processing delays are below 60 seconds and falling continuously.
We scaled our systems and resolved the issue.
We scaled our systems and are able to process data on time. Individual projects may still be delayed.
We scaled our systems. APIs should respond quickly again.
The backlog is being processed and the processing delay is below 60 seconds.
We scaled our systems and queries should respond promptly again.
We are now processing all LLM-as-a-Judge executions on time.
The ingestion delay is back down and events are processed in near real-time.
The incident has been resolved by a patch. The underlying issue was an unbounded query that kept fetching data, causing our containers to be OOM-killed.
The issue has been resolved, and the LLM-as-a-Judge Evaluators are now executed again as expected.
We've caught up on all outstanding events and are processing newly received events within our usual timeframes.