Resolved
We narrowed the issue down to intermittent problems in our internal networking stack, where elevated delays caused cascading failures. In the meantime, we've added capacity to reduce the likelihood of this recurring, and we've modified our autoscaling and alerting configuration to respond more proactively if similar issues occur again.
We have also engaged with our cloud provider to improve the reliability of this critical component.
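As an illustration of what "more proactive" scaling can mean, the sketch below scales out when ingestion latency trends toward the alert threshold rather than waiting for timeouts. This is a hypothetical example with illustrative names and thresholds, not our actual configuration.

```python
from dataclasses import dataclass

@dataclass
class IngestMetrics:
    p99_latency_ms: float   # observed p99 span-ingestion latency
    queue_depth: int        # spans waiting to be processed
    replicas: int           # current processing capacity

def desired_replicas(m: IngestMetrics,
                     latency_target_ms: float = 500.0,
                     max_replicas: int = 20) -> int:
    """Scale out early when latency approaches the target, not only after errors."""
    if m.p99_latency_ms < 0.8 * latency_target_ms and m.queue_depth < 1_000:
        return m.replicas  # healthy: keep current capacity
    # Grow proportionally to how far latency has drifted past the early-warning mark.
    factor = max(1.0, m.p99_latency_ms / (0.8 * latency_target_ms))
    return min(max_replicas, max(m.replicas + 1, round(m.replicas * factor)))

if __name__ == "__main__":
    # Latency is still below the alert threshold but trending up: scale out now.
    print(desired_replicas(IngestMetrics(p99_latency_ms=450.0, queue_depth=5_000, replicas=4)))
```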
We apologize for any inconvenience that this incident has caused.
Investigating
The service has been stable since we added capacity to the system. We are still investigating the initial root cause of the service degradation.
Investigating
Today, between 10:49am and 12:24pm UTC, and again between 1:31pm and 1:41pm UTC, we observed elevated latencies and timeout errors on span ingestion in our European environment. We have increased our processing capacity, which has reduced errors in the meantime.
Our team is investigating the root cause and will provide an update here once we've found it.