The service has been stable for the last 12 hours, so we are declaring this incident resolved.
Posted Apr 13, 2021 - 09:10 UTC
The service is stable and all messages from agents are being consumed properly. We will continue to monitor for any inconsistencies and will close the incident in the coming hours.
Posted Apr 13, 2021 - 07:17 UTC
Monitoring continues. We see no anomalies so far.
Posted Apr 13, 2021 - 07:15 UTC
Service under monitoring.
Posted Apr 12, 2021 - 20:15 UTC
A few Kubernetes pods hit a race condition and hung waiting on a response from the Redis services. The issue has been resolved, and all messages are currently being handled properly. We will leave the additional resources in place and monitor performance and stability over the following hours. A root cause analysis will follow with Redis. In addition, we will introduce further monitoring metrics so that similar incidents can be detected and rectified faster in the future.
Posted Apr 12, 2021 - 20:14 UTC
We continue to investigate the root cause. We have added more pods in Kubernetes to reduce the number of dropped messages; this has significantly improved message consumption, but we still observe some message loss. Further updates will follow once we identify what is causing the issue.