VerneMQ / Pulsar drops messages without processing them.
Incident Report for Netdata
Resolved
Since the service has been stable for the last 12 hours, we are declaring the incident resolved.
Posted Apr 13, 2021 - 09:10 UTC
Update
The service is stable and all messages from agents are consumed properly. We will continue to monitor for any inconsistencies and close the incident in the coming hours.
Posted Apr 13, 2021 - 07:17 UTC
Update
Monitoring continues. We see no anomalies so far.
Posted Apr 13, 2021 - 07:15 UTC
Update
Service under monitoring.
Posted Apr 12, 2021 - 20:15 UTC
Monitoring
A few Kubernetes pods hit a race condition and hung while waiting on a response from Redis services. The issue has been resolved, and all messages are now handled properly. We will leave the additional resources in place and monitor performance and stability over the coming hours. A root cause analysis will follow with Redis. In addition, we will introduce further monitoring metrics so that we can react to and rectify similar incidents in the future.
Posted Apr 12, 2021 - 20:14 UTC
Update
We continue to investigate the root cause. We have added more pods on Kubernetes to reduce the number of dropped messages. Message consumption has improved significantly, but we still observe lost messages. Further updates will follow once we identify what is causing the issue.
Posted Apr 12, 2021 - 18:37 UTC
Update
Issue still under investigation.
Posted Apr 12, 2021 - 18:35 UTC
Investigating
We are currently investigating the issue.
Posted Apr 12, 2021 - 18:35 UTC
This incident affected: Agent Services.