We noticed degraded cloud application performance
Incident Report for Netdata
Resolved
Incident is resolved.
Posted Apr 21, 2021 - 20:31 UTC
Update
We observe nominal behavior on all microservices. Pulsar is consuming messages at a normal rate. We continue to monitor all services.
Posted Apr 21, 2021 - 20:28 UTC
Monitoring
We have rolled back all Pulsar replications. Services have stabilized. Residual effects (such as delayed notifications) may still occur at this point. We continue to monitor the services.
Posted Apr 21, 2021 - 19:54 UTC
Identified
Geo-replication on Pulsar partially failed due to increased RAM requirements. The extra requirements forced Kubernetes to restart specific pods. As a result, some messages were transmitted out of order, and many notifications were delayed.
Posted Apr 21, 2021 - 19:52 UTC
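For illustration only, the following is a minimal sketch of how out-of-order delivery of the kind described above could be detected on the consumer side. It assumes each message carries a producer name and a monotonically increasing sequence number; the function name and the message shape are hypothetical and are not taken from Netdata's code or from the Pulsar client API.

```python
from typing import Dict, Iterable, List, Tuple


def find_out_of_order(messages: Iterable[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Return the (producer, sequence) pairs that arrived after a higher
    sequence number had already been seen from the same producer.

    The (producer, sequence) message shape is an assumption made for this sketch.
    """
    highest_seen: Dict[str, int] = {}  # highest sequence seen so far, per producer
    late: List[Tuple[str, int]] = []
    for producer, seq in messages:
        if producer in highest_seen and seq < highest_seen[producer]:
            # A newer message from this producer was already delivered.
            late.append((producer, seq))
        else:
            highest_seen[producer] = seq
    return late


if __name__ == "__main__":
    received = [("agent-a", 1), ("agent-a", 3), ("agent-a", 2), ("agent-b", 1)]
    print(find_out_of_order(received))
```

Tracking only the highest sequence per producer keeps the check cheap, at the cost of not reporting gaps (messages that never arrive).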
Update
The Web UI is back online. Service is partially restored.
Posted Apr 21, 2021 - 18:38 UTC
Update
Netdata messages do not appear to be consumed properly. The issue is related to a new Pulsar replication that was introduced today. We are proceeding with an immediate rollback.
Posted Apr 21, 2021 - 18:30 UTC
Investigating
We are currently investigating the issue.
Posted Apr 21, 2021 - 18:27 UTC
This incident affected: Cloud Web UI and Agent Services.