We notice degraded cloud application performance

Incident Report for Netdata

Resolved

The incident is resolved.

Posted Apr 21, 2021 - 20:31 UTC

Update

We observe nominal behavior on all microservices. Pulsar messages are being consumed at a normal rate. We continue to monitor all services.

Posted Apr 21, 2021 - 20:28 UTC

Monitoring

We have rolled back all Pulsar replications. Services have stabilized. Residual effects (such as late notifications) may still occur at this point. We continue to monitor the services.

Posted Apr 21, 2021 - 19:54 UTC

Identified

Geo-replication on Pulsar partially failed due to increased RAM requirements. The extra requirements forced Kubernetes to restart specific pods. As a result, some messages were delivered out of order, and many notifications were delayed.

Posted Apr 21, 2021 - 19:52 UTC

Update

The Web UI is back online. Service is partially restored.

Posted Apr 21, 2021 - 18:38 UTC

Update

It seems that Netdata messages are not being consumed properly. The issue relates to a new Pulsar replication that was introduced today. We are proceeding with an immediate rollback.

Posted Apr 21, 2021 - 18:30 UTC

Investigating

We are currently investigating the issue.

Posted Apr 21, 2021 - 18:27 UTC

This incident affected: Cloud Web UI and Agent (all platforms).