All Systems Operational
Cloud Web UI: Operational, 99.79% uptime over the past 90 days
Agent-Cloud Link (ACLK): Operational, 99.83% uptime over the past 90 days
Agent Services: Operational, 99.8% uptime over the past 90 days
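For readers who prefer absolute numbers, the uptime percentages above can be converted into approximate downtime. The sketch below is only illustrative arithmetic: it assumes the reporting window is exactly 90 days and that the published percentage covers the whole window; the function name `downtime_hours` is hypothetical and not part of any status page tooling.

```python
def downtime_hours(uptime_percent: float, window_days: int = 90) -> float:
    """Approximate downtime, in hours, implied by an uptime percentage.

    Assumes the percentage covers the full window (90 days by default).
    """
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    window_hours = window_days * 24
    return window_hours * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    # Figures taken from the component summary above.
    components = {
        "Cloud Web UI": 99.79,
        "Agent-Cloud Link (ACLK)": 99.83,
        "Agent Services": 99.8,
    }
    for name, uptime in components.items():
        print(f"{name}: roughly {downtime_hours(uptime):.2f} h of downtime")
```

Under these assumptions, each component comes out at roughly 3.5 to 4.5 hours of downtime over the window. The published percentages are rounded, so the results are estimates only.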
Past Incidents
Apr 23, 2021

No incidents reported today.

Apr 22, 2021

No incidents reported.

Apr 21, 2021
Resolved - This incident has been resolved.
Apr 21, 20:31 UTC
Update - We observe nominal behavior on all microservices. Pulsar is consuming messages at a normal rate. We continue to monitor all services.
Apr 21, 20:28 UTC
Monitoring - We have rolled back all Pulsar replications. Services are stabilized. Residual effects (like late notifications) may still exist at this point. We continue to monitor the services.
Apr 21, 19:54 UTC
Identified - Geo-replication on Pulsar partially failed due to increased RAM requirements. The extra requirements forced Kubernetes to restart specific pods. As a result, some messages were transmitted out of order, and many notifications were delayed.
Apr 21, 19:52 UTC
Update - The Web UI is back online. Service is partially restored.
Apr 21, 18:38 UTC
Update - It seems that Netdata messages are not being consumed properly. The issue relates to a new Pulsar replication that was introduced today. We are proceeding with an immediate rollback.
Apr 21, 18:30 UTC
Investigating - We are currently investigating the issue.
Apr 21, 18:27 UTC
Apr 20, 2021

No incidents reported.

Apr 19, 2021

No incidents reported.

Apr 18, 2021

No incidents reported.

Apr 17, 2021

No incidents reported.

Apr 16, 2021

No incidents reported.

Apr 15, 2021

No incidents reported.

Apr 14, 2021

No incidents reported.

Apr 13, 2021
Resolved - All service indicators are nominal. Incident is considered resolved.
Apr 13, 16:07 UTC
Monitoring - The fix has been applied and we are monitoring performance.
Apr 13, 15:08 UTC
Update - Staging tests of the new fix completed successfully. We are proceeding with deployment to production.
Apr 13, 14:19 UTC
Identified - The issue has been traced to a new query introduced in the latest release. A fix has been prepared and is currently under testing.
Apr 13, 14:15 UTC
Update - We are dropping messages from the queue in order to reduce load on the DB and bring the system back to a normal state. If the problem is not resolved, we will roll back the latest updates.
Apr 13, 13:46 UTC
Investigating - We are currently investigating the issue. MongoDB is experiencing high load.
Apr 13, 13:44 UTC
Apr 12, 2021
Resolved - Since services have been stable for the last 12 hours, we are declaring the incident resolved.
Apr 13, 09:10 UTC
Update - The service is stable and all messages from agents are being consumed properly. We will continue to monitor for any inconsistencies and close the incident in the coming hours.
Apr 13, 07:17 UTC
Update - Monitoring continues. We see no anomalies so far.
Apr 13, 07:15 UTC
Update - Service under monitoring.
Apr 12, 20:15 UTC
Monitoring - A few Kubernetes pods hit a race condition while waiting on a response from Redis services. The issue has been resolved, and all messages are currently being handled properly. We will leave the additional resources in place and monitor performance and stability over the following hours. A root cause analysis will follow with Redis. In addition, further monitoring metrics will be introduced so that we can react to and rectify similar incidents in the future.
Apr 12, 20:14 UTC
Update - We continue to investigate the root cause. We have added more pods on Kubernetes to reduce the number of dropped messages, and message consumption has improved significantly, but we still observe lost messages. Further updates will follow once we identify what is causing the issue.
Apr 12, 18:37 UTC
Update - Issue still under investigation.
Apr 12, 18:35 UTC
Investigating - We are currently investigating the issue.
Apr 12, 18:35 UTC
Apr 11, 2021

No incidents reported.

Apr 10, 2021

No incidents reported.

Apr 9, 2021

No incidents reported.