Agent connectivity disruption

Incident Report for Netdata

Resolved

The number of connected Agents has returned to expected levels, and the number of Agents still running the affected nightly build (1.37.0-48) is decreasing, so we consider this incident resolved.
Posted Dec 15, 2022 - 06:58 UTC

Monitoring

The new build (1.37.0-55) is now available for most platforms. If you are on the affected version (1.37.0-48) and want to upgrade your Agents manually, please follow the instructions at https://learn.netdata.cloud/docs/agent/packaging/installer/update. If you have automatic updates configured, you can instead wait for the update to run during your local nighttime.

We will be monitoring the progress of Agents as they reconnect.
Posted Dec 14, 2022 - 19:02 UTC

Update

The new build (1.37.0-55) has been triggered, and we will post an update when it is ready. That update will include instructions for updating manually. Alternatively, you can wait for the automatic update to run during your local nighttime.

Note:
* If you are running a nightly build older than 1.37.0-48, you are not affected and no action is required.
* If you are running a stable build, you are not affected and no action is required. However, we do strongly recommend upgrading to 1.37.1 because of two security vulnerabilities in older versions.
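
The notes above can be turned into a quick check when you need to triage many Agents at once. The following is only a minimal sketch, not an official tool: the `classify` function, its return labels, and the assumed version-string shape (for example `v1.37.0-48-nightly`) are illustrative assumptions. The build numbers come from this report. Builds the report does not mention are labelled `unknown` rather than guessed.

```python
import re

# Build numbers taken from this incident report.
AFFECTED = ((1, 37, 0), 48)        # 1.37.0-48-nightly
FIXED = ((1, 37, 0), 55)           # 1.37.0-55
SECURITY_FIX_STABLE = (1, 37, 1)   # recommended stable release

# Assumed version-string shape: optional "v", "X.Y.Z", optional "-<build>",
# optional trailing suffix such as "-nightly". This shape is an assumption.
_PATTERN = re.compile(r"v?(\d+)\.(\d+)\.(\d+)(?:-(\d+))?(?:-.*)?")


def classify(version):
    """Return a hypothetical status label for an Agent version string."""
    match = _PATTERN.fullmatch(version.strip())
    if match is None:
        return "unknown"

    base = tuple(int(part) for part in match.group(1, 2, 3))

    if match.group(4) is None:
        # No build number: treated here as a stable release.
        if base >= SECURITY_FIX_STABLE:
            return "not_affected"
        return "not_affected_upgrade_recommended"

    build = (base, int(match.group(4)))
    if build == AFFECTED:
        return "affected"
    if build < AFFECTED:
        return "not_affected"
    if base == FIXED[0] and build >= FIXED:
        return "fixed"
    # Builds the report does not cover (for example 1.37.0-49 to -54).
    return "unknown"


if __name__ == "__main__":
    for v in ("v1.37.0-48-nightly", "v1.37.0-47-nightly", "1.37.0-55", "v1.36.0"):
        print(v, "->", classify(v))
```

Any Agent classified as `affected` should be updated manually or left to auto-update. One classified as `unknown` should be checked by hand.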
Posted Dec 14, 2022 - 17:17 UTC

Identified

We have identified the offending change in the Agent.

Only the latest nightly build (1.37.0-48-nightly) of the Agent is affected. The problem only occurs if the Agent tries to reconnect after having lost its first connection to Cloud. This means that if you restart your agent, the problem is avoided until its connection to Cloud drops.

We will issue a new nightly build that removes the offending change.
Posted Dec 14, 2022 - 15:56 UTC

Update

We are able to reproduce the issue and are attempting to pinpoint the cause.
Posted Dec 14, 2022 - 14:38 UTC

Investigating

We are seeing an increasing number of Agents that cannot connect to Cloud properly. We are investigating the cause; initial indications are that it may be related to the latest nightly release of the Agent (version 1.37.0-48-nightly).
Posted Dec 14, 2022 - 06:26 UTC
This incident affected: Agent - Cloud Connection (ACLK).