Recent nightly static and local builds of Netdata Agent overwrite netdata.conf with defaults
Incident Report for Netdata
Postmortem

Prior to netdata/netdata#17475, the netdata.conf and netdata-updater.conf files were handled by the installer code outside of the build system. With the shift to using the build system to produce packages, handling for them needed to be moved into the build system. However, we did not test sufficiently to confirm that this would not break other installation types, and the change was not made conditional on packages being built.

As a result, the static and local builds with version v1.45.0-315-nightly overwrite these configuration files with their default templates, so all local changes to those files are lost. In particular, if the Agent configuration had been changed for longer retention, the overwrite reverts those settings, and any metrics data beyond the default retention is lost on the first run of this version.

We have pulled the affected build artifacts to prevent our installer from using them.

The fix ensures the issue no longer occurs starting with version v1.45.0-326-nightly, but affected installations will not automatically recover their previous configurations. If you were using a non-default netdata.conf and/or netdata-updater.conf and experienced this bug, you will need to manually reconfigure your Netdata install.
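For anyone who keeps running nightly builds, one way to make manual reconfiguration easier is to keep timestamped copies of these two files before each update. The following is only a minimal sketch, not part of the Netdata installer: the function name `backup_configs` and the backup layout are hypothetical, and it assumes netdata-updater.conf sits in the same directory as netdata.conf.

```python
import shutil
import time
from pathlib import Path

# Directories named in this report for netdata.conf.
# Assumption: netdata-updater.conf lives in the same directory.
CANDIDATE_DIRS = [Path("/etc/netdata"), Path("/opt/netdata/etc/netdata")]
CONFIG_NAMES = ["netdata.conf", "netdata-updater.conf"]


def backup_configs(backup_root, dirs=CANDIDATE_DIRS, names=CONFIG_NAMES):
    """Copy any existing config files into a timestamped folder.

    Hypothetical helper for illustration; returns the list of copies made.
    """
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    copied = []
    for directory in dirs:
        directory = Path(directory)
        for name in names:
            src = directory / name
            if not src.is_file():
                continue  # this install layout or file is not present
            # Mirror the source directory under the backup root so the
            # two install layouts cannot overwrite each other's copies.
            dest_dir = Path(backup_root) / stamp / directory.relative_to(directory.anchor)
            dest_dir.mkdir(parents=True, exist_ok=True)
            dest = dest_dir / name
            shutil.copy2(src, dest)  # copy2 also preserves timestamps
            copied.append(dest)
    return copied
```

Calling `backup_configs("/var/backups/netdata")` (an example path) before an update would leave a dated copy to compare against if the files are ever reset to defaults.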

As we aim to carefully develop Netdata for many platforms and hardware architectures, we release nightly builds of the Netdata Agent to catch any issues our changes may have caused, beyond our own internal testing. Unfortunately, we sometimes make mistakes that our testing does not catch, with data loss as an extreme possible outcome. Therefore we strongly recommend using our stable releases for production systems. You can review the difference between nightly and stable builds, and our recommended best practices.

If you have been affected by this issue and/or have any questions, please let us know.

Posted May 02, 2024 - 10:48 UTC

Resolved
The build artifacts for the new nightly release (1.45.0-326) are now available, and we consider the incident resolved. Should you experience any issues, please let us know!
Posted May 01, 2024 - 17:22 UTC
Update
Update regarding potential data loss. This will happen if the configuration had been changed to increase metric retention (with respect to the defaults). Unfortunately, any stored data beyond the default metric retention will be lost on running installs of the affected builds.

The only way to prevent this is by not using (or having used) version v1.45.0-315-nightly. We have made sure that the corresponding artifacts are no longer accessible by the installer.
Posted May 01, 2024 - 16:28 UTC
Update
The affected build number is v1.45.0-315-nightly, and local builds starting with commit https://github.com/netdata/netdata/commit/5973417027606bacf044b3ead40a882931ce773f (April 30, 11:45 UTC) up until commit https://github.com/netdata/netdata/commit/0f2a261839d5ffc42f17383b4292673aa93d6a1f (May 1, 15:13 UTC).
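As a rough aid for checking an install against the details above, the sketch below tests a nightly version string and a local build's commit time. It is not an official tool: the function names are hypothetical, the year 2024 is taken from this report's posting dates, and treating the end commit as exclusive (that is, already containing the fix) is an assumption. Commit timestamps are only a proxy; checking the actual commit ancestry of a local build is more reliable.

```python
import re
from datetime import datetime, timezone

# Affected nightly build, per this report: v1.45.0-315-nightly.
AFFECTED_NIGHTLY = (1, 45, 0, 315)

# Commit-time window for local builds, per this report (UTC).
# Assumption: the end commit is excluded from the affected range.
WINDOW_START = datetime(2024, 4, 30, 11, 45, tzinfo=timezone.utc)
WINDOW_END = datetime(2024, 5, 1, 15, 13, tzinfo=timezone.utc)

_NIGHTLY = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)-(\d+)-nightly$")


def is_affected_nightly(version: str) -> bool:
    """True only for the single affected nightly version string."""
    match = _NIGHTLY.match(version.strip())
    if match is None:
        return False  # stable release or unrecognized format
    return tuple(int(part) for part in match.groups()) == AFFECTED_NIGHTLY


def in_affected_window(commit_time: datetime) -> bool:
    """True if a timezone-aware commit time falls inside the window."""
    return WINDOW_START <= commit_time < WINDOW_END
```

For example, `is_affected_nightly("v1.45.0-315-nightly")` is true, while the fixed `v1.45.0-326-nightly` and any stable version string are not flagged. `in_affected_window` expects a timezone-aware `datetime`; passing a naive one raises a `TypeError`.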
Posted May 01, 2024 - 15:59 UTC
Identified
We've identified an issue with static and local builds of the Netdata Agent that causes its main configuration file, `/etc/netdata/netdata.conf` or `/opt/netdata/etc/netdata/netdata.conf`, to be overwritten with the defaults. The `netdata-updater.conf` file is similarly affected.

Depending on your configuration settings that have been changed with respect to the defaults, this may result in data loss. We will update this incident with more detailed information on the impact as soon as possible.

Docker image or native package builds, as well as stable builds, are not affected.

We have created a fix (https://github.com/netdata/netdata/pull/17572) and have triggered a new nightly build. As soon as the new build artifacts are available, we will update this incident.
Posted May 01, 2024 - 15:44 UTC