CAPA 001: Toddler Night Light Outage

This is a somewhat tongue-in-cheek postmortem, written as a corrective and preventive action (CAPA) report, of an incident that recently impacted my home network (and my night’s sleep). Below, I cover the root cause analysis (RCA), immediate corrective actions, and preventive actions implemented to ensure it doesn’t recur.

It’s a small-scale example of a control-plane outage causing a cascading failure in the corresponding data-plane. I hope it shows how mature operational practices can improve availability even in small environments. And yes, maybe help me get more sleep.


Date: 2 February 2026
Role: Dad / Lead Site Reliability Engineer (Home Division)
Impact: Critical (Severity 1)

Executive Summary

Automatic updates applied to home-nas-1 triggered an automated restart. Prior configuration changes, specifically enabling link aggregation (bonding) on the host’s primary network interface, had not yet been validated by a reboot. These changes conflicted with the existing Docker network configuration. As a result, critical services running in Docker failed to restart, including Home Assistant.

Holly (my toddler) has a night-light and speaker setup in her room, built on an ESP32 and automated via Home Assistant. Before this outage, I was not aware that ESPHome automatically restarts after 15 minutes if it loses connection to the Home Assistant API. As a result, with Home Assistant down, both night-light availability and white-noise uptime were impacted.

Incident Timeline

  • 00:24: Synology DSM initiated an automated scheduled update. The NAS began a reboot sequence. The Docker daemon terminated.
  • 00:24: Immediate failure (audio). As the Music Assistant container terminated, the white noise stream ended. The toddler’s room was now silent.
  • 00:29: The NAS returned online. However, this was the first boot since eth0 had been bonded into bond0. The Docker stack attempted to launch but failed due to a network resource conflict. The control plane (Home Assistant) failed to start.
  • 00:39: Delayed failure (lighting). The Nursery Night Light (ESPHome) had been unable to reach the API server for 15 minutes. It triggered its internal watchdog, panic-rebooted, and reset to its safety default: RESTORE_DEFAULT_OFF. The toddler’s room went pitch black.
  • 01:30: The Alert. Holly woke up, disoriented by the pitch-black, silent room. Cue a high-volume alert system (a.k.a. screaming). Time to detect (TTD): 66 minutes.
  • 01:31: Acknowledgement. My wife and I woke up, acknowledged the alert, and divided and conquered. My wife settled Holly, notified me that the night light was unavailable, and I started investigating.
  • 01:45: Mitigation. I reconfigured the Docker network (eth0 → bond0), force-restarted all Docker Compose stacks, and re-executed Holly’s bedtime Home Assistant script to restore night-light state and white noise.

Root Cause Analysis (RCA)

  1. The trigger: The DSM update triggered a restart. While bonding eth0 and eth1 into bond0 had succeeded when the change was implemented, this restart was the first time Docker had attempted to configure the macvlan network since that change. As part of bonding, eth0 became a member of bond0 and could no longer be configured independently, so Docker could no longer configure that network.
  2. The conflict: The Docker macvlan network was explicitly pinned to eth0 in configuration. When the OS enslaved eth0 to the new bond, Docker could not attach. As a result, dependent services failed to start.
  3. The cascading failure: The ESPHome light controller followed its programmed logic: If Server Lost → Reboot → Restore Default. Unfortunately, RESTORE_DEFAULT_OFF meant it defaulted to darkness.

Corrective Actions

  • Network reconfiguration: I deleted the existing macvlan Docker network, updated the Docker Compose configuration to reference bond0 rather than eth0, and forced recreation of containers across all Compose stacks.

    $ find . -name "docker-compose.yml" -execdir docker compose up -d --force-recreate \;
    

    Note: --force-recreate was required because the Docker network the existing containers referenced had been deleted.
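
As a rough sketch of the Compose change described above, a macvlan network pinned to the bond might be declared along these lines. The network name, subnet, and gateway are illustrative placeholders, not values taken from the real configuration; only the parent interface (bond0 instead of eth0) reflects the actual fix.

```yaml
# Sketch of a Compose-managed macvlan network attached to the bond.
# Assumptions: the network name "lan", the subnet, and the gateway are placeholders.
networks:
  lan:
    driver: macvlan
    driver_opts:
      parent: bond0   # previously eth0; now names the bonded interface
    ipam:
      config:
        - subnet: 192.168.1.0/24
          gateway: 192.168.1.1
```

Services that need an address on the LAN would then list this network under their own networks key.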

Preventive Actions

A. Data-Plane Independence (ESPHome)

The data-plane devices were too dependent on the control plane (Home Assistant) for availability.

  • reboot_timeout: 0s: Disabled the panic reboot. If Home Assistant disappears, the light now keeps its last known state (“Sleepy Breath”) instead of rebooting.
  • restore_mode: RESTORE_DEFAULT_OFF: Kept the default OFF state for safety (to prevent the light potentially returning to an undesirable bright state after power cuts), but moved recovery responsibility to the automation layer.
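
For illustration, the two settings above sit in the ESPHome device configuration roughly as follows. This is a minimal sketch: the output and light platforms, the pin, and the IDs are hypothetical stand-ins for the real hardware, and only reboot_timeout and restore_mode come from the actual change.

```yaml
# Sketch of the relevant ESPHome settings. Everything except
# reboot_timeout and restore_mode is a hypothetical placeholder.
api:
  reboot_timeout: 0s   # 0s disables the reboot when the API connection is lost (default: 15min)

output:
  - platform: ledc            # hypothetical PWM output driving the LED
    pin: GPIO5                # placeholder pin
    id: night_light_pwm

light:
  - platform: monochromatic   # hypothetical light platform
    name: "Nursery Night Light"
    output: night_light_pwm
    restore_mode: RESTORE_DEFAULT_OFF   # kept as-is: OFF remains the fallback on boot
```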

B. Restart Resiliency (the “Guardian” Automation)

  • State Persistence: Created a boolean flag (input_boolean.holly_is_asleep) in Home Assistant, set by the bedtime script, that “remembers” Holly is in bed. This state persists across system reboots.
  • Crash Recovery: A new “Guardian” automation runs on system startup.
  • Logic: If System Started AND Holly is Asleep == True → re-execute the bedtime routine.
  • Smart Recovery: If the system recovers at 2 AM, it bypasses the “Bedtime Classical” playlist and cuts straight to the white noise.
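
The Guardian logic can be sketched in Home Assistant YAML roughly like this. The entity ID script.holly_bedtime and the automation alias are hypothetical names used for illustration; input_boolean.holly_is_asleep is the flag described above.

```yaml
# Sketch of the helper and "Guardian" automation (e.g. in configuration.yaml).
# Assumption: script.holly_bedtime is a hypothetical entity ID for the bedtime script.
input_boolean:
  holly_is_asleep:
    name: Holly is asleep
    # No "initial:" value is set, so Home Assistant restores the last state on restart.

automation:
  - alias: "Guardian: resume bedtime after restart"
    trigger:
      - platform: homeassistant
        event: start          # fires once Home Assistant has started
    condition:
      - condition: state
        entity_id: input_boolean.holly_is_asleep
        state: "on"
    action:
      - service: script.turn_on
        target:
          entity_id: script.holly_bedtime
```

The playlist bypass is not shown here; in this sketch it would live inside the bedtime script itself, for example as a time-based condition.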

C. Removing the SPOF (High Availability DNS)

Mitigation was hampered because the NAS was a single point of failure (SPOF) for our home network DNS.

  • Redundancy: Deployed a Raspberry Pi 3 (home-pi) as a secondary DNS replica.
  • Synchronization: Implemented AdGuardHome-Sync to replicate blocklists and config from the NAS every 5 minutes.

Conclusion

While this was not the highest-stakes operational incident I have dealt with in my career, it is still a useful real-world example of cascading failure. Key lessons from this incident:

  • Decoupling data-plane availability from control-plane availability, in this case the ESPHome night light from the Home Assistant API, is paramount.
  • Redundancy for key services (e.g. DNS) is required no matter how small, low-stakes, or “non-production” the environment seems.
  • One must consider “who watches the watchmen?”: Home Assistant can notify me about degraded automation states, but it cannot notify me that it is itself unavailable.

Shameless plug

If your operational maturity isn’t at a point where you’re routinely using the CAPA process to learn from operational incidents, you might want to hire me to help with that. Get in touch!