On the night of 19–20 May, around 200 Magic Pages sites went offline for about four and a half hours. I want to be upfront about what happened, why it took as long as it did to fix, and what we've already changed so it doesn't happen the same way again.
What happened
Magic Pages runs on dedicated Hetzner servers ("workers") that host the actual sites. Sites are spread across them, so each worker handles roughly 200-300 sites. This is deliberate: if one worker has a problem, only that slice of sites is affected, not the whole platform.
At 22:01 UTC on 19 May, one of the two NVMe drives inside a worker died. The server has two drives in a mirror, so losing one shouldn't bring anything down. That's the entire point of the mirror. But during the failure, the dying drive didn't fail cleanly. Instead, it hung for about a minute, locking up the server's disk system long enough that the ~200 Ghost containers running on that worker crashed and disappeared.
When the dust settled, the server was still up and the surviving drive was fine, but Docker had lost track of those containers and was waiting for someone to bring them back. No customer data was lost. The failed drive was a boot drive. Site content and databases live on separate storage and were never at risk.
The sites on the other workers were unaffected throughout.
Why this lasted four and a half hours instead of minutes
The hardware failed at 22:01 UTC. The fix was a five-minute job – restart the containers on the affected worker – but I didn't see the problem until about four and a half hours later. Nothing in our monitoring was watching for this specific failure mode, so no alert ever fired.
We monitor a lot of things (CPU, memory, disk space, storage cluster health, request latency, actual reachability of the Ghost sites), but we weren't watching the drives themselves for early warning signs or failure. If we had been, we'd most likely have caught this drive degrading days or weeks before it actually died – NVMe drives almost always show wear indicators before they fail outright.
So this wasn't really a hardware story. It was a monitoring gap. The hardware failed exactly the way hardware is allowed to fail, but we weren't watching.
The "reachability" monitoring didn't fire because Cloudflare's cache kept the Ghost frontends accessible. This means we also need monitoring for the Ghost Admin route. Until now, my reasoning was "we want to know when a Ghost frontend is down". That worked reasonably well with the old Bunny.net CDN, but Cloudflare, correctly, keeps serving cached frontends even when the backend is gone.
What we've already changed
I spent the morning after the incident closing this gap. As of today:
1. We now monitor drive health on every server.
Every Magic Pages server now runs an additional disk health monitor that reports drive wear, error counts, and the manufacturer's own end-of-life signals to our central monitoring system. This covers all our production workers, all the storage cluster nodes, and the boot drives on each of them.
2. We also monitor Ghost backends now.
Until now we checked each Ghost site's container health at the infrastructure level and the full "can I reach the site through the public internet" chain (frontend monitoring). We now also test the Ghost backend endpoints directly, even when the frontend is reachable.
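To illustrate the idea, here is a minimal Python sketch of a check that probes the frontend and the backend separately. It is not our actual monitoring code. The function names, the status labels, and the choice of `/ghost/api/admin/site/` as the backend probe path are assumptions made for the example.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the final HTTP status for url, or 0 if no response arrived."""
    request = urllib.request.Request(url, headers={"Cache-Control": "no-cache"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses (including CDN origin errors) land here.
        return err.code
    except OSError:
        # DNS failures, refused connections, and timeouts.
        return 0


def check_site(base_url: str, fetch: Callable[[str], int] = http_status) -> str:
    """Classify a site by probing the frontend and the backend separately."""
    base = base_url.rstrip("/")
    frontend_ok = 200 <= fetch(base + "/") < 400
    # Assumed probe path for this sketch; any uncached backend route would do.
    backend_ok = 200 <= fetch(base + "/ghost/api/admin/site/") < 400

    if frontend_ok and backend_ok:
        return "ok"
    if frontend_ok:
        # The case from this incident: the CDN serves a cached frontend
        # while the Ghost container behind it is gone.
        return "backend_down"
    return "down"
```

The point of the design is that a healthy frontend is never taken as proof of a healthy backend. The `fetch` parameter exists only so the logic can be exercised without real network calls.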
3. We now get alerts when these new monitors fail.
The alerts are tiered:
- Critical (page immediately): actual reliability problems when drives fail completely or a backend isn't reachable, even though the frontend is.
- Warning (less critical alerting on secondary channels): drives approaching end of warranty, sustained high temperatures, etc.
We tested the full chain end-to-end before relying on it: every failure scenario from that night triggered an alert within 30 seconds.
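As a rough illustration of the tiering, the sketch below maps one drive's health readings to a severity. The field names follow the NVMe health log as reported by tools such as smartctl in JSON mode. The thresholds and the function name are hypothetical, chosen for the example, and are not our production values.

```python
from typing import Mapping

# Hypothetical thresholds for this sketch only.
WEAR_WARNING_PERCENT = 90   # share of rated write endurance consumed
TEMP_WARNING_CELSIUS = 70   # drive temperature


def classify_drive(log: Mapping[str, int]) -> str:
    """Map one NVMe health log to 'critical', 'warning', or 'ok'."""
    # A non-zero critical_warning bitfield means the drive itself
    # is reporting a problem.
    if log.get("critical_warning", 0) != 0:
        return "critical"
    # Spare capacity at or below the drive's own threshold.
    if log.get("available_spare", 100) <= log.get("available_spare_threshold", 0):
        return "critical"
    # Early-warning signals: wear and heat.
    if log.get("percentage_used", 0) >= WEAR_WARNING_PERCENT:
        return "warning"
    if log.get("temperature", 0) >= TEMP_WARNING_CELSIUS:
        return "warning"
    return "ok"
```

A reading with `percentage_used` at 91 and nothing else wrong would land in the warning tier here, while any non-zero `critical_warning` would page immediately.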
4. We can also see drive health at a glance.
There's now a new dashboard in our monitoring suite (always up on a screen in my home office) showing wear, temperature, spare capacity, and critical warnings for every drive in the fleet. In contrast to the alerts, this gives us an early indication of how things are going.
5. The failed drive is replaced.
Hetzner replaced the dead drive on the failed worker in the morning. The mirror has been fully rebuilt onto the new drive and both halves are healthy.
What the new monitoring is already telling us
The monitoring system, within minutes of being switched on, surfaced two things worth knowing:
- Another drive is at 91% of its rated write endurance. Not failing, but worth keeping an eye on. We'll plan a proactive replacement before it fails the way the drive on the other worker did.
- Two other drives on a storage node are well beyond their manufacturer's warranty (they've been written to more than originally rated for). The drives' own self-checks say they're still healthy, but we're tracking these and will rotate them out at a calm moment in the next few days.
In both cases we found this before anything broke, which is exactly why we set up the monitoring properly now.
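For planning a proactive replacement, a simple linear projection from two wear readings gives a rough time budget. The sketch below assumes wear grows at a steady rate, which is a simplification. The function name and the sample numbers are hypothetical.

```python
from typing import Optional


def days_until_rated_endurance(
    used_before: float, used_now: float, days_between: float
) -> Optional[float]:
    """Estimate days until wear reaches 100% of rated endurance.

    used_before and used_now are wear percentages taken days_between
    days apart. Returns None when there is no measurable trend.
    """
    if used_now >= 100:
        # Already at or past the rated endurance.
        return 0.0
    delta = used_now - used_before
    if delta <= 0 or days_between <= 0:
        return None
    # Linear extrapolation: remaining wear divided by wear per day.
    return (100 - used_now) * days_between / delta
```

With made-up readings of 85% a month ago and 91% today, `days_until_rated_endurance(85, 91, 30)` works out to 45 days. Reaching 100% does not mean the drive fails on that day, only that it has used up what the manufacturer rated it for.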
Four and a half hours of downtime for ~15% of sites is a serious incident, even if it's not platform-wide. Better monitoring will solve this particular issue in the future. Hardware can and will fail – but we need to know about it sooner than we did last night.
If your site was affected and you'd like to talk through anything specific, including coverage under our SLA, please reach out at [email protected].
Jannis