Post-Mortem on 29 January 2026 Outage

Jannis Fedoruk-Betschki

Today, on 29 January 2026, from approximately 13:30 UTC to around 22:45 UTC, Magic Pages experienced an extended outage affecting customer websites. This was one of the longer outages Magic Pages has had, and I want to be fully transparent about what happened.

What happened?

Sites started throwing 502 and 503 errors around 13:30 UTC. The issue quickly became clear: the Hetzner load balancer couldn't see healthy endpoints from the Docker Swarm, and therefore refused to route traffic.

Digging deeper, I found that Caddy – the reverse proxy that handles traffic for all 1,200 sites – couldn't see the services it was supposed to route to. The underlying Docker network had become corrupted, preventing Caddy from publishing port 80 correctly.
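For anyone curious what that looks like in practice: a quick way to confirm the symptom is to inspect the proxy service's published ports through the Docker API. Here's a minimal sketch using the Docker SDK for Python, where the service name "caddy" is an assumption on my part:

```python
import docker

client = docker.from_env()

proxy = client.services.get("caddy")   # hypothetical name of the Caddy proxy service
ports = proxy.attrs.get("Endpoint", {}).get("Ports", [])

# On a healthy swarm, ports 80/443 show up here with a PublishedPort.
# An empty or incomplete list is one symptom of broken overlay networking.
for p in ports:
    print(f"{p.get('PublishedPort')} -> {p.get('TargetPort')}/{p.get('Protocol')}")
if not ports:
    print("No published ports: the load balancer cannot reach the proxy.")
```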

This network corruption was likely a side effect of earlier troubleshooting I had done to solve a different problem: stale Docker DNS. In Docker Swarm, containers communicate through internal DNS, and there were cases where the reverse proxy was getting outdated IP addresses for container names. My attempts to fix that seem to have triggered the network corruption.
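For context: inside a Swarm overlay network, Docker's embedded DNS answers two kinds of lookups per service, the bare service name (a virtual IP) and tasks.<name> (the individual task IPs). A stale answer on either one sends traffic to containers that no longer exist. A minimal sketch of that lookup, run from inside a container on the same network, with a hypothetical service name:

```python
import socket

SERVICE = "site-example-com"   # hypothetical Swarm service name

# Docker's embedded DNS (127.0.0.11 inside containers) answers both lookups:
# the bare service name resolves to the service's virtual IP, while
# tasks.<name> resolves to the individual task IPs behind it.
vip = {info[4][0] for info in socket.getaddrinfo(SERVICE, 2368)}
task_ips = {info[4][0] for info in socket.getaddrinfo(f"tasks.{SERVICE}", 2368)}

print("VIP:     ", vip)
print("Task IPs:", task_ips)   # stale entries here point at dead containers
```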

The standard fix for a corrupted Docker network is to remove it from all services, delete it, recreate it, and re-add it. Since sites were already down, I figured: there's no other way to get them back. Let's do it.
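For reference, that standard fix maps onto a handful of Docker API calls. A rough sketch with the Docker SDK for Python; the network name is hypothetical, and a real script would also have to wait for each service update to converge:

```python
import docker

client = docker.from_env()
NET = "proxy_net"   # hypothetical name of the shared overlay network

# 1. Find every service attached to the corrupted network and detach it.
old = client.networks.get(NET)
attached = [
    s for s in client.services.list()
    if any(n.get("Target") == old.id
           for n in s.attrs["Spec"]["TaskTemplate"].get("Networks", []))
]
for svc in attached:
    svc.update(networks=[])   # roughly `docker service update --network-rm`

# 2. Delete the broken network and recreate it.
old.remove()
client.networks.create(NET, driver="overlay", attachable=True)

# 3. Re-attach every service to the fresh network.
for svc in attached:
    svc.update(networks=[NET])
```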

I removed the network, deleted it, recreated it. Still corrupted. The network allocator itself was broken – every new network immediately lost its state.

Searching for a solution pointed me to restarting the Docker daemon. With three manager nodes, I could do this one at a time without losing cluster state. So I restarted the daemon on the first manager node. Docker promptly lost contact with the other nodes. The swarm fell apart.
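For context, the rolling restart only stays safe while the other two managers remain reachable, so the Raft quorum survives. The safety check between restarts looks roughly like this (a sketch with the Docker SDK for Python, assumed to run against one of the managers):

```python
import docker

client = docker.from_env()

def managers_healthy() -> bool:
    """True only if every manager node reports itself as reachable."""
    managers = client.nodes.list(filters={"role": "manager"})
    return all(
        node.attrs.get("ManagerStatus", {}).get("Reachability") == "reachable"
        for node in managers
    )

# After restarting the Docker daemon on one manager (e.g. via SSH and
# `systemctl restart docker`), wait for quorum before touching the next one.
if not managers_healthy():
    raise SystemExit("Swarm has not regained a healthy manager quorum yet.")
```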

I tried to rebuild the swarm while keeping all data intact. That worked, technically. But the Raft consensus data had grown so large that it overwhelmed each manager node whenever a second one tried to join. With 4 CPU cores and 8GB of memory per node – normally plenty – the resource usage spiked to the point where only one manager could be active at a time.

New plan: take a Hetzner snapshot and spin up a manager with more resources. But Hetzner had no capacity available in the Falkenstein datacenter, so I had to use Nuremberg instead. The snapshot took about 20 minutes. Spinning up the new server took another 20 minutes. And because the new server was in a different datacenter, I couldn't reuse the original external IP address – which I needed for the TLS-authenticated deployment scripts.

By 18:00 UTC, I had spent hours troubleshooting what turned out to be an exotic series of cascading failures. At that point, I made a decision: stop debugging, start rebuilding.

How did I fix it?

I created an entirely new Docker Swarm cluster from scratch. The key insight here is that the swarm itself is just an execution layer – it doesn't hold any customer data. All content data lives on Ceph, and all databases are on a separate external cluster. The swarm is replaceable.

The new cluster came up quickly. All 1,200 sites were deployed from configuration within about 5 minutes.
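That redeployment is essentially a loop over stored service definitions. A minimal sketch of the idea, assuming a hypothetical sites.json with an image, name, and domain per site (the real deployment scripts obviously carry more detail):

```python
import json
import docker

client = docker.from_env()

# Hypothetical configuration store: one entry per customer site.
with open("sites.json") as f:
    sites = json.load(f)

for site in sites:
    client.services.create(
        image=site["image"],                    # e.g. a pinned Ghost image
        name=site["name"],
        env=[f"url=https://{site['domain']}"],  # Ghost's public URL setting
        networks=["proxy_net"],                 # hypothetical shared overlay network
    )
```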

But then: similar issues. The Hetzner load balancer still couldn't see the health endpoints and refused to route traffic.

This time, I took a more surgical approach. The Caddy Docker Proxy uses a merge approach: it has a default Caddyfile (which in my case defines ActivityPub and Traffic Analytics routes for all sites) and then merges that with label-based configurations for every Ghost site.
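To make that concrete: each Ghost service carries labels that the Caddy Docker Proxy turns into a site block and merges with the default Caddyfile. Here's a sketch of what those labels look like when creating a service through the Docker SDK for Python; the image, names, and network are placeholders:

```python
import docker

client = docker.from_env()

# lucaslorentz/caddy-docker-proxy reads these labels and generates roughly
#   example.com { reverse_proxy <task addresses>:2368 }
# which it then merges with the base Caddyfile (ActivityPub, analytics, etc.).
client.services.create(
    image="ghost:5",                    # placeholder Ghost image
    name="site-example-com",
    networks=["proxy_net"],
    labels={
        "caddy": "example.com",
        "caddy.reverse_proxy": "{{upstreams 2368}}",   # Ghost listens on 2368
    },
)
```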

The problem: no useful logging. Only when I worked through the configuration from scratch did I discover that the Caddyfile had syntax issues that prevented the Caddy proxy from starting properly – yet no errors were thrown, and the container reported itself as healthy.

After fixing the configuration, Caddy still wouldn't start correctly. This time: CPU saturation. All 8 cores were fully utilised because 1,200 Ghost sites being added to Caddy at the same time generated too many Docker events.

The solution was straightforward once I understood the problem: batch the deployments. I temporarily scaled down some Ghost sites, let Caddy catch up, and then brought sites back online in groups.
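For the curious, the batching boils down to something like this. A sketch with the Docker SDK for Python; the batch size, pause, and naming scheme are all illustrative, not the exact values I used:

```python
import time
import docker

client = docker.from_env()

BATCH_SIZE = 50        # illustrative: small enough for Caddy to keep up
PAUSE_SECONDS = 60     # give the proxy time to digest the Docker events

# Hypothetical naming scheme for the Ghost site services.
ghost_services = [s for s in client.services.list() if s.name.startswith("site-")]

# Scale the affected sites down first, then bring them back in groups.
for svc in ghost_services:
    svc.scale(0)

for i in range(0, len(ghost_services), BATCH_SIZE):
    for svc in ghost_services[i:i + BATCH_SIZE]:
        svc.scale(1)
    time.sleep(PAUSE_SECONDS)
```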

By 22:45 UTC, all sites were back online.

What worked well?

  1. The data layer held solid. Customer content on Ceph and databases on the external cluster were completely unaffected. No data was lost.
  2. Cloudflare's caching performed exactly as hoped. The migration to Cloudflare that I've been working on paid off during this incident. Most sites' frontends were cached and served to visitors with minimal interruption – a noticeable improvement over Bunny.net, where caching sometimes missed assets like JavaScript or images.
  3. The "rebuild from scratch" decision saved time. Once I committed to spinning up a new cluster instead of continuing to debug the corrupted one, recovery was much faster.
  4. Configuration as code. Having all 1,200 site configurations stored externally meant I could redeploy everything in minutes once the infrastructure was ready.

What could be improved?

  1. The networking stack needs work. Between us: I'm not 100% confident in it. I've had issues with every reverse proxy I've tried – Traefik created stale DNS issues that led to misrouted traffic, Caddy standalone ran into stale DNS when I migrated more sites from Kubernetes to the Swarm, and now the Caddy Docker Proxy, while promising to circumvent Docker's internal DNS, has its own quirks. The label merge approach feels a bit flimsy.
  2. Better logging in the Caddy Docker Proxy. Configuration errors should surface clearly, not fail silently while reporting the container as healthy.
  3. Deployment batching should be default. Adding 1,200 sites simultaneously is clearly too much. The deployment process needs built-in throttling.
  4. Faster decision-making. In hindsight, I spent too long trying to fix the corrupted swarm before deciding to rebuild. The execution layer is replaceable – I should have made that call sooner.

Future Prevention

Based on this incident, I'm taking the following steps:

  1. Review the networking architecture. I need to investigate alternatives to the current reverse proxy setup, or at least implement better health checks and logging.
  2. Implement deployment batching. Caddy needs time to process Docker events. I'll add automatic throttling when deploying large numbers of services.
  3. Document the "rebuild threshold." When a Docker Swarm issue looks exotic and the swarm is an execution layer with data stored elsewhere, rebuilding is often faster than debugging. I need clearer criteria for when to make that call.
  4. Improve Caddy configuration validation. Ideally, configuration errors should fail loudly during deployment, not silently at runtime.
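One way to do that last point: run Caddy's own validation against the generated configuration before it ever reaches the proxy. A minimal sketch; the config path and adapter are assumptions about how the file is laid out:

```python
import subprocess
import sys

# `caddy validate` parses the config and exits non-zero on errors, so a broken
# Caddyfile fails the deployment instead of silently crippling the running proxy.
result = subprocess.run(
    ["caddy", "validate", "--config", "/etc/caddy/Caddyfile", "--adapter", "caddyfile"],
    capture_output=True,
    text=True,
)
if result.returncode != 0:
    sys.exit(f"Caddyfile rejected:\n{result.stderr}")
```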

It's been a shitty day. For you as customers, for me sitting here trying to solve this, for your visitors who might have experienced interruptions. I apologise for the downtime and the stress this caused.

The one silver lining: this incident validated the Cloudflare migration. Seeing cached frontends stay available during an infrastructure meltdown is exactly the resilience I was hoping for.

If you're still seeing any issues or have questions, please send me a quick email to [email protected].

PS: There is still an issue with routing ActivityPub and some Traffic Analytics data. For now, I want to let the stack stabilise and not add any more stress to it. I'll get to that tomorrow, promise.
