Today, on 24 June 2025, from approximately 14:06 UTC to around 22:00 UTC, Magic Pages experienced an outage affecting a subset of websites. It was caused by a routine Ghost update, which then cascaded into parts of the infrastructure becoming unavailable.
What happened?
Whenever the Ghost sites on Magic Pages get updated, they technically need to start fresh. This happens in two steps. First, a new "pod" for a website is created within the Kubernetes environment. That pod can land anywhere, on any server. Once the new pod with the updated Ghost version has stabilised, the old one is removed. This way, the site is always online and the cutover is as smooth as possible.
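For the technically curious, this is roughly what such a zero-downtime rollout looks like, sketched with the Kubernetes Python client. The deployment and namespace names are placeholders, not the actual Magic Pages configuration.

```python
# Minimal sketch, not the actual Magic Pages setup: a rolling-update strategy
# that brings up one new pod per site and only removes the old pod once the
# new one is Ready. Deployment and namespace names are hypothetical.
from kubernetes import client, config

config.load_kube_config()          # or load_incluster_config() inside the cluster
apps = client.AppsV1Api()

apps.patch_namespaced_deployment(
    name="ghost-example-site",     # hypothetical site deployment
    namespace="sites",             # hypothetical namespace
    body={
        "spec": {
            "strategy": {
                "type": "RollingUpdate",
                "rollingUpdate": {
                    "maxUnavailable": 0,  # never take the running site offline
                    "maxSurge": 1,        # allow one extra pod during the update
                },
            }
        }
    },
)
```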
However, this also means that there is usually a spike in resource usage whenever the sites update. Kubernetes has a great option though: autoscaling. When the Kubernetes environment needs more resources, it automatically provisions them from Hetzner Cloud, my infrastructure provider.
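Conceptually, the autoscaler's job boils down to a simple check. The snippet below is only an illustrative sketch of that logic, not the real cluster-autoscaler or its Hetzner integration.

```python
# Illustrative sketch of the decision the autoscaler makes – not the real
# cluster-autoscaler code. Node and Pod are simplified stand-ins for the
# real Kubernetes objects.
from dataclasses import dataclass

@dataclass
class Node:
    cpu_free: float       # free CPU cores on the node
    memory_free: float    # free memory in GiB

@dataclass
class Pod:
    cpu_request: float
    memory_request: float

def fits_somewhere(pod: Pod, nodes: list[Node]) -> bool:
    """Can the pending pod be scheduled onto any existing node?"""
    return any(
        node.cpu_free >= pod.cpu_request
        and node.memory_free >= pod.memory_request
        for node in nodes
    )

def needs_scale_up(pending_pods: list[Pod], nodes: list[Node]) -> bool:
    """If at least one pending pod fits nowhere, new nodes have to be
    provisioned from the cloud provider (Hetzner Cloud, in this setup)."""
    return any(not fits_somewhere(pod, nodes) for pod in pending_pods)
```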
The autoscaling worked normally – until it didn't. The updates happen in batches. About 120 sites had been updated to the new version when errors started to appear: the updated pods could not be scheduled.
What that means is that none of the servers had enough free resources. At first, I wasn't too worried. The Hetzner console showed that some "emergency maintenance" was going on at their load balancers. While not directly related, I could not rule out that it was having an effect.
After a few minutes, I saw that this maintenance was finished. However, now there was "planned maintenance" for their cloud console, the dashboard I use to manually check servers, order new ones, etc.
Things weren't completely offline, but they were slow, and the API (used to automatically provision new servers) was rejecting requests from time to time. Fair enough, this can happen during maintenance.
At this point, the 120 websites that were mid-update were already offline. After about 15 minutes, things in the Hetzner cloud console recovered. In the meantime, the existing server nodes ran into disk pressure issues, meaning they needed to be removed.
Again, a routine operation: remove the faulty nodes (it is quicker to order new ones than to manually salvage them) and have the autoscaler provision replacements. But... nothing happened.
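For context, "removing" a node boils down to cordoning it, evicting its pods, and deleting the node object. Here is a rough sketch with the Kubernetes Python client; the node name is made up, and the underlying Hetzner server still has to be deleted separately.

```python
# Rough sketch of retiring a faulty node – not the exact steps I ran.
# The node name is hypothetical; the Hetzner VM itself is deleted separately.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

node_name = "worker-cax41-03"  # hypothetical node

# 1. Cordon: stop new pods from being scheduled onto the node.
core.patch_node(node_name, {"spec": {"unschedulable": True}})

# 2. Evict the pods still running there so they get rescheduled elsewhere.
pods = core.list_pod_for_all_namespaces(
    field_selector=f"spec.nodeName={node_name}"
).items
for pod in pods:
    core.create_namespaced_pod_eviction(
        name=pod.metadata.name,
        namespace=pod.metadata.namespace,
        body=client.V1Eviction(
            metadata=client.V1ObjectMeta(
                name=pod.metadata.name,
                namespace=pod.metadata.namespace,
            )
        ),
    )

# 3. Remove the node object; the autoscaler is expected to provision a
#    replacement once capacity is missing.
core.delete_node(node_name)
```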
I checked the autoscaler logs and saw that Hetzner reported that no servers of the type I needed (CAX41) were available. In fact, no ARM-based servers (which I use for environmental and economic reasons) were available at all.
That was problematic. The entire infrastructure was built on the assumption that at least one of Hetzner's three European data centers (Falkenstein, Nürnberg, Helsinki) would have CAX41 instances available.
In fact, since I started using Hetzner, I had never seen a situation where none of these servers were available.
The situation didn't improve. So, around 15:00 UTC, I started adding AMD64 servers, even though they are considerably more expensive and come with one big caveat: the images that run the Ghost sites needed to be rebuilt.
The problem was that I needed multi-architecture images that run on both the ARM and AMD64 server nodes. Getting that right took time and many tries. Every attempt had to run through the build process – and each build took about 30 minutes.
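For reference, a multi-architecture build is essentially a single docker buildx invocation that targets both platforms. The sketch below uses placeholder registry and image names, and the real Magic Pages build pipeline is more involved.

```python
# Placeholder sketch of a multi-architecture image build with docker buildx.
# Registry, image name, and tag are made up; assumes a buildx builder that
# can target both platforms (e.g. via QEMU or native builders).
import subprocess

subprocess.run(
    [
        "docker", "buildx", "build",
        "--platform", "linux/arm64,linux/amd64",               # one image, both node types
        "--tag", "registry.example.com/ghost-runtime:latest",  # hypothetical tag
        "--push",                                              # push so every node can pull it
        ".",
    ],
    check=True,
)
```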
Around 18:00 UTC I managed to get this sorted. The multi-arch image was ready.
Since then, I have slowly been bringing sites back up. Due to the cascading issues, the underlying storage engine, Longhorn, also ran into problems, which I needed to deal with as well.
(Side-note: another reason to rebuild the infrastructure).
While I was dealing with Longhorn, ARM instances slowly became available again, though I had to be careful not to overload the storage engine with recovery tasks.
By 23:00 UTC, all sites were online again. Thank you for your patience!
My day did end a little later though. Longhorn, as a storage engine, needs three replicas of a given storage volume to function properly. That is just... well, good practice. While the sites were all back up, some of them only had two replicas available. In the aftermath, I spent a significant amount of time going through around 40 sites one by one to see where the issue was.
Mostly, the cause was resource constraints. I fixed the most pressing ones right away and finally went to sleep around 02:30 UTC (to be fair, this was the first really long night in the history of Magic Pages).
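This kind of check can also be done programmatically. The sketch below lists Longhorn volumes that are not reporting as healthy, assuming a default installation in the longhorn-system namespace with Longhorn's v1beta2 CRDs.

```python
# Sketch: list Longhorn volumes that are not fully healthy (e.g. running with
# fewer replicas than desired). Assumes the default longhorn-system namespace
# and Longhorn's v1beta2 CRDs; adjust if your installation differs.
from kubernetes import client, config

config.load_kube_config()
crds = client.CustomObjectsApi()

volumes = crds.list_namespaced_custom_object(
    group="longhorn.io",
    version="v1beta2",
    namespace="longhorn-system",
    plural="volumes",
)

for vol in volumes["items"]:
    name = vol["metadata"]["name"]
    robustness = vol.get("status", {}).get("robustness", "unknown")
    if robustness != "healthy":
        print(f"{name}: {robustness}")
```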
Future Prevention and Next Steps
Based on this incident, I am convinced that the right step for Magic Pages is a complete overhaul of the infrastructure, as outlined on my personal blog. Bigger, dedicated servers tend to be more reliable and fault-tolerant than smaller cloud instances.
Right now, the Kubernetes cluster is heavily overprovisioned. It will stay that way for the foreseeable future, so that storage issues are minimised.
The next morning, I also spent some time analysing the resource constraints in Longhorn, which were the reason some of the sites took so long to come back up. I discovered another cascading effect:
As the Ghost sites start, they spike in CPU usage. At the same time, however, Longhorn needs CPU on the server that a Ghost site is scheduled on, to make sure that the site's storage volume is available there.
Since there was no concurrency limit configured in Longhorn, hundreds of sites (in this specific case, dozens) could restart at the same time – and Longhorn would try to force all of their volumes onto those same servers at once.
This led to CPU exhaustion, which in turn made the servers unavailable for 1-2 minutes. This interruption then led to Kubernetes and Longhorn looking for other places to schedule the sites. Ugh...you see where this is going.
So, I went through the Longhorn configuration and limited the number of parallel operations. There is now also a dedicated CPU and memory budget per server that will never be touched by Longhorn, so that the server itself always stays responsive.
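As an example of such a limit, Longhorn exposes a setting that caps how many replica rebuilds may run per node at once. The sketch below sets it via the Kubernetes API; the value shown is illustrative, not necessarily the exact number Magic Pages uses.

```python
# Sketch: cap how many replica rebuilds Longhorn may run per node at once.
# "concurrent-replica-rebuild-per-node-limit" is an existing Longhorn setting;
# the value of 2 is only an example.
from kubernetes import client, config

config.load_kube_config()
crds = client.CustomObjectsApi()

crds.patch_namespaced_custom_object(
    group="longhorn.io",
    version="v1beta2",
    namespace="longhorn-system",
    plural="settings",
    name="concurrent-replica-rebuild-per-node-limit",
    body={"value": "2"},  # Longhorn stores setting values as strings
)
```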
So far, this is looking good. Replica recovery in Longhorn takes longer, but it is also significantly more reliable.
This outage was disruptive and stressful, both for me and undoubtedly for you. I want to apologise for the downtime and the lack of access to your Ghost sites during this period.
If you still see any issues or have questions, please send me a quick email to help@magicpages.co.

About Jannis Fedoruk-Betschki
I'm the founder of Magic Pages, providing managed Ghost hosting that makes it easy to focus on your content instead of technical details.