As most of you know, I am running Magic Pages, a managed Ghost CMS hosting platform. And if you've followed this blog, you might also remember that for about a year now, the entire hosting has been running on Kubernetes.
Kubernetes is kind of like an orchestra conductor. But instead of musicians, it manages many servers, telling them what to do, distributing work, and making sure everything runs smoothly. If a "musician" (well...server) breaks down, Kubernetes brings in a replacement so the music never stops.
But apart from playing music (e.g. serving websites), Kubernetes also needs a place to store files. In this orchestra analogy, think of storage as the sheet music library. Every musician needs access to the right notes at the right time. Without sheet music, our orchestra would be in total chaos. Nobody would know what notes to play. Granted, some of it might be fresh in their minds (cached), but usually, even the best musicians cannot remember an entire symphony by heart.
For the past year, Magic Pages has been using something called NFS (Network File System) for storage. NFS is like having one central music librarian with all the sheet music: every musician gets their music from this single person, who hands it out at the beginning of the concert. And that worked well enough, until...
...the Librarian Got Overwhelmed
Over time, two problems popped up every now and then. The first one: occasional blackouts. While NFS as a protocol is quite good (in comparison to how it was a few years ago), I noticed hiccups under higher load, in particular when I ran updates on Ghost sites.
What happened (as far as I can reconstruct this) was that multiple Ghost sites were updated to a new Ghost version, and had to write and request loads of files at the same time. This then led to some connectivity issues and well...boom.
The NFS server had a hiccup of just 1-2 seconds – but the Ghost sites couldn't connect anymore, running into errors.
Now, Kubernetes – our conductor – has the ability to self-heal. But since NFS is so loosely integrated, this never properly worked, in my experience.
The second issue I was slowly running into was space. Magic Pages grew to over 500 sites, and yeah...storage capacity is limited. So far, my plan had been to simply order bigger drives from my server provider. But I have now taken this as an opportunity to rethink how I want to store the data for all the Ghost sites running on Magic Pages.
Storage Classes and Longhorn
In Kubernetes, there's this concept called "Storage Classes" – think of them as different ways to organize our music library. The NFS approach I was using is just one way: having a single librarian with all the sheet music. But there are other, more resilient approaches.
Take Longhorn, for example – a distributed storage system for Kubernetes, and a lot more native to the Kubernetes landscape than NFS (which was never built for this in the first place). Longhorn is like having multiple librarians, each with copies of all the important sheet music. If one librarian gets overwhelmed (or a storage server fails), the musicians can still get their music from another librarian without missing a beat.
But even better, Longhorn fully understands how our conductor works and integrates seamlessly with it. When a server goes down, Longhorn doesn't just panic (like NFS would – it would just crash and say bye 👋). It automatically ensures that the data remains accessible from other servers, and – if necessary – creates new copies of our data in different places.
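To make this a bit more concrete: the replication behaviour is configured right in the storage class. Here is a minimal sketch of what a Longhorn storage class with three replicas can look like (the name and parameter values are illustrative, not my production setup):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-replicated   # illustrative name
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "3"       # keep three copies of every volume on different nodes
  staleReplicaTimeout: "30"   # minutes before a faulty replica is cleaned up and rebuilt
```

Every volume created from a class like this automatically gets three replicas spread across different servers – exactly the "multiple librarians" behaviour from the analogy above.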
But this sounds too good to be true, right? Well, there is a caveat.
Switching Librarians Mid-Concert
Once I decided to move from NFS to Longhorn, I faced a challenge I had tried to ignore up to that point: how do you change your entire storage system without interrupting the concert?
With over 500 Ghost websites relying on the current storage system, I couldn't just say "Sorry everyone, we're going offline for a day while I rearrange things." That would be unacceptable for a service that promises reliability.
I have already talked about a similar challenge in this post, related to switching the database engine while keeping the lights on:

Initially, I wanted to use a tool built for moving storage from one storage class to another – I tested it on one of my test sites, and it worked well.
However, since it is a CLI (command line interface) tool, I would have had to manually run the command for every single site. Ughhh...
So, I needed to find a way to:
- Create new storage spaces in Longhorn
- Copy all the existing data from NFS to Longhorn
- Switch each Ghost site to use the new storage
- Verify everything works
- Clean up the old storage
All while keeping every site online and functioning normally. No biggie, right?
Building a Migration Script
I knew that doing this manually for 500+ sites would be both error-prone and take forever. So I built a migration script within my backend.
Since the exact steps will always depend on your specific situation, I see little point in sharing the exact script, but here is the rough breakdown:
- Find all the storage volumes (PVCs in Kubernetes speak) used by a Ghost site.
- Identify the correct volume we want to migrate.
- Set up equivalent volumes in Longhorn.
- Copy all data from NFS to Longhorn using a Kubernetes job that well...simply runs good old rsync (a rough sketch of such a job follows this list).
- Modify the Ghost site to use the new storage (this was the trickiest part, since I ran into issues with my own deployment process...which ended up in 2 days of refactoring 😂).
- One more data copy to catch any changes that happened during the update. That's important for bigger sites – moving a site with a few gigabytes of data can take a few minutes, and during that time, the data on disk can change.
- Keep both storages around for a while to ensure everything works (7 days, in my case – so, if issues arise, I can always switch back).
- Remove the old NFS storage once I am confident.
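For the curious, here is roughly what that copy step looks like as a Kubernetes job. This is a simplified sketch with made-up names (the real manifests are generated by my backend), but the idea is simply to mount the old and the new volume side by side and let rsync do the heavy lifting:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: migrate-example-site        # one job per site; all names are placeholders
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: rsync
          image: instrumentisto/rsync-ssh   # any small image with rsync on board works
          # -a preserves permissions, ownership, and symlinks; --delete keeps the
          # target in sync if files were removed on the source
          command: ["rsync", "-a", "--delete", "/old-data/", "/new-data/"]
          volumeMounts:
            - name: nfs-volume
              mountPath: /old-data
            - name: longhorn-volume
              mountPath: /new-data
      volumes:
        - name: nfs-volume
          persistentVolumeClaim:
            claimName: example-site-content           # the existing NFS-backed PVC
        - name: longhorn-volume
          persistentVolumeClaim:
            claimName: example-site-content-longhorn  # the new Longhorn PVC
```

Since rsync only transfers what has changed, re-running the same job later takes care of the catch-up copy from the list above in a fraction of the time.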
Overpromising
One interesting aspect of Longhorn is how it handles storage allocation. While Magic Pages has an unlimited storage policy, Kubernetes doesn't work this way. It always wants to know how much storage a volume can have. NFS didn't care about that. As I mentioned, it was loosely integrated into Kubernetes, so I just slapped a 5GB storage allocation on every volume and tada...NFS never bothered when a site went over that limit.
I wanted to keep this for Longhorn as well, since the average site on Magic Pages only uses about 500MB of storage – far from the 5GB limit.
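In practice, that just means every site's new volume claim requests 5GB from the Longhorn storage class – roughly something like this (again with hypothetical names, matching the sketches above):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-site-content-longhorn    # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn-replicated  # the illustrative class from earlier
  resources:
    requests:
      storage: 5Gi                        # the "on paper" allocation per site
```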
With 500 websites, each allocated 5GB of space, you'd think we'd need at least 2.5TB of raw storage. Then, we want to have 3 replicas of that storage, so we have high availability (basically, every musician has two backups readily available at any point).
So...7.5TB?
Thankfully, not. Storage is cheap nowadays, but not THAT cheap.
One deep-dive into Longhorn later, I learned about a feature called "Storage Overprovisioning." It's a bit like airlines selling more tickets than there are seats on the plane – they know that statistically, some people won't show up (and sometimes you can snag a nice compensation, if somebody does have to get bumped).
In storage terms, Longhorn knows that most of my volumes won't actually use their full 5GB allocation, while others might be over that limit. So I can "promise" more storage than I physically have available.
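If Longhorn is deployed through its Helm chart, the knob for this lives in the default settings. A rough sketch of what such a configuration can look like – the percentages here are examples for illustration, not necessarily the exact values I run:

```yaml
# Excerpt from a Longhorn Helm values file (values are illustrative)
defaultSettings:
  # Allow scheduling volumes worth up to 500% of the physically available space
  storageOverProvisioningPercentage: 500
  # Stop scheduling new replicas on a node once less than 10% of its disk is free
  storageMinimalAvailablePercentage: 10
```

With something like this in place, Kubernetes still sees every volume's 5GB request, but Longhorn only reserves the space that is actually written.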
Despite allocating "on paper" around 7.5TB of storage (500 sites × 5GB × 3 replicas), after the migration, I am only using about 1.2TB.
And that's the beauty of thin provisioning (yes, a new buzzword I picked up while learning about Longhorn): Longhorn only consumes what it actually needs. More on that here:

The Migration Experience
So, I started the big move. I decided to do it in batches, starting with my own sites as guinea pigs (e.g. this blog). Then I moved to bigger customer sites in small groups. The first batch went mostly smoothly, but there were a few hiccups:
- There were some weird temporary files in some volumes that caused issues.
- A few sites had symlinks that needed special handling (technically, all Ghost sites have symlinks, due to how the Source and Casper themes are handled, but for some – mostly migrated sites from other managed hosting services – these behaved strangely).
- Some sites seemed to have corrupted theme files (uploaded by the customers) that worked fine on NFS but caused issues on Longhorn (stricter file system checks, I assume?).
After a few days of migrating sites in increasingly larger batches, all 500+ Ghost sites are now running on Longhorn's storage class 🎉
But What Does It Mean?
The most noticeable improvement is reliability – no more random NFS disconnects causing sites to go down. Those disconnects only ever happened 2-3 times, but knowing they can't happen anymore does give me peace of mind.
When a server has issues now, Longhorn serves files from another replica within the blink of an eye. That self-healing aspect is a game-changer compared to the NFS server, and highly welcome for an upcoming holiday in May.
Performance is also noticeably better, especially under load. The distributed nature of Longhorn means file access is spread across multiple servers rather than well...one. This is particularly visible during the Ghost updates I mentioned earlier.
Perhaps the most valuable insight from this migration is that infrastructure decisions have cascading effects. What worked for 100 sites didn't scale well to 500. Duhh...should have known that, right? Well, in retrospect we're always smarter. Honestly, I am just glad I invested the time to do this migration now, rather than when problems became more and more prominent 😅