Yes. Tech is humbling.
And a quick note before we start: this will be technical. The tl;dr:
But, let's backtrack a bit.
If you remember, the month of March brought the biggest outages to date to Magic Pages. That wasn't nice – and all of the outages were down to a single point of failure. From a failed update on a proxy server to wrong abuse messages for one of Magic Pages' servers. March had it all.
So, back then I came up with a simple solution: instead of relying on a single point of failure, let's make sure that the most important Magic Pages services are all replicated in so-called "high availability clusters".
These clusters are basically just tech-fancy-speak for "whenever a server goes down, there is another one to jump in". Easy peasy lemon squeezy, right?
Well, no. Reality is that putting up these clusters is a massive pain. There are countless little gotchas you need to keep in mind. Like, where will the data be stored? If you have two servers running a Ghost website, they need to have a shared space where they can put the theme files. Or, how will the database be handled? Shouldn't that be clustered as well? (The answer is "yes" – but as it turned out, the database was the easier thing.)
It took me the better part of April to figure out these details. To sit down, draw up a plan, test that plan, and calculate how much it would cost (about half of my monthly recurring revenue right now 🙃).
It then took another two weeks to implement that plan. And yesterday, it was time. I migrated the first websites – my own – onto the new cluster infrastructure. This afternoon, I did the same for about a third of the Magic Pages customers.
At first, everything seemed to run smoothly. The websites were all available. There was a tiny issue with some images. But overall, everything was stable. Like I had planned.
Until…a customer reached out and said that a newly published post on their Ghost website wasn't available. That was odd. I checked. And I saw it. I checked again. And it was gone.
Huh?? Checked again…and it was there. Refreshed the page…poof, gone. It's like I played hide and seek with Ghost. One second the post was there, then it was gone.
After thinking this through for a minute or two, I had my lightbulb moment. High availability clustering means that you always try to put two copies of any service up. And…there were two "states of truth". One where the post existed and one where it didn't.
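Here's a toy sketch of that situation in Python. It is not Ghost's code, and the names `ToyInstance` and `RoundRobinBalancer` are made up for illustration. It only assumes two things: each copy of the website keeps its own private view of which posts exist, and incoming requests alternate between the copies.

```python
from itertools import cycle


class ToyInstance:
    """Hypothetical stand-in for one running copy of a site (not Ghost's real code)."""

    def __init__(self, name):
        self.name = name
        self.posts = set()  # assumption: a private, in-memory view of published posts

    def publish(self, slug):
        self.posts.add(slug)

    def has_post(self, slug):
        return slug in self.posts


class RoundRobinBalancer:
    """Sends each request to the next instance in turn."""

    def __init__(self, instances):
        self._next = cycle(instances)

    def route(self):
        return next(self._next)


def refresh_page(instances, slug, refreshes):
    """Publish a post once, then refresh the page a few times."""
    balancer = RoundRobinBalancer(instances)
    balancer.route().publish(slug)  # the publish request lands on one instance only
    return [balancer.route().has_post(slug) for _ in range(refreshes)]


two_copies = refresh_page([ToyInstance("a"), ToyInstance("b")], "new-post", 4)
one_copy = refresh_page([ToyInstance("a")], "new-post", 4)

print(two_copies)  # [False, True, False, True]
print(one_copy)    # [True, True, True, True]
```

With two copies, every other refresh hits the copy that never heard about the post. With a single copy, there is only one state of truth and the post stays put.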
(Yes, the nerds among you will now point out that this was technically Schrödinger's Post – and I hope you'll appreciate that I considered that as a post title for a second 👻)
So, I ran an experiment. What if I told my cluster to only create one version of the Ghost website – and not two? Would this then make the post appear? Or disappear? Or…not change anything at all?
Well, the post appeared. Success 🎉
I could refresh and refresh the page – and the post was there. No issues at all.
Yet, the biggest question remained: why?
I pondered this question for a few hours. I dug deep into Ghost's source code. Analysed hundreds of lines of logs. Asked ChatGPT to help me analyse the situation. Searched forums, Reddit, and StackOverflow.
Then I typed the following into Google: "Ghost CMS cluster"
And there it was: official Ghost documentation about clustering…I thought…
Except…that documentation actually says the following:
Ghost doesn’t support load-balanced clustering or multi-server setups of any description, there should only be one Ghost instance per site.
Huh? Did I read that correctly? As it turned out…all my efforts were…useless?
I literally felt like somebody had punched me in the stomach. It must have been a good minute that I just sat in front of my laptop, staring in disbelief.
Not because of what I had just read, but because I realised that I had skipped a fundamental step every developer should take: Read. That. Documentation.
Instead of doing that, I went ahead and drafted plans on my own. Instead of consulting the Ghost documentation, I asked ChatGPT to help me analyse the logs and brainstorm reasons for the two different application states.
So yeah, tech is humbling. Especially when you try to chase a non-existent Ghost in an imaginary game of hide and seek.
But, Jannis, what does that mean now? No high availability clustering for Magic Pages?
Thankfully, high availability clustering has two sides to it:
- a control plane ("managers")
- and workers
The managers do a few things, but essentially they tell the workers what to do. For Magic Pages, that means saying "I need a Ghost website with this configuration". The workers then implement that.
So far, the plan was to always have two instances of that Ghost website ready. So, in case one of the Magic Pages servers goes down, the other one already has a copy and can take over.
With these new restrictions, only a single Ghost website will be created. However, the control plane is still highly available. There are three managers right now, so that there is always some backup. In case the "leader" of the managers goes offline, one of the other two will jump in and start delegating work.
So, in practical terms, this means that an offline worker will not be replaced instantly. Instead, there will be a small delay of about 5 seconds before another worker takes over.
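The manager and worker split can be sketched as a small toy model. Everything here is hypothetical: `ToyCluster`, the manager and worker names, and the "first online manager is the leader" rule are simplifications for illustration, not the API of any real orchestrator. Real clusters pick their leader through a proper election.

```python
class ToyCluster:
    """Hypothetical model of managers and workers (not a real orchestrator API)."""

    def __init__(self, managers, workers):
        self.managers = {name: True for name in managers}  # name -> online?
        self.workers = {name: True for name in workers}    # name -> online?
        self.placement = {}  # site -> worker currently running its single copy

    def leader(self):
        # Simplifying assumption: the first online manager acts as the leader.
        for name, online in self.managers.items():
            if online:
                return name
        raise RuntimeError("no manager online")

    def reconcile(self, site):
        """The leader makes sure exactly one copy of the site runs on an online worker."""
        self.leader()  # raises if no manager is left to delegate work
        current = self.placement.get(site)
        if current is None or not self.workers[current]:
            replacement = next((w for w, ok in self.workers.items() if ok), None)
            if replacement is None:
                raise RuntimeError("no worker online")
            self.placement[site] = replacement
        return self.placement[site]


cluster = ToyCluster(["m1", "m2", "m3"], ["w1", "w2"])
first_home = cluster.reconcile("example-site")   # "w1"

cluster.managers["m1"] = False  # the leader goes offline
cluster.workers["w1"] = False   # the worker running the site goes offline too

new_leader = cluster.leader()                    # "m2"
second_home = cluster.reconcile("example-site")  # "w2"
```

In this model, the delay is simply the gap between the worker going offline and the next `reconcile` call: there is no warm second copy, so the site has to be started on the replacement worker first.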
In my eyes, this is still a major win. Better availability for all Magic Pages customers. And one hell of a humbling experience for me 😉