Jahed Ahmed

Server Upgrade Takeaways

One of the servers I maintain is a VPS that's been running for around 5 years. Over the last year or two, it's been slowly dying. CentOS 6 was reaching end-of-life after a good 10 years and it was missing a lot of features that I'd expect from a Linux installation. In this post I'll be going through some of my experiences upgrading this server to CentOS 7.


As a rule of thumb, I plan dependency upgrades on a monthly basis to avoid future pain as documentation rots and services disappear. But, I never got around to upgrading this server... for multiple reasons.

It's a basic installation that I use to serve static files over HTTPS. There's some Lua scripting to optimise images and handle POST requests for logging and uploads, but that's about it. I've been considering moving all of this functionality to a serverless setup. The files can be stored in block storage and POST requests can be handled by serverless functions. It would reduce a lot of maintenance. Why haven't I done that?

The problem is that the server mostly maintains itself and I prefer paying a fixed price. I don't want to worry about traffic spikes driving up my costs. Why serverless providers don't have a fixed price setup is obvious: uncapped services bring in a lot more money, especially for larger businesses. It's the same reason mobile network providers didn't provide a cap until legislation forced them to. Sadly, even VPS providers are now looking to do the same with bandwidth charges. Sure you can save money by only paying for requests as they come in, but a single spike (DDoS or similar) can kill your wallet. For a business that risk may be manageable but for an individual, I have better things to worry about.

Luckily for this VPS I'm still on the "legacy" plan, so as long as I keep paying and they keep serving I'm good. But that's the thing, the VPS provider can easily retire their legacy plan. So I didn't want to commit to work that might be wasted. What forced my hand was CentOS 6 reaching end-of-life at the start of December. No security updates, no package registry, nothing. The server still covers my needs better than a serverless approach so the best option I had was to upgrade to CentOS 7.

Going Offline

The first step was to alert visitors that the server was undergoing maintenance. I don't want visitors to assume I'm an unreliable developer that can't even keep a server up. Ideally, for zero downtime, I could have spun up a new server, set that up and then switched over, but none of my services are critical so an advance warning and a day of downtime isn't a big deal.

For the maintenance page I used Cloudflare Workers which has a generous free tier. I already have all of my traffic going through Cloudflare so it was as simple as setting up a worker and redirecting all of my domains to the worker instead of my server (i.e. *.jahed.dev/*).

const body = `<!DOCTYPE html>
<h1>Under Maintenance</h1>
<p>The server is under maintenance to improve performance and stability.</p>
<p>Please come back in a few hours or tomorrow at the latest.</p>
<p><a href="#">More information</a></p>`;

addEventListener("fetch", (event) => {
  event.respondWith(
    new Response(body, {
      status: 503,
      headers: {
        "Content-Type": "text/html",
      },
    })
  );
});
I noticed that it takes a minute or so for the redirect to work, probably due to caching. After that, I re-installed the server and started re-configuring it.


Most of the configuration for the server is provisioned from a Git repository. Things like users, groups, sshd, iptables, nginx and so on. There's a bunch of files that I rsync over and a script that moves things to the right place and sets up the permissions. I've considered moving to Puppet, Ansible or similar as a side project but never really got around to it. Adding a layer of abstraction has its costs, and this setup is too simple to counter that.
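The script itself isn't shown here, but the "move things to the right place and set up the permissions" step can be sketched with coreutils' install, which copies and sets the mode in one go. The file name, contents and mode below are hypothetical examples, not the actual setup:

```shell
# "staged/" stands in for the rsync'd repository contents;
# "etc/" stands in for the real /etc destination.
mkdir -p staged etc
printf 'PasswordAuthentication no\n' > staged/sshd_config

# install copies the file and sets its permissions in a single step,
# avoiding separate cp and chmod calls.
install -m 0600 staged/sshd_config etc/sshd_config
```

When run as root, install also takes -o and -g to set ownership at the same time.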

I was worried that over the last 5 years I may have made a change that I didn't put into version control, but that wasn't the case. Outside of some new defaults, most of it was fine. The difference was that CentOS 7 changed a few commands, which is expected after 5 years, but the changes weren't well documented. Documentation for CentOS software is generally terrible, which I put down to wanting people to pay for RHEL. Can't blame them.

So I went through the setup script and made sure each command was still correct. I pretty much manually executed the script, which is better than running the script and assuming it would work without issue. Would a layer of abstraction solve this? Maybe, if you can trust the maintainers. I'd personally migrate the configuration piece-by-piece anyway to make sure it's behaving as previously expected, which isn't that different from what I did with the script.


One of the great things that I came to appreciate about iptables is how it's configured. You can write a single file with all of your rules and apply it using iptables-restore. Most tutorials for iptables have you running individual commands to apply rules, which to me is an odd and confusing way to configure it. Like writing a shell script using echo >> file.

CentOS 7 introduces firewalld as a replacement for iptables. The documentation for firewalld isn't great. It introduces common firewall concepts like zones and services. I tried understanding it but I really just couldn't care for it. It's essentially a layer of abstraction over iptables, ipset and other tools. There are XML files to define custom services and zones, and there were a bunch of defaults somewhere that I think need to be overridden if you want something more strict.

It introduces so much complexity that I didn't need. A server has interfaces and ports. My server has 1 interface and needs access to port 22 for SSH, and Cloudflare needs access to port 443 for HTTPS. Drop everything else. That's it.
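As a sketch, that whole policy fits in a single iptables-restore file. This is simplified: a real ruleset would also restrict port 443 to Cloudflare's published IP ranges rather than accepting from anywhere.

```
*filter
:INPUT DROP [0:0]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [0:0]
# Allow loopback and replies to established connections
-A INPUT -i lo -j ACCEPT
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
# SSH and HTTPS; everything else falls through to the DROP policy
-A INPUT -p tcp --dport 22 -j ACCEPT
-A INPUT -p tcp --dport 443 -j ACCEPT
COMMIT
```

Applying it is a single `iptables-restore < /etc/sysconfig/iptables`, and on CentOS 7 the iptables-services package loads that file at boot once firewalld is disabled.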

Luckily, you can turn off firewalld and go back to iptables, which is what I did, but I worry that one day iptables will be abandoned in favour of firewalld. I hope not.


When I originally set up OpenResty (which includes Nginx), I made a major error by changing its defaults. OpenResty installs itself under /usr/local/openresty including all of its binaries, configuration, pid file and even logs. It doesn't follow any directory standard. So I set up directories in the right places and pointed to them.

Of course, changing defaults makes migrations difficult. I have to set up all of those directories again and make sure everything works with the newer release. So I moved everything back to default. Except for logs, which should always be under /var/log for diagnostics.
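With everything else at OpenResty's defaults, keeping logs under /var/log comes down to a couple of nginx directives. The paths here are illustrative, not my exact configuration:

```
# nginx.conf
error_log /var/log/nginx/error.log warn;

http {
    access_log /var/log/nginx/access.log;
}
```

The log directory has to exist and be writable by the nginx worker user, so it's one of the few directories the setup script still needs to create.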

Making sure everything was moved probably took the most amount of time in this entire migration, but it was worth it and there's a lot less configuration now.

A Mysterious Directory

Often when I run ls some files and directories are printed in different colours and backgrounds. I know it's to do with their permissions but I never really put much thought into it.

On CentOS 7, /var/tmp has the following permissions: drwxrwxrwt. It's easy to miss that last character. It's usually x or -, but this one's a t. Under /var/tmp I have /var/tmp/nginx which I use to store on-the-fly optimised images. Nginx has access to this directory for exactly that.

On CentOS 7, Nginx refused to read or write to this directory. At first I thought it was OpenResty's Lua setup. Since OpenResty stopped officially supporting LuaRocks (in favour of its own OPM package manager which is barebones and just fragments the Lua ecosystem -- off-topic rant), maybe my configuration wasn't importing it correctly. Debugging Lua through Nginx is difficult so it took a while. Lots of logging and permutations of HTTP requests.

Eventually I saw /var/tmp was a colour I'd never quite seen before with ls. Then I saw that t. What does t even mean? t stands for "sticky bit and executable", whereas T stands for "sticky bit and NOT executable". It's a workaround to avoid adding another character to the ancient permissions string just for the sticky bit and breaking people's scripts. The sticky bit on a directory prevents users from deleting or renaming files within it that they don't own, which makes sense for a shared, world-writable directory like /var/tmp. It should probably be applied to more system directories for consistency but I don't know.
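The t/T distinction is easy to reproduce locally: it's just the sticky bit (the leading 1 in mode 1777) combined with, or without, the "others" execute bit.

```shell
# A world-writable directory with the sticky bit, like /var/tmp.
mkdir -p demo
chmod 1777 demo
ls -ld demo   # permissions show as drwxrwxrwt

# Drop the execute bit for "others" and ls shows T instead of t.
chmod 1776 demo
ls -ld demo   # permissions show as drwxrwxrwT
```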

... But Nginx isn't trying to rename or delete anything, it's trying to create files under /var/tmp/nginx which it owns and has full access to so why doesn't it work? Anyways, I decided to move it to /var/cache/nginx which makes more sense and it worked. Whatever's going on with /var/tmp, I'll stay away.


That's about all the notable takeaways. After some rsyncs everything was back to normal.

Thanks for reading.