29 Oct, 2019

Improving FrontierNav's Tests

Previously FrontierNav's tests ran against the local development build. This meant I only had to maintain two environments: development and production.

Production builds are optimised for delivery across a wide range of browsers and networks. The builds are obfuscated, minified and split up, making them difficult to debug.
Development builds are optimised for debugging. They contain a lot more logging and are less optimised to avoid adding layers of transformation over the original source code.

These differences meant that, in some cases, the optimisations applied to production would cause problems. Since I wasn't testing against the production build, the only way to know was to test it manually or check the error logs. Sometimes I'd pick up the problem immediately; other times, it'd take days to discover.

The solution is obvious: test the production build. This week, I finally got around to doing it.

Production but not really

The first issue is that I don't really want to test in production. I want to test the build, but I don't want to affect users. Users should not see a "Test User" going around the app, creating and deleting things to make sure it all works. So I created a new "environment" to differentiate between production and what's typically called "staging", which points to different databases and services.

A New Environment

The concept of an environment is extremely bloated. Node.js has its own "NODE_ENV" which is handled by most libraries and frameworks as a boolean; it's either "production" or not. If you want a production-optimised build, "NODE_ENV=production" is needed, so I'll need to keep it around. Setting it to "staging" would be no different from setting it to "development". I need it to be "production" for "staging".

This new FrontierNav-specific environment, "FN_ENV", can be set directly or it'll fall back to whatever NODE_ENV is. In short, NODE_ENV defines the build, whereas FN_ENV defines the deployment.

# FN_ENV will fall back to "development"
NODE_ENV=development webpack

# Explicitly set FN_ENV to "staging"
NODE_ENV=production FN_ENV=staging webpack

# FN_ENV will fall back to "production"
NODE_ENV=production webpack

I could have added a step to the execution that only takes an FN_ENV and calculates the NODE_ENV, but since these flags are only set in a few places, adding and maintaining that additional layer is excessive right now.
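The fallback itself is small enough to sketch. This is only an illustration of the rule described above, not FrontierNav's actual code: the function name resolveFnEnv, the list of known environments and the validation step are all assumptions of mine.

```javascript
// Hypothetical helper: works out the deployment environment (FN_ENV)
// from the given environment variables.
// Assumption: only these three values are considered valid.
const KNOWN_ENVS = ["development", "staging", "production"];

function resolveFnEnv(env = process.env) {
  // NODE_ENV defines the build. Treating an unset NODE_ENV as
  // "development" is an assumption made for this sketch.
  const nodeEnv = env.NODE_ENV || "development";

  // FN_ENV defines the deployment. It falls back to NODE_ENV when unset.
  const fnEnv = env.FN_ENV || nodeEnv;

  // Assumed safety net: fail loudly on a typo such as "stagin"
  // rather than silently building against the wrong services.
  if (!KNOWN_ENVS.includes(fnEnv)) {
    throw new Error(`Unknown FN_ENV: ${fnEnv}`);
  }
  return fnEnv;
}

console.log(resolveFnEnv({ NODE_ENV: "development" })); // development
console.log(resolveFnEnv({ NODE_ENV: "production", FN_ENV: "staging" })); // staging
console.log(resolveFnEnv({ NODE_ENV: "production" })); // production
```

The three calls mirror the three commands above. Passing the variables in as an argument, rather than always reading process.env, keeps the rule easy to check in isolation.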

Separating Configuration

The next step was to pull out all of the variables -- things like the database, API keys and image servers -- into multiple configuration files, one for each environment.
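As a rough sketch of what that separation can look like: the snippet below keeps the three configurations in one object so it runs on its own, where the real setup uses one file per environment. Every hostname, key and field name here is a placeholder I made up, as are the helper names configFor and toDefinitions.

```javascript
// Placeholder values only. None of these are FrontierNav's real settings.
const configs = {
  development: {
    databaseUrl: "http://localhost:9000",
    imageServer: "http://localhost:9001",
    apiKey: "dev-key",
  },
  staging: {
    databaseUrl: "https://db.staging.example.com",
    imageServer: "https://images.staging.example.com",
    apiKey: "staging-key",
  },
  production: {
    databaseUrl: "https://db.example.com",
    imageServer: "https://images.example.com",
    apiKey: "production-key",
  },
};

// Hypothetical helper: picks the configuration for a deployment environment.
function configFor(fnEnv) {
  if (!Object.prototype.hasOwnProperty.call(configs, fnEnv)) {
    throw new Error(`No configuration for FN_ENV "${fnEnv}"`);
  }
  return configs[fnEnv];
}

// Hypothetical helper: builds the object that could be handed to webpack's
// DefinePlugin, e.g. `new webpack.DefinePlugin(toDefinitions(config))`.
// DefinePlugin substitutes values as code, so the config is passed as a
// JSON string. The global name __CONFIG__ is an assumption.
function toDefinitions(config) {
  return { __CONFIG__: JSON.stringify(config) };
}

const definitions = toDefinitions(configFor("staging"));
console.log(definitions.__CONFIG__);
```

Because the values are baked in at build time, each environment gets its own build, which is the trade-off discussed next.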

I could have the configuration fetched at runtime. However, while this would allow a single "production" build to be used for both "staging" and "production" deployments, it also means I'd need a system to deploy and roll back the configuration separately, adding to the complexity.

A solution could be to have a two-step build: one for the common code, and another for each environment which passes a configuration to the first build. I looked into this a bit, but getting it to work, with all the optimisations, code splitting, etc. seemed too complicated. It's easier to manage one build per environment.

The downside to having to build for each environment is that it multiplies the time waiting for builds, which is currently just below 2 minutes each. (I'll go into this in a bit.)

There's also no guarantee that the two builds are identical. There might be bugs in the minifier or some other step that cause a behavioural difference. But these are rare cases that can exist throughout the software pipeline, so it's not worth acting on them until they actually cause problems.

Staging Deployment

Finally, I had to make a deployment as close to production as possible. This meant adding a new domain and another HTTP server configuration. Previously, I was testing the project locally so none of the server configuration (Nginx and Cloudflare) was tested. Now it is!

Optimising Build Times

To reduce Webpack's build times for "staging" and "production", I first looked into caching. Since both of these environments are essentially the same, minus a few variables, caching the results seemed like an obvious start. Caching does however have some issues.

Cache invalidation is always a problem. The cache identifier needs to be accurate to avoid using stale caches between builds. Managing that identifier is really complicated. For example, for Babel, I'll need to know the current version of Babel, all of the plugin versions, the configuration and the browser targets (which are separate from the configuration as they're used by other tools). If anything new comes into play, I'll need to remember to add that to the list. Maintaining all of that would be a headache, and a huge risk if, say, "production" variables go into "staging" and the tests end up polluting the entire database.
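To make the problem concrete, here is a minimal sketch of how such an identifier could be derived: hash everything that can change the output. The function names, the shape of the inputs and the version numbers are all made up for illustration. A string like this is the kind of thing babel-loader's cacheIdentifier option accepts, though I'm not claiming this is how any particular tool computes it.

```javascript
const crypto = require("crypto");

// Serialises a value with object keys sorted, so that two objects with
// the same content always produce the same string.
function stableStringify(value) {
  if (Array.isArray(value)) {
    return `[${value.map(stableStringify).join(",")}]`;
  }
  if (value && typeof value === "object") {
    const entries = Object.keys(value)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${stableStringify(value[key])}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

// Hypothetical helper: hashes every input that affects the build output.
// Anything left out of `inputs` is a stale cache waiting to happen.
function cacheIdentifier(inputs) {
  return crypto
    .createHash("sha256")
    .update(stableStringify(inputs))
    .digest("hex");
}

// Illustrative inputs only. The versions and plugin names are invented.
const inputs = {
  babelVersion: "7.0.0",
  pluginVersions: { "some-babel-plugin": "1.2.3" },
  babelConfig: { presets: ["some-preset"] },
  browserTargets: ["last 2 versions"],
  fnEnv: "staging",
};

console.log(cacheIdentifier(inputs));
```

Including the deployment environment in the inputs is what would stop a "production" cache entry from leaking into a "staging" build. The hard part is not the hashing but remembering to keep the list of inputs complete.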

Second, looking up the cache can take more time than rebuilding a file. A lot of modules in JavaScript are tiny, so for all those tiny files, the overhead of doing a file system lookup can increase rather than decrease the build time for that file.

Overall, I did see a ~10% reduction in build time. However, that was the difference between 110 seconds and 100 seconds: still over a minute, and in real terms, it won't affect my workflow.

Concurrency was another option. By building each file in parallel, I saved around 20 seconds. But again, not really noticeable in real terms.

Real Terms

I mentioned "real terms" when it comes to build times. For me, this asks the question: Does it make me more efficient?

If a release takes more than, say, 30 seconds, I'm going to be doing other things: taking a break, planning the next step, reading up on something, etc. When I'm at that step, I probably won't get back to development for around 5 minutes at the least. So having a release take 60 seconds instead of 300 seconds makes no difference to me. It's still more than 30 seconds and less than 5 minutes. Adding more complexity without any real benefit doesn't make sense.
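Put another way, a speed-up only matters if it moves the release into a different bucket. A toy sketch of that rule, using the 30 second and 5 minute thresholds above (the bucket names and function names are mine):

```javascript
// Hypothetical helper: classifies a release time by how I'd react to it.
function workflowBucket(seconds) {
  if (seconds <= 30) return "wait"; // short enough to sit through
  if (seconds <= 300) return "break"; // I'll go and do something else
  return "background"; // long enough to need a different workflow
}

// A change is only worth its complexity if it changes the bucket.
function changesWorkflow(beforeSeconds, afterSeconds) {
  return workflowBucket(beforeSeconds) !== workflowBucket(afterSeconds);
}

console.log(changesWorkflow(110, 100)); // false (the caching result)
console.log(changesWorkflow(300, 60)); // false
console.log(changesWorkflow(110, 25)); // true
```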

Automated Deployments

There's no doubt that as I introduce more test cases, a release will start taking well beyond 5 minutes. At that point, I'll need to change my workflow. Instead of finding something else to do while a release goes through, I should be able to start on the next thing and let the release alert me when it's done.

To do that, I'll need a continuous integration server, otherwise my open changes would conflict with the release. But maintaining a CI server and introducing a new workflow is a lot of work, so I'll cross that bridge when I get there.

Thanks for reading.