jahed.dev

Fixing NodeJS multi-threading on CircleCI

I've been a part of multiple teams maintaining CircleCI pipelines for the past year or so. We kept running into the same issue: flaky unit test runs in NodeJS projects.

They ran fine locally, but failed randomly in CircleCI. It seemed as though the tests were competing for resources, since a lot of them were old and hit the disk, so we made them run sequentially in CircleCI. In Jest, that's done by adding --runInBand. That always fixed the issue, but it made the tests slower as they no longer used the multiple cores available to us.

In a newer Vue project, we had similar issues. Even after migrating from Jest to Vitest, we encountered them almost daily. In fact, the tests seemed to fail more frequently. These were new tests that hit neither the disk nor the network, so it became even more confusing. We tried increasing machine sizes, to no avail.

After many brief searches across Google and GitHub Issues while we waited for tests to pass, we bit the bullet and made Vitest run sequentially too. In Vitest, that's done by setting maxThreads and minThreads to 1. Again, the tests became reliable, but they were slower. This time, we knew it wasn't our tests, nor resource contention. So what was it?

Looking deeper into where Vitest uses maxThreads and minThreads, there's this code:

import { cpus } from "node:os";

// ...

const threadsCount = ctx.config.watch
  ? Math.max(Math.floor(cpus().length / 2), 1)
  : Math.max(cpus().length - 1, 1);

const maxThreads = ctx.config.maxThreads ?? threadsCount;
const minThreads = ctx.config.minThreads ?? threadsCount;

Knowing this, we set maxThreads and minThreads to what we expected cpus().length to return. On a large instance, that's 4. The tests continued to pass reliably, so cpus().length had to be returning a different number. To confirm, we SSH'd into an instance and ran that line. It gave us 36! But why?
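To see what that difference means in practice, here is a minimal sketch that mirrors the default calculation from the Vitest snippet above. The function name defaultThreadCount is made up for illustration. Its cpuCount parameter stands in for cpus().length and watch stands in for ctx.config.watch.

```javascript
// Mirrors the default thread calculation quoted from Vitest above.
// defaultThreadCount is a hypothetical name, not part of Vitest's API.
function defaultThreadCount(cpuCount, watch = false) {
  return watch
    ? Math.max(Math.floor(cpuCount / 2), 1)
    : Math.max(cpuCount - 1, 1);
}

// 36 cores reported on the CircleCI instance vs. the 4 we expected.
console.log(defaultThreadCount(36)); // 35
console.log(defaultThreadCount(4)); // 3
```

So with the reported 36 cores, a non-watch run would default to 35 threads on a container we expected to have 4 cores.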

A quick Google search led to this CircleCI issue dating back to 2018: "Have nproc accurately reporter number of CPUs available to container". So apparently, CircleCI's nproc returns the number of cores on the host rather than the number available to the container. Other CIs, like Travis CI, ensure nproc returns the container's cores. Mystery solved.

A lot of NodeJS tooling uses cpus().length as a default for parallelism. All of it will start more workers than the container has cores on CircleCI until the issue is fixed. For now, the best solution is to find the relevant configuration and override it.

{
  // On CI, override the default derived from cpus().length.
  maxThreads: process.env.CI ? 4 : undefined,
}

Maintaining CI-specific configuration like this across so many tools is tedious, so hopefully CircleCI gets around to fixing it.
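Until then, one way to cut down on the repetition might be a small shared helper that each tool's config imports. This is only a sketch under assumptions: resolveMaxWorkers and the MAX_WORKERS environment variable are names invented here, not something CircleCI or Vitest defines, and the default of 4 assumes a large instance as above.

```javascript
// Hypothetical helper: resolve a worker limit once and reuse it in each tool's config.
// MAX_WORKERS is an assumed variable name; neither CircleCI nor Vitest defines it.
function resolveMaxWorkers(env = process.env, ciDefault = 4) {
  const override = Number.parseInt(env.MAX_WORKERS ?? "", 10);
  if (Number.isInteger(override) && override > 0) {
    return override;
  }
  // On CI, fall back to the core count of the instance size in use.
  // Elsewhere, return undefined so the tool keeps its own default.
  return env.CI ? ciDefault : undefined;
}

console.log(resolveMaxWorkers({ CI: "true" })); // 4
console.log(resolveMaxWorkers({ CI: "true", MAX_WORKERS: "2" })); // 2
console.log(resolveMaxWorkers({})); // undefined
```

The config above would then read maxThreads: resolveMaxWorkers(), and changing the instance size means updating a single value.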

Thanks for reading.