Don’t just keep trying; report your limits

The other day we had problems with a service dying. It ran out of memory, crashing and failing to respond to all open requests. That service was running analyses of repositories, digging through their files to report on the condition of the code.

It ran out of memory trying to analyze a particular large repository with hundreds of projects within it. This is a monorepo, and Atomist is built to help with microservices — code scattered across many repositories.

This particular day, Rod Johnson and I paired on this problem, and we found a solution that neither of us would have found alone. His instinct was to work on the program, tweaking the way it caches data, until it could handle this particular repository. My reaction was to widen the problem: we’re never going to handle every repository, so how do we fail more gracefully?

The infrastructure of the software delivery machine (the program that runs the analyses) can limit the number of concurrent analyses, but it can’t know how big a particular one will be.

However, the particular analysis can get an idea of how big it was going to be. In this case, one Aspect finds interior projects within the repository under scrutiny. My idea was: make that one special, run it first, and if there are too many projects, decline to do the analysis.

Rod, as a master of abstraction, saw a cleaner way to do it. He added a veto functionality, so that any Aspect can declare itself smart enough to know whether analysis should continue. We could add one that looks at the total number of files, or the size of the files.

We added a step to the analysis that runs these Vetoing Aspects first. We made them return not only “please stop,” but a reason for that stop. Then we put that into the returned analysis.

The result is: for too-large repositories, we can give back a shorter analysis that communicates: “There are too many projects inside this repository, and here is the list of them.” That’s the only information you get, but at least you know why that’s all you got.

And nothing else dies. The service doesn’t crash.

When a program identifies a case it can’t handle and stops, then it doesn’t take out a bunch of innocent-bystander requests. It gives a useful message to humans, who can then make the work easier, or optimize the program until it can handle this case, or add a way to override the precaution. This is a collaborative automation.

When you can’t solve a problem completely, step back and ask instead: can I know when to stop? “FYI, I can’t do this because…” is more useful than OutOfMemoryError.