Don’t just keep trying; report your limits

The other day we had problems with a service dying. It ran out of memory, crashing and failing to respond to all open requests. That service was running analyses of repositories, digging through their files to report on the condition of the code.

It ran out of memory trying to analyze a particularly large repository with hundreds of projects within it. This is a monorepo, and Atomist is built to help with microservices — code scattered across many repositories.

This particular day, Rod Johnson and I paired on this problem, and we found a solution that neither of us would have found alone. His instinct was to work on the program, tweaking the way it caches data, until it could handle this particular repository. My reaction was to widen the problem: we’re never going to handle every repository, so how do we fail more gracefully?

The infrastructure of the software delivery machine (the program that runs the analyses) can limit the number of concurrent analyses, but it can’t know how big a particular one will be.

However, a particular analysis can get an idea of how big it is going to be. In this case, one Aspect finds the interior projects within the repository under scrutiny. My idea was: make that one special, run it first, and if there are too many projects, decline to do the analysis.

Rod, as a master of abstraction, saw a cleaner way to do it. He added a veto functionality, so that any Aspect can declare itself smart enough to know whether analysis should continue. We could add one that looks at the total number of files, or the size of the files.

We added a step to the analysis that runs these Vetoing Aspects first. We made them return not only “please stop,” but a reason for that stop. Then we put that into the returned analysis.
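In code, the idea might look something like this sketch; the names, shapes, and the threshold here are mine for illustration, not Atomist’s actual API:

```typescript
// A sketch of the idea, not the real Atomist API; the names
// (VetoingAspect, AnalysisVeto, RepositoryToAnalyze) are made up.

interface AnalysisVeto {
  veto: true;
  reason: string;     // e.g. "Too many projects inside this repository"
  details?: string[]; // e.g. the list of interior projects
}

interface RepositoryToAnalyze {
  listProjects(): Promise<string[]>;
}

interface VetoingAspect {
  name: string;
  // Runs before the expensive Aspects; a returned veto stops the analysis.
  shouldVeto(repo: RepositoryToAnalyze): Promise<AnalysisVeto | undefined>;
}

const tooManyProjects: VetoingAspect = {
  name: "too-many-projects",
  async shouldVeto(repo) {
    const projects = await repo.listProjects();
    if (projects.length > 100) { // threshold invented for illustration
      return {
        veto: true,
        reason: `There are too many projects inside this repository (${projects.length})`,
        details: projects,
      };
    }
    return undefined;
  },
};

// Run the vetoing Aspects first; on veto, return a shorter analysis
// that says why it stopped, instead of crashing partway through.
async function analyze(repo: RepositoryToAnalyze, vetoers: VetoingAspect[]) {
  for (const aspect of vetoers) {
    const veto = await aspect.shouldVeto(repo);
    if (veto) {
      return { complete: false as const, ...veto };
    }
  }
  return { complete: true as const /* ...full analysis results */ };
}
```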

The result is: for too-large repositories, we can give back a shorter analysis that communicates: “There are too many projects inside this repository, and here is the list of them.” That’s the only information you get, but at least you know why that’s all you got.

And nothing else dies. The service doesn’t crash.

When a program identifies a case it can’t handle and stops, it doesn’t take out a bunch of innocent-bystander requests. It gives a useful message to humans, who can then make the work easier, or optimize the program until it can handle this case, or add a way to override the precaution. This is collaborative automation.

When you can’t solve a problem completely, step back and ask instead: can I know when to stop? “FYI, I can’t do this because…” is more useful than OutOfMemoryError.

Stick with “good enough,” until it isn’t

In business, we want to focus on our core domain, and let everything else be “good enough.” We need accounting, payroll, travel. But we don’t need those to be special if our core business is software for hospitals.

As developers, we want to focus on changing our software, because that is our core work. We want other stuff, such as video conferencing, email, and blog platforms to be “good enough.” It should just work, and get out of our way.

The thing is: “good enough” doesn’t stay good enough. Who wants to use Concur for booking travel? No one. It’s incredibly painful and way behind the modern web applications we use for personal travel. Forcing your people into an outdated travel booking system holds them back and makes recruiting a little harder.

When we rent software as a service, it can keep improving. I shuddered the last time I got invited to a WebEx, but it’s better than it used to be. It’s not as slick as Zoom, but it was fine.

There is a lot of value in continuing with the same product that your other systems and people integrate with, and having it improve underneath you. Switching is expensive, especially in the focus it takes. But it beats keeping the anachronism.

DevOps says, “If it hurts, do it more.” This drives you to improve processes that are no longer good enough. Now and then you can turn a drag into a competitive advantage. Now and then, like with deployment, you find out that what you thought was your core business (writing code) is not core after all. (Operating useful software is.)

Limiting what you focus on is important. Let everything else be “good enough,” but check it every once in a while to make sure it still is. Ask the new employee, “What around here seems out of date compared to other places you’ve worked?” Or try a full week of mob programming, and notice when it gets embarrassing to have six people in the same drudgery.

You might learn something important.

Library vs service: who controls change?

When you have a common piece of functionality to share between two apps, do you make a library for them to share, or break out a service?

The biggest difference between publishing a library and operating a service is: who controls the pace of change.

If you publish a library for people to use, you can put out new versions, but it is up to each application’s team to incorporate that new version. Upgrades are gradual and may never fully happen.

If you operate a service, you control when upgrades happen. You put the new code in production, and poof, everyone is using it. You can upgrade multiple times per day, and you control when each is complete.
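A sketch of the contrast, with invented names: the library consumer pins a version and upgrades when it chooses; the service caller gets whatever is deployed right now.

```typescript
// As a library: each consuming app pins its own version and upgrades
// on its own schedule. In that app's package.json (illustrative name):
//
//   "dependencies": { "@example/pricing": "2.3.1" }
//
// import { price } from "@example/pricing";

// As a service: every caller runs whatever version is in production,
// and the operator controls when that changes. URL is invented.
async function priceViaService(item: string): Promise<number> {
  const res = await fetch(
    `https://pricing.internal.example/price?item=${encodeURIComponent(item)}`
  );
  if (!res.ok) throw new Error(`pricing service: ${res.status}`);
  const body = (await res.json()) as { price: number };
  return body.price;
}
```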

If you need certain logic to be consistent between applications, consider making a service. Control the pace of change.

(For one flipside, see: From Complicated to Complex)

From complicated to complex

Avdi Grimm describes how the book Vehicles illustrates the way simple parts can compose a very complex system.

Another example is Conway’s Game of Life: a nice organized grid, uniform time steps, and four tiny rules. These simple pieces combine to make all kinds of wild patterns, and even look alive!
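For concreteness, the whole rule set fits in one small step function (a sketch; a finite grid, with everything beyond the edges counted as dead):

```typescript
// The entire rule set of Conway's Game of Life: a live cell survives
// with 2 or 3 live neighbors; a dead cell with exactly 3 is born;
// everything else dies or stays dead.

type Grid = boolean[][]; // true = alive

function step(grid: Grid): Grid {
  return grid.map((row, y) =>
    row.map((alive, x) => {
      let n = 0; // count live neighbors, treating out-of-bounds as dead
      for (let dy = -1; dy <= 1; dy++) {
        for (let dx = -1; dx <= 1; dx++) {
          if (dx === 0 && dy === 0) continue;
          n += grid[y + dy]?.[x + dx] ? 1 : 0;
        }
      }
      return alive ? n === 2 || n === 3 : n === 3;
    })
  );
}
```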

These systems are complex: hard to predict; forces interact to produce surprising behavior.

They are not complicated: they do not have a bunch of different parts, nor a lot of little details.

In software, a monolith is complicated. It has many interconnected parts, each full of details. Conditionals abound, forming intricate rules that may surprise us when we read them.

But we can read them.

As Avdi points out, microservices are different. Break that monolith into tiny pieces, and each piece can be simple. Put those pieces together in a distributed system, and … crap. Distributed systems are inherently complex. Interactions can’t be predicted. Surprising behaviors occur.

Complicated programs can be understood; it’s just a lot to hold in your head. Build a good enough model of the relevant pieces, and you can reason out the consequences of a change.

Complex systems can’t be fully understood. We can notice patterns, but we can’t guarantee what will happen when we change them.

Be careful in your aim for simplicity: don’t trade a little complication for a lot of complexity.

A lot of rules

A Swede, on American Football: “Are there any rules? It looks like they stand in two lines, someone throws the ball backwards, and then it’s a big pile.”

Me: “141 pages, last I checked. It takes a lot of rules to look like there aren’t any.”

Later, people talked about error messages. There are so many different ones! … or there should be. Actually they all fall back to “Something went wrong” when the server responds with non-success.

Was it a transient error? Maybe the client should retry, or ask the person to try again later. Was it a partial failure? Maybe the client should refresh its data. Was it invalid input? Please, please tell the person what they can do about it. In all supported languages.
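A sketch of what that could look like on the client, with invented status codes and messages standing in for your API’s real contract:

```typescript
// A sketch of distinguishing error categories instead of one
// "Something went wrong." Status codes and fields are illustrative.

interface ErrorDisplay {
  message: string;     // shown to the person, localized in a real app
  retryable: boolean;  // should the client retry?
  refreshData: boolean; // should the client refetch its local state?
}

function describeFailure(status: number, body: { detail?: string }): ErrorDisplay {
  if (status === 503 || status === 429) {
    // transient: the client can retry, or ask the person to try later
    return { message: "Please try again in a moment.", retryable: true, refreshData: false };
  }
  if (status === 409) {
    // partial failure / conflict: our local data may be stale
    return { message: "This changed underneath you; refreshing.", retryable: false, refreshData: true };
  }
  if (status === 400 || status === 422) {
    // invalid input: tell the person what they can do about it
    return { message: body.detail ?? "Please check your input.", retryable: false, refreshData: false };
  }
  return { message: "Something went wrong.", retryable: false, refreshData: false };
}
```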

The happy path may be the most frequent path, but it is one of thousands through your application. As software gets more complicated, the happy path becomes unlikely. (With 99.9% success on each of a thousand different calls, the all-happy path happens only 37% of the time, even if the person gets everything right.) What makes an app smooth isn’t lack of failure; it’s accommodation of these alternate paths. Especially when people are being human and don’t follow your instructions.
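The arithmetic behind that 37%, as a quick check:

```typescript
// Each of 1,000 distinct calls succeeds 99.9% of the time:
Math.pow(0.999, 1000); // ≈ 0.3677, so the all-happy path is a minority case
```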

Each error is special. Each chance to smooth its handling is precious. People rarely report confusion, so jump on it when they do.

This alternate-path code gets gnarly. You may have 141 pages of it, and growing every year.

It takes a lot of rules to make an app “just work.”

In defense of rationality and dynamic programming

Karl Popper defines rationality as: basing beliefs on logical arguments and evidence. Irrationality is everything else.

He also defines comprehensive rationality as: only logical arguments and evidence are a valid basis for belief. But this belief itself can only be accepted by choice or faith, so comprehensive rationality is self-contradictory. It also excludes a lot of useful knowledge. This, he says, is worse than irrationality.

It reminds me of some arguments for typed functional programming. We must have proof our programs are correct! We must make incorrect programs impossible to express in code that compiles! But this excludes a lot of useful programs. And do we even know what ‘correct’ is? Not in UI development, that’s for sure.

Pure functional programming eschews side effects (like printing output or writing data). Yet it is these side effects that make programs useful, that let them impact the world. Therefore, exclusively pure functional programming is worse than irrationality (all dynamic, side-effecting programming).

Popper argues instead for critical rationality: Start with tentative faith in premises, and then apply logical argument and evidence to see if they hold up. Consider new possible tenets, and see whether they hold up better. This kind of rationality accepts that there is no absolute truth, no perfect knowledge, only better. Knowledge evolves through critical argument. We can’t get to Truth, but we can free ourselves from some falsehoods.

This works in programming too. Sometimes we do know what ‘correct’ is. In those cases, it’s extremely valuable to have code that you know is solid. You know what it does, and even better, you know what it won’t do. Local guarantees of no mutation and no side effects are beautiful. Instead of ‘we don’t mutate parameters around here, so I’ll be surprised if that happens’ we get ‘it is impossible for that to happen so I put it out of my mind.’ This is what functional programmers mean when they talk about reasoning about code. There is no ‘probably’ about it.
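For instance, a couple of readonly annotations in TypeScript turn “we don’t mutate around here” into “the compiler rejects mutation” (a minimal sketch; the Order type is invented):

```typescript
// A local guarantee: the types promise no mutation, so the reader
// can put it out of mind.

interface Order {
  readonly items: readonly string[];
}

// Returns a new Order; it cannot mutate its argument. The compiler
// rejects `order.items.push(item)` outright.
function addItem(order: Order, item: string): Order {
  return { items: [...order.items, item] };
}
```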

Where we can use logical arguments and evidence, let’s. For everything else, there’s JavaScript.

[Drawing: everything is a big circle. Inside are some blobs of nonsense and a nice rectangle of purely functional programs. Some of the area between FP and nonsense is useful.]

Tools → capabilities → acclaim

Here’s a lovely graphic from the folks at Intercom. It describes the difference between what companies sell and what people buy.

Even though customers buy this [skateboard parts]… they really want this [cool skateboard trick].

We don’t want the tool, we want what we can do with the tool. Take it further – maybe what that skateboarder really wants is the high fives at the end.

[Image: a skateboarder getting high fives from the crowd]

Our accomplishments feel real when people around us appreciate them. If the skateboarder’s peers react with “Why would you do that?? You could die! You look like a dumbass,” titanium hardware doesn’t shine so bright.

It reminds me of the DevOps community. Back when coworkers said, “Why are you writing scripts for that? Do you want to put us out of a job?” automation wasn’t so cool. Now there’s a group of peers, at least online, ready to give you high fives for that. The people you meet at conferences, or get familiar with on podcasts, or collaborate on open source tools with — these take automation from laziness into accomplishment.

Giving developers polyurethane wheels and hollow trucks won’t, by itself, let them do tricks. Move to a culture of “our job is to automate ourselves out of better and better jobs.” Give us the high fives (and let us pick our tools), and we will invent new tricks.

Safety and progress

At the Papers We Love conference, Dr. Heidi Howard described the requirements for distributed consensus: safety and progress.

In distributed consensus, multiple servers decide on some value and then report it to their clients. Safety means that clients never learn different values; the consensus is correct and consistent. Progress means that clients do eventually get the value; the system doesn’t get stuck.
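One way to picture the two properties is as checks over a log of what clients have learned (my illustration, not Dr. Howard’s formalization):

```typescript
// Illustrative only: safety and progress as predicates over a log
// of client observations.

type Decision = { client: string; value: string };

// Safety: no two clients ever learn different values.
function safe(log: Decision[]): boolean {
  return new Set(log.map(d => d.value)).size <= 1;
}

// Progress: every client eventually learns some value.
function progressed(log: Decision[], clients: string[]): boolean {
  return clients.every(c => log.some(d => d.client === c));
}
```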

There are lots of ways to guarantee safety. The trick is to find ones that allow progress in all circumstances.

It reminds me of the conflict between security and development. Security teams are responsible for safety: prevention of bad things. Development teams are responsible for progress: making good things happen.

Separate these two responsibilities and you get deadlock. The obvious ways to get safety prevent progress, and the fastest routes to progress erode safety.

Algorithms, processes, designs that give you progress and safety exist, but they’re subtle. You won’t find them by fighting with each other.

In distributed consensus, the algorithm designer holds responsibility for safety and progress. To build the features that advance your business with the security that keeps it safe, put these responsibilities on the same team.

Safety and progress can be at odds. Don’t bake this conflict into your org structure.

Keep them entwined. Guarantee safety while allowing progress.

Moving beyond simplicity

We prefer simple models to complicated ones. Circles to ellipses, a single ancestor to a soup. But is that really because the simple explanation is more likely?

“The value of keeping assumptions to a minimum is cognitive.”

Philip Ball, The Tyranny of Simple Explanations

Simpler theories are more useful because we can think about them more easily. We can pass them from person to person.

But now we have software, which can encapsulate complicated theories almost as easily as simple ones. We can package them up into products and make use of them, without taking up brainspace with all those complications.

Maybe you still find simple models most likely in physics. Simple models are not accurate in biology – much less in human interactions.

In biology, everything going on had some reason it got that way (it might be randomness) and many reasons it stays that way. Most adaptations are exaptive — used for some purpose other than their original one. Most have many purposes: the proteins participate in multiple pathways, the fur is warm and also distinguishing, the brain anticipates danger and builds social structure.

In our own lives, every action has multiple stories. I’m playing this game to relax, I’m playing this game to avoid writing, I’m playing this game to learn from it.

We choose based on probability spaces from many models, stacked on top of each other. I buy this dress because I feel powerful in it, because I deserve it, because my date will like it, because I am sloppy with money. I value this relationship because I don’t know how to be alone, because they’re a kind person, because we belong together, because he reminds me of the father figure from my childhood.

This is right, this is healthy, to have all these stories. In real life, the simple explanation is never complete. Taking action for exactly one reason is unnatural. Like gaming a metric, it fails to account for the whole rest of the system.

Everything we do in a complex system has many effects, so it is right that we have many overlapping reasons. This doesn’t make life easy, only real.

When I started in software I loved that it was simple and predictable. Now I love that it is complex and messy, because that’s where I learn the important stuff.

Don’t know. Find out.

Legacy code is software that is no longer alive in someone’s head. No person has a mental model of how it works, no one has the stories of why.

This is already the common case in software systems, and it will only grow more normal.

Understanding an entire piece of the system is a luxury. We get it while we build it, but after a while, that knowledge fades too. Then, we reconstruct portions of the model we need for each change or fix.

As software grows, it is unrealistic to expect a complete mental model of it to exist, even across a team or company. Instead, aim for the ability to construct what we need as we need it.

Readable code is only the beginning. We need observability, discoverability, and operator interfaces that teach. We need to see what’s going on live. Only once we narrow in on our target must we mentally simulate the runtime by reading code.
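As one small example of an operator interface that teaches, a status endpoint can report what the service is doing right now instead of a bare “OK” (a sketch using Express; every name here is invented):

```typescript
// Illustrative sketch: a status endpoint that helps an operator
// reconstruct the piece of the mental model they need right now.
import express from "express";

const app = express();
const analysesInFlight = new Map<string, { repo: string; startedAt: Date }>();

app.get("/status", (_req, res) => {
  res.json({
    inFlight: [...analysesInFlight.values()], // what is it working on?
    heapUsedBytes: process.memoryUsage().heapUsed, // how close to trouble?
  });
});

app.listen(8080);
```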

This is true at a large scale, too.

From early in Jeff Sussna’s book, Designing Delivery: “so-called rogue or shadow IT, where business units procure cloud-based IT services without the participation of a centralized IT department, makes it harder to control or even know which data lives where.”

Yeah! You won’t know where all your data lives off the top of your head, nor will it be recorded in a binder somewhere. You will know how to find out. And you’ll know that each location of data meets the security and recoverability needs of its business unit.

Let’s stop asking “how do we know?” We don’t know, usually. Instead, ask “how will we see?”