Don’t just keep trying; report your limits

The other day we had problems with a service dying. It ran out of memory and crashed, dropping every open request. That service was running analyses of repositories, digging through their files to report on the condition of the code.

It ran out of memory trying to analyze a particularly large repository with hundreds of projects inside it. This was a monorepo, and Atomist is built to help with microservices: code scattered across many repositories.

This particular day, Rod Johnson and I paired on this problem, and we found a solution that neither of us would have found alone. His instinct was to work on the program, tweaking the way it caches data, until it could handle this particular repository. My reaction was to widen the problem: we’re never going to handle every repository, so how do we fail more gracefully?

The infrastructure of the software delivery machine (the program that runs the analyses) can limit the number of concurrent analyses, but it can’t know how big a particular one will be.

However, the particular analysis can get an idea of how big it is going to be. In this case, one Aspect finds interior projects within the repository under scrutiny. My idea was: make that one special, run it first, and if there are too many projects, decline to do the analysis.

Rod, as a master of abstraction, saw a cleaner way to do it. He added veto functionality, so that any Aspect can declare itself smart enough to know whether analysis should continue. We could add one that looks at the total number of files, or the size of the files.

We added a step to the analysis that runs these Vetoing Aspects first. We made them return not only "please stop," but also a reason for stopping. Then we put that reason into the returned analysis.

The result: for too-large repositories, we can give back a shorter analysis that communicates, "There are too many projects inside this repository, and here is the list of them." That's the only information you get, but at least you know why that's all you got.
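Here's a minimal sketch of the shape of that idea, in TypeScript. The names (`VetoingAspect`, `shouldVeto`, the 50-project threshold) are my own illustration, not the actual Atomist API:

```typescript
// Hypothetical types, for illustration only.
interface Veto {
  aspect: string;
  reason: string;      // e.g. "There are too many projects inside this repository"
  details?: string[];  // e.g. the list of interior projects
}

interface RepositoryToAnalyze {
  listProjects(): Promise<string[]>;
}

interface VetoingAspect {
  name: string;
  // Look at the repository cheaply and decide whether full analysis should proceed.
  shouldVeto(repo: RepositoryToAnalyze): Promise<Veto | undefined>;
}

// One such aspect: count interior projects before doing anything expensive.
const tooManyProjects: VetoingAspect = {
  name: "virtual-project-count",
  async shouldVeto(repo) {
    const projects = await repo.listProjects();
    if (projects.length > 50) {   // threshold invented for the example
      return {
        aspect: "virtual-project-count",
        reason: `There are too many projects inside this repository (${projects.length}).`,
        details: projects,
      };
    }
    return undefined;
  },
};

// Run the vetoing aspects first; if any object, return a short analysis
// that carries the reason, instead of crashing halfway through the full one.
async function analyze(repo: RepositoryToAnalyze, vetoers: VetoingAspect[]) {
  for (const vetoer of vetoers) {
    const veto = await vetoer.shouldVeto(repo);
    if (veto) {
      return { complete: false, vetoes: [veto] };
    }
  }
  return { complete: true /* ...run the full set of Aspects here... */ };
}
```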

And nothing else dies. The service doesn’t crash.

When a program identifies a case it can't handle and stops, it doesn't take out a bunch of innocent-bystander requests. It gives a useful message to humans, who can then make the work easier, or optimize the program until it can handle this case, or add a way to override the precaution. This is collaborative automation.

When you can’t solve a problem completely, step back and ask instead: can I know when to stop? “FYI, I can’t do this because…” is more useful than OutOfMemoryError.

A lot of rules

A Swede, on American Football: “Are there any rules? It looks like they stand in two lines, someone throws the ball backwards, and then it’s a big pile.”

Me: “141 pages, last I checked. It takes a lot of rules to look like there aren’t any.”

Later, people talked about error messages. There are so many different ones! … or there should be. In practice they all fall back to "Something went wrong" when the server responds with anything other than success.

Was it a transient error? Maybe the client should retry, or ask the person to try again later. Was it a partial failure? Maybe the client should refresh its data. Was it invalid input? Please, please tell the person what they can do about it. In all supported languages.
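A sketch of what that could look like on the client, assuming a hypothetical error payload with a `kind` field and an already-localized, human-readable message (none of this is from a real API):

```typescript
// Hypothetical error contract; a real one would come from your API.
type ApiError =
  | { kind: "transient"; retryAfterMs: number }  // try again later
  | { kind: "stale" }                            // partial failure: refresh our data
  | { kind: "invalid-input"; message: string }   // localized for the person
  | { kind: "unknown" };

async function explainFailure(err: ApiError): Promise<string> {
  switch (err.kind) {
    case "transient":
      return `That didn't go through. Please try again in ${Math.ceil(err.retryAfterMs / 1000)} seconds.`;
    case "stale":
      await refreshLocalData(); // hypothetical helper: re-fetch what the client displays
      return "Someone else changed this. We've reloaded it; please re-apply your edit.";
    case "invalid-input":
      return err.message;       // tell the person what they can do about it
    default:
      return "Something went wrong."; // the last resort, not the only answer
  }
}

async function refreshLocalData(): Promise<void> {
  /* re-fetch whatever the client is showing */
}
```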

The happy path may be the most frequent path, but it is one of thousands through your application. As software gets more complicated, the happy path becomes unlikely. (With 99.9% success on each of a thousand different calls, the all-happy path happens only 0.999^1000 ≈ 37% of the time, and that's assuming the person gets everything right.) What makes an app smooth isn't lack of failure, it's accommodation of these alternate paths. Especially when people are being humans and don't follow your instructions.
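The arithmetic behind that parenthetical, for the skeptical:

```typescript
// Chance that all 1,000 independent steps succeed, if each succeeds 99.9% of the time.
const happyPath = Math.pow(0.999, 1000);
console.log(happyPath.toFixed(2)); // "0.37"
```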

Each error is special. Each chance to smooth its handling is precious. People rarely report confusion, so jump on it when they do.

This alternate-path code gets gnarly. You may have 141 pages of it, and growing every year.

It takes a lot of rules to make an app “just work.”

GOTO Amsterdam: Respect the past, renew the present

GOTO Amsterdam started with a retrospective on Java, and ended with the admonition that even Waterfall was an advancement in its time. The conference encouraged building on the past, renewing and adding for growth.

As our bodies renew themselves, replacing cells so we're never quite the same organism, so our software wants renewal and gradual improvement: small, frequent changes, all the way out to deployment. Chad Fowler reports that the average uptime of a node at Wunderkind is measured in hours, and he'd like to make it shorter. Code modules are small, data is small, and servers are small. Components fail and the whole continues. Don't optimize the time between failures: optimize the time to recovery. And monitor! Testing helps us develop with confidence, but monitoring lets us deploy with confidence.

At Etsy as well, changes are small and deployments frequent. Bits of features are deployed to production before they're ever activated. Then they're activated gradually, in a process separate from deployment, and carefully monitored. Etsy has one giant monolithic PHP app, and yet they've achieved fifty deployments a day with great uptime. Monitoring removes fear: "It's not about how often you deploy your app. It's: do you feel comfortable deploying from trunk, right now?"
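A sketch of that separation between deploying and activating, with invented flag and metric helpers; Etsy's real tooling is their own:

```typescript
// Hypothetical feature flag: the new code ships dark, and the rollout percentage
// is a config change, not a deploy.
function isEnabled(flag: string, userId: string, rolloutPercent: number): boolean {
  // Deterministic bucketing, so a given user gets a consistent experience.
  const bucket = hashString(`${flag}:${userId}`) % 100;
  return bucket < rolloutPercent;
}

function hashString(s: string): number {
  let h = 0;
  for (const c of s) {
    h = (h * 31 + c.charCodeAt(0)) >>> 0;
  }
  return h;
}

// Invented monitoring hook and page renderers, just to show the shape.
function recordMetric(name: string): void { console.log("metric:", name); }
function renderNewCheckout(): string { return "<new checkout>"; }
function renderOldCheckout(): string { return "<old checkout>"; }

function checkoutPage(userId: string): string {
  if (isEnabled("new-checkout", userId, 5)) {  // start at 5%, watch the dashboard
    recordMetric("new-checkout.rendered");
    return renderNewCheckout();
  }
  recordMetric("old-checkout.rendered");
  return renderOldCheckout();
}
```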

That doesn’t happen all at once. It builds up through careful tooling and through careful consideration of each production outage, in post-mortems where everyone listens and root causes are investigated to levels impossible in a blame-assigning culture.  Linda Rising said, “The real problem in our organizations is nobody wants to talk about how we might be doing things wrong.” At Etsy, they talk about failure.

Even as we’re deploying small changes and gradually improving our code, there’s another way our code can renew and grow: the platform underneath.  Georges Saab told us part of the magic of Java, the way the JIT compiler team works with the people creating the latest hardware. By the time that hardware is released, the JVM is optimized for the latest improvements. Even beyond the platform, Java developers moved the industry away from build-it-yourself toward finding an open-source solution, building on the coding and testing and design efforts of others. And if we update those libraries, they’re renewing as well. We are not doing this alone.

And now in Java 8, there are even more opportunities for library-level optimization, as Stream processing raises the level of abstraction, letting us declare our intentions with a lambda expression instead of specifying the steps. Tell the language what you want it to do, not how, and it can optimize. Dan North used this technique back when he invented DevOps (I’m mostly kidding): look at the outcome you want, and ask how to get there. The steps you’ve used before are clues, not the plan.

Yet be careful with higher levels of abstraction: Horia Dragomir reminded us that they can also hurt performance, for instance when the same code compiles for both Android and iPhone. There's a Japanese concept called bokeh (pronounced like bouquet): blurring parts of an image to bring others into focus. Abstraction can do that for us, if we're as careful as the photographer.

In the closing keynote, Linda Rising reminded us, to our chagrin: people don’t make decisions based on data. We make decisions based on stories. What stories are we telling ourselves and each other? Do our processes really work? There aren’t empirical studies about our precise situation. The best we can do is to keep trying new tweaks and different methods, and find out what works for us. Like a baby putting everything in their mouth.

We can acquire more data, and choose to use it for better decisions. At Etsy every feature implementation comes with monitoring: How will you know it's working? How will you know if it breaks? Each feature has a dashboard. And then in the post-mortems, a person can learn "how immensely hard it is to fight biases." If we discard blame, we can reveal our mistakes, and build on each other's experiences.

Overcome fear: experience the worst-case scenario. Keep changing our code, and ourselves: “As a person, if you can’t change, you might as well be dead.” It’s OK to be wrong, when you don’t keep being wrong.

As Horia said, “You’re there, you’re on the shoulders of giants. You need to do your own thing now. Add your own twist.”

——————
This post is based on talks by Linda Rising, Chad Fowler, Georges Saab and Paul Sandoz, Horia Dragomir, and Daniel Schauenberg, and on conversations with Silvana Wasitova and Kevlin Henney, all at GOTO Amsterdam 2014. Some of these talks may be online eventually.

How to succeed without knowing how to succeed

Nature achieves more than any human mind conceives. The powers of predictive models and reasoning are dwarfed by a system without a brain. Why? If we understand this, we can achieve such greatness in our teams, and in our codebases.

Here’s an abstraction. Three components (and billions of years) take us from amino acids to human beings:
– some level of random variation,
– a way to recognize “better” and keep those variations around, and
– no real cost to the system when a variation is worse.

In nature, mutation is the variation, and natural selection keeps more of “better” around than “worse.” If a particular organism is less fit and dies, the system doesn’t care. That cost is minute, while the potential benefits of progress are large. Every generation is at least as fit as the one before. On and on.

In a reasonably free market, there are all kinds of people jumping in, there’s competition to keep a “better” business around, and one company going under doesn’t bother the system. New entrants learn from the successes before them. Growth.

There is no requirement that anybody predict what will succeed. Only a way to recognize success when it happens, and go with it. Costs of failure are borne by individuals, while success benefits the whole system.

We can set up these conditions. Methods based on MCMC (Markov chain Monte Carlo) for solving complicated problems use exactly this. Take a problem too complex to solve explicitly, but with a way to calculate how good a particular solution is. Example: image recognition, building a model for a scene. Start with some guess (“there’s a tree in the middle”), generate an image from the model, compare it with the real one. Tweak the model. Did it get better? Keep it. Did it get worse? Probably discard that change and try again. [1] The result ratchets closer and closer to the real solution. The algorithm recognizes success and discards failure quickly. Discovery.
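A toy version of that loop, in TypeScript. The scene model is replaced by a vector of numbers and the image comparison by a distance to a hidden target, so the mechanics fit in a few lines; the keep-a-worse-one step is the detail footnote [1] mentions:

```typescript
// Toy stand-in for the real score: how well does the rendered image match the
// photograph? Here, just negative squared distance to a hidden target vector.
const target = [3, -1, 4, 1, 5];
const score = (m: number[]) =>
  -m.reduce((sum, x, i) => sum + (x - target[i]) ** 2, 0);

// Random variation: nudge one coordinate a little.
function tweak(m: number[]): number[] {
  const copy = [...m];
  const i = Math.floor(Math.random() * copy.length);
  copy[i] += Math.random() - 0.5;
  return copy;
}

let current = [0, 0, 0, 0, 0]; // the initial guess ("there's a tree in the middle")
for (let step = 0; step < 10_000; step++) {
  const candidate = tweak(current);
  const delta = score(candidate) - score(current);
  // Keep improvements. Occasionally keep a worse one, with a probability that
  // shrinks the worse it is, to escape local optima (footnote [1]).
  if (delta > 0 || Math.random() < Math.exp(delta / 0.1)) {
    current = candidate;
  }
}
console.log(current); // ratchets toward [3, -1, 4, 1, 5] over many steps
```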

These three examples show problems insoluble by reasoning or prediction, surmounted by recognizing success when it happens, repeatedly. Nature and the market use competition to recognize success, but that is not the only way. MCMC uses a calculation, comparing new results only to previous results. We can do the same when we have a single sample to optimize — one app or one team.

We can set up these conditions for code. At Etsy, there’s a monolithic PHP web app.[4] That doesn’t sound easy to keep reliable under change. Yet, they’ve got this figured out. Lots of developers are making changes, deploying 50x/day. Every feature includes monitoring that tells people whether it’s working, and whether it’s broken. This shows success. And if a deploy breaks something, people find out quickly. They can roll it back, or fix it, right away. Many changes; successes kept; failures mitigated. The app improves over time, with no grand scheme or Architect of Doom watching over it. Productivity.

There’s another element necessary when people are the ones making all the little changes. If failure has a high cost to the individual, then there’s an incentive to hold still. But we need those little variations to keep coming, nudging us, on net, toward success. Etsy has removed the cost of failure from the individual by keeping a blameless culture. They treat any outage as a learning opportunity for the whole company, and they do not punish the unfortunate developer whose deploy triggered it. There’s safety there.

In the market, this safety is the LLC: you can lose your business but keep your house. In nature or in MCMC, the organism or parameter set doesn’t vary voluntarily, so no disincentive exists.

No project plan or organization chart can reach the potential of an agile team that takes Linda Rising’s advice[2]. She said: every week in the retro, pick one thing to change about how the team works. Tweak it, try something new. If it doesn’t help, go back after a week. If it makes your team work better, keep it. Each week, the team is at least as good as the week before. Excellence.

The Romans didn’t develop their political system through a grand plan, according to Nassim Taleb in Antifragile. They did it by tinkering. Taleb calls the property of “Recognize success and win; discard failure and lose nothing” optionality. Combine optionality and randomness for unlimited success.[3]

What does this mean?

It means our apps don’t have to be beautifully architected if they’re well instrumented.

It means that the most important part of treating failure as a source of learning may well be removing failure as a source of persecution.

Removing fear of failure lets people try different ways of doing things. Metrics help us recognize the right ones, the variations to keep. Monitoring and quick rollbacks make failures cheap at the system level. Maybe these three things, and time, are all we need to build complex software more useful than we can imagine.

—–
[1] In MCMC, you sometimes keep a worse solution, with a probability based on how much worse it is. That gets you out of local optima.
[2] Linda Rising’s closing keynote to GOTO Amsterdam 2014.
[3] This post comes out of this section in Antifragile. The book is annoying, but the ideas in it are crucial.
[4] Daniel Schauenberg’s presentation at GOTO Amsterdam 2014.

Weakness and Vulnerability

Weakness and vulnerability are different. Separate the concerns: [1]

Vulnerability is an openness to being wounded.
Weakness is an inability to live through wounds.

In D&D terms: vulnerability is a low armor class, weakness is low hit points. Armor class determines how hard it is for an enemy to hit you, and hit points determine how many hits you can take. So you have a choice: prevent hits, or endure more hits.

If you try to make your software perfect, so that it never experiences a failure, that’s a high armor class. That’s aiming for invulnerability.

Thing is, in D&D, no matter how high your armor class, if the enemy makes a perfect roll (a 20 on a d20, a twenty-sided die), that’s a critical hit and it strikes you. Even if your software is bug-free, hardware goes down or misbehaves.

If you’ve spent all your energy on armor class and little on hit points, that single hit can kill you.

Embrace failure by letting go of ideal invulnerability, and think about recovery instead. I could implement signal handlers, and maintain them, which is a huge pain and makes my code ugly. Or I could implement a separate cleanup mechanism for crashed processes. That’s a separation of concerns, and it’s more robust: signal handlers don’t help when the app is out of memory; a separate recovery does.

In the software I currently work on, I take the strategy of building safety nets at the application, process, subsystem, and module levels, as feasible.[3] Then, while I try to get my code right, I don’t convolute it by checking for hardware failures, network failures, bad data, and every error I can conceive of. There will always be errors I don’t conceive of. Fail gracefully, and pick up the pieces.
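As one concrete example of a safety net at the module level, here's a sketch of a small supervisor in TypeScript; the names and the retry policy are invented, and footnote [3] points at the actor-model version of the same idea:

```typescript
// Run a unit of work; if it dies for any reason, clean up and restart it,
// up to a limit. The work itself stays free of defensive clutter.
async function supervise<T>(
  name: string,
  work: () => Promise<T>,
  cleanup: () => Promise<void>,
  maxAttempts = 3,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await work();
    } catch (err) {
      console.error(`${name} failed (attempt ${attempt}):`, err);
      await cleanup(); // pick up the pieces
      if (attempt >= maxAttempts) {
        throw err;     // report the limit instead of retrying forever
      }
    }
  }
}

// Usage (hypothetical functions): the analysis just does its job; the
// supervisor handles the wreckage.
// await supervise("repo-analysis", () => analyzeRepository(repo), () => releaseTempFiles());
```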

—–
An expanded version of this post, adding the human element, is on True in Software, True in Life.

—–
[1] Someone tweeted a quote from some book on this, on the difference between weakness and vulnerability, a few weeks ago and it clicked with me. I can’t find the tweet or the quote anymore. Anyone recognize this?
[3] The actor model (Akka in my case) helps with recovery. It implements “Have you restarted your computer?” at the small scale.