GOTO Amsterdam: Respect the past, renew the present

GOTO Amsterdam started with a retrospective on Java, and ended with the admonition that even Waterfall was an advancement in its time. The conference encouraged building on the past, renewing and adding for growth.

As our bodies renew themselves, replacing cells so that we’re never quite the same organism, so our software wants renewal and gradual improvement: small, frequent changes, all the way out to deployment. Chad Fowler reports that the average uptime of a node at Wunderkind is measured in hours, and he’d like to make it shorter. Code modules are small, data is small, and servers are small. Components fail and the whole continues. Don’t optimize the time between failures: optimize the time to recovery. And monitor! Testing helps us develop with confidence, but monitoring lets us deploy with confidence.

At Etsy as well, changes are small and deployments frequent. Bits of features are deployed to production before they’re ever activated. Then they’re activated gradually, a separate process from deployment, and carefully monitored. Etsy has one giant monolithic PHP app, and yet they deploy fifty times a day with great uptime. Monitoring removes fear: “It’s not about how often you deploy your app. It’s do you feel comfortable deploying from trunk, right now?”
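The deploy-dark-then-ramp pattern is simple enough to sketch. Here’s a minimal, hypothetical percentage-based feature flag in Java — not Etsy’s actual tooling, just the shape of the idea: code ships dark, and activation ramps up (or back down) separately from deployment.

```java
// Minimal sketch of a percentage-based feature flag: code ships dark,
// then activation ramps up separately from deployment.
public class FeatureFlag {
    private final String name;
    private volatile int percentEnabled; // 0 = deployed but dark, 100 = fully on

    public FeatureFlag(String name, int percentEnabled) {
        this.name = name;
        this.percentEnabled = percentEnabled;
    }

    // Bucket users deterministically so each user sees a consistent experience
    // as the percentage ramps up.
    public boolean isEnabledFor(String userId) {
        int bucket = Math.floorMod(userId.hashCode(), 100);
        return bucket < percentEnabled;
    }

    // Operators move this dial while watching the dashboards.
    public void rampTo(int percent) { this.percentEnabled = percent; }

    public static void main(String[] args) {
        FeatureFlag newCheckout = new FeatureFlag("new-checkout", 0);
        System.out.println(newCheckout.isEnabledFor("alice")); // dark: false
        newCheckout.rampTo(100);
        System.out.println(newCheckout.isEnabledFor("alice")); // fully on: true
    }
}
```

The point of the separate `rampTo` step is that activation becomes its own carefully monitored process, decoupled from the act of deploying code.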

That doesn’t happen all at once. It builds up through careful tooling and through careful consideration of each production outage, in post-mortems where everyone listens and root causes are investigated to levels impossible in a blame-assigning culture.  Linda Rising said, “The real problem in our organizations is nobody wants to talk about how we might be doing things wrong.” At Etsy, they talk about failure.

Even as we’re deploying small changes and gradually improving our code, there’s another way our code can renew and grow: the platform underneath.  Georges Saab told us part of the magic of Java, the way the JIT compiler team works with the people creating the latest hardware. By the time that hardware is released, the JVM is optimized for the latest improvements. Even beyond the platform, Java developers moved the industry away from build-it-yourself toward finding an open-source solution, building on the coding and testing and design efforts of others. And if we update those libraries, they’re renewing as well. We are not doing this alone.

And now in Java 8, there are even more opportunities for library-level optimization, as Stream processing raises the level of abstraction, letting us declare our intentions with a lambda expression instead of specifying the steps. Tell the language what you want it to do, not how, and it can optimize. Dan North used this technique back when he invented DevOps (I’m mostly kidding): look at the outcome you want, and ask how to get there. The steps you’ve used before are clues, not the plan.
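For instance (my own toy example, not one from the talks), compare spelling out the steps in a loop with declaring the intent through a stream. The stream version tells the library what we want, leaving it free to fuse, reorder, or parallelize the steps:

```java
import java.util.Arrays;
import java.util.List;

// Declaring intent with a lambda: say what you want, not how to loop.
public class StreamIntent {
    // Imperative: spell out every step.
    static int sumOfEvenSquaresLoop(List<Integer> nums) {
        int total = 0;
        for (int n : nums) {
            if (n % 2 == 0) total += n * n;
        }
        return total;
    }

    // Declarative: state the outcome; the library chooses the steps.
    static int sumOfEvenSquaresStream(List<Integer> nums) {
        return nums.stream()
                   .filter(n -> n % 2 == 0)
                   .mapToInt(n -> n * n)
                   .sum();
    }

    public static void main(String[] args) {
        List<Integer> nums = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println(sumOfEvenSquaresLoop(nums));   // 20
        System.out.println(sumOfEvenSquaresStream(nums)); // 20
    }
}
```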

Yet be careful with higher levels of abstraction: Horia Dragomir reminded us that they can also hurt performance, as when the same code compiles for both Android and iPhone. There’s a Japanese photographic concept called bokeh (pronounced like bouquet): blurring parts of an image to bring others into focus. Abstraction can do that for us, if we’re as careful as the photographer.

In the closing keynote, Linda Rising reminded us, to our chagrin: people don’t make decisions based on data. We make decisions based on stories. What stories are we telling ourselves and each other? Do our processes really work? There aren’t empirical studies about our precise situation. The best we can do is to keep trying new tweaks and different methods, and find out what works for us. Like a baby putting everything in their mouth.

We can acquire more data, and choose to use it for better decisions. At Etsy every feature implementation comes with monitoring: How will you know it’s working? How will you know if it breaks? Each feature has a dashboard. And then in the post-mortems, a person can learn “how immensely hard it is to fight biases.” If we discard blame, we can reveal our mistakes, and build on each other’s experiences.

Overcome fear: experience the worst-case scenario. Keep changing our code, and ourselves: “As a person, if you can’t change, you might as well be dead.” It’s OK to be wrong, when you don’t keep being wrong.

As Horia said, “You’re there, you’re on the shoulders of giants. You need to do your own thing now. Add your own twist.”

This post is based on talks by Linda Rising, Chad Fowler, Georges Saab and Paul Sandoz, Horia Dragomir, and Daniel Schauenberg, and on conversations with Silvana Wasitova and Kevlin Henney, all at GOTO Amsterdam 2014. Some of these talks may be online eventually.

How to succeed without knowing how to succeed

Nature achieves more than any human mind conceives. The powers of predictive models and reasoning are dwarfed by a system without a brain. Why? If we understand this, we can achieve such greatness in our teams, and in our codebases.

Here’s an abstraction. Three components (and billions of years) take us from amino acids to human beings:
– some level of random variation,
– a way to recognize “better” and keep those variations around, and
– no real cost to the system when a variation is worse.

In nature, mutation is the variation, and natural selection keeps more of “better” around than “worse.” If a particular organism is less fit and dies, the system doesn’t care. That cost is minute, while the potential benefits of progress are large. Every generation is at least as fit as the one before. On and on.

In a reasonably free market, all kinds of people jump in, competition keeps a “better” business around, and one company going under doesn’t bother the system. New entrants learn from the successes before them. Growth.

There is no requirement that anybody predict what will succeed. Only a way to recognize success when it happens, and go with it. Costs of failure are borne by individuals, while success benefits the whole system.

We can set up these conditions. Methods based on MCMC (Markov chain Monte Carlo) for solving complicated problems use exactly this. Take a problem too complex to solve explicitly, but with a way to calculate how good a particular solution is. Example: image recognition, building a model for a scene. Start with some guess (“there’s a tree in the middle”), generate an image from the model, and compare it with the real one. Tweak the model. Did it get better? Keep it. Did it get worse? Probably discard that change and try again.[1] The result ratchets closer and closer to the real solution. The algorithm recognizes success and discards failure quickly. Discovery.
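Here’s a toy version of that ratchet in Java — my illustration with invented numbers, a one-dimensional stand-in for the scene model rather than real image recognition. It proposes a small random tweak, keeps it when the “badness” score improves, and occasionally keeps a worse one (the Metropolis-style rule from [1]) so it can escape local optima:

```java
import java.util.Random;

// Toy ratchet: random variation, recognize "better", near-zero cost for "worse".
public class RatchetSearch {
    // Stand-in scoring function: how far we are from the unknown truth (3.0).
    static double badness(double x) { return Math.abs(x - 3.0); }

    static double search(long seed, int steps) {
        Random rng = new Random(seed);
        double current = 0.0; // initial guess
        for (int i = 0; i < steps; i++) {
            double proposal = current + rng.nextGaussian() * 0.1; // random variation
            double delta = badness(proposal) - badness(current);
            // Keep improvements; keep a worse proposal only with small probability.
            if (delta < 0 || rng.nextDouble() < Math.exp(-delta * 50)) {
                current = proposal;
            }
        }
        return current;
    }

    public static void main(String[] args) {
        // Ratchets from the initial guess toward the truth, step by step.
        System.out.printf("settled near %.2f%n", search(42, 10_000));
    }
}
```

Nothing in the loop predicts where the answer is; it only recognizes “better” when it stumbles onto it, and that is enough.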

These three examples show problems insoluble by reasoning or prediction, surmounted by recognizing success when it happens, repeatedly. Nature and the market use competition to recognize success, but that is not the only way. MCMC uses a calculation, comparing new results only to previous results. We can do the same when we have a single sample to optimize — one app or one team.

We can set up these conditions for code. At Etsy, there’s a monolithic PHP web app.[4] That doesn’t sound easy to keep reliable under change. Yet, they’ve got this figured out. Lots of developers are making changes, deploying 50x/day. Every feature includes monitoring that tells people whether it’s working, and whether it’s broken. This shows success. And if a deploy breaks something, people find out quickly. They can roll it back, or fix it, right away. Many changes; successes kept; failures mitigated. The app improves over time, with no grand scheme or Architect of Doom watching over it. Productivity.

There’s another element necessary when people are the ones making all the little changes. If failure has a high cost to the individual, then the incentive is to hold still. But we need those little variations to keep nudging the system toward success. Etsy has removed the cost of failure from the individual by keeping a blameless culture. They treat any outage as a learning opportunity for the whole company, and they do not punish the unfortunate developer whose deploy triggered it. There’s a safety there.

In the market, this safety is the limited-liability company: you can lose your business but keep your house. In nature or MCMC, the organism or parameter set doesn’t vary voluntarily, so no disincentive exists.

No project plan and organization chart can reach the potential of an agile team, when the team takes Linda Rising’s advice[2]. She said, every week in the retro, pick one thing to change about how the team works. Tweak it, try something new. If it doesn’t help, go back after a week. If it makes your team work better, keep it. Each week, the team is at least as good as the week before. Excellence.

The Romans didn’t develop their political system through a grand plan, according to Nassim Taleb in Antifragile. They did it by tinkering. Taleb calls the property of “Recognize success and win; discard failure and lose nothing” optionality. Combine optionality and randomness for unlimited success.[3]

What does this mean?

It means our apps don’t have to be beautifully architected if they’re well instrumented.

It means that the most important part of treating failure as a source of learning may well be removing failure as a source of persecution.

Removing fear of failure lets people try different ways of doing things. Metrics help us recognize the right ones, the variations to keep. Monitoring and quick rollbacks make failures cheap at the system level. Maybe these three things, and time, are all we need to build complex software more useful than we can imagine.

[1] In MCMC, you sometimes keep a worse solution, with a probability based on how much worse it is. That gets you out of local optima.
[2] Linda Rising’s closing keynote to GOTO Amsterdam 2014.
[3] This post comes out of this section in Antifragile. The book is annoying, but the ideas in it are crucial.
[4] Daniel Schauenberg’s presentation at GOTO Amsterdam 2014

Data Flows One Way

At QCon NY, Adam Ernst talked about the way Facebook is rewriting their UIs to use a functional approach. When all the UI components subscribe to the model, and the model subscribes to UI components (even if through the controller), it’s a whole wad of interconnectedness.

Instead, it has been decreed that all data flows from the top. The GUI structure is a function of the model state. If a GUI component wishes to change the model state, that event triggers a regeneration of the GUI structure. Then, for performance, React.js does a comparison of the newly-desired DOM with the existing one, and updates only the parts that have changed.
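A stripped-down sketch of the idea (mine, in Java rather than JavaScript; the real React diffs a virtual DOM tree, not a flat list of strings): the view is a pure function of the model, and after each change we re-render and patch only what actually differs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// One-way data flow: model -> render -> diff -> minimal patch.
public class OneWayUi {
    // The view is computed from model state alone; no subscriptions anywhere.
    static List<String> render(Map<String, Integer> model) {
        List<String> view = new ArrayList<>();
        for (Map.Entry<String, Integer> e : new TreeMap<>(model).entrySet()) {
            view.add("<li>" + e.getKey() + ": " + e.getValue() + "</li>");
        }
        return view;
    }

    // Return the indices whose rendered output changed (additions included;
    // removals are omitted to keep the sketch short).
    static List<Integer> diff(List<String> oldView, List<String> newView) {
        List<Integer> changed = new ArrayList<>();
        for (int i = 0; i < newView.size(); i++) {
            if (i >= oldView.size() || !oldView.get(i).equals(newView.get(i))) {
                changed.add(i);
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        Map<String, Integer> model = new HashMap<>();
        model.put("apples", 1);
        model.put("pears", 2);
        List<String> before = render(model);

        model.put("apples", 5);             // an event updates the model...
        List<String> after = render(model); // ...which regenerates the whole view
        System.out.println(diff(before, after)); // but only one line needs patching
    }
}
```

Regenerating everything sounds wasteful, but the diff step means the expensive part — touching the real UI — stays proportional to what changed.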

Data flowing in one direction is a crucial part of functional programming. Persistent data structures, copy-on-mutate with structural sharing, and two-way links between parts of the structure don’t go together. Choose the first two.
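To make “copy-on-mutate with structural sharing” concrete, here’s a minimal persistent list in Java (a sketch, not a production data structure). Because the links only point one way, two versions can safely share a tail; a back-link would make that sharing impossible.

```java
// A persistent singly-linked list: "mutation" returns a new list that shares
// its tail with the old one. No two-way links, so sharing is safe.
public class PList<T> {
    final T head;
    final PList<T> tail;

    PList(T head, PList<T> tail) { this.head = head; this.tail = tail; }

    static <T> PList<T> cons(T head, PList<T> tail) { return new PList<>(head, tail); }

    public static void main(String[] args) {
        PList<String> base = cons("b", cons("c", null));
        PList<String> withA = cons("a", base); // one new version...
        PList<String> withX = cons("x", base); // ...and another
        // Both versions share the very same tail objects:
        System.out.println(withA.tail == withX.tail); // true
    }
}
```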

In higher-level architectures, microservices are all the rage. Unlike old-style legible architecture diagrams, the dependency diagram in microservices looks like the Death Star. Services connect directly to each other willy-nilly.

There are alternative microservice architectures that, like React.js, get the data flowing in one direction. Fred George describes putting all the messages on a bus (“the rapids”) and letting services spy on the messages relevant to them (“the river”). The only output a service has is more messages, delivered into the rapids for any other service to consume.
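A tiny sketch of the rapids-and-rivers shape, with all names invented (a real bus would be asynchronous and durable): every service sees every message, filters its own river out of the flow, and its only output is more messages back onto the bus.

```java
import java.util.ArrayList;
import java.util.List;

// "Rapids and rivers" in miniature: one bus, services that filter it,
// and no service ever addressing another directly.
public class Rapids {
    interface Service { void onMessage(String message, Rapids bus); }

    private final List<Service> services = new ArrayList<>();
    final List<String> log = new ArrayList<>();

    void register(Service s) { services.add(s); }

    void publish(String message) {
        log.add(message);
        // Deliver to every service; each one decides whether it cares.
        for (Service s : new ArrayList<>(services)) s.onMessage(message, this);
    }

    public static void main(String[] args) {
        Rapids bus = new Rapids();
        // A pricing service that only watches order messages:
        bus.register((msg, b) -> {
            if (msg.startsWith("order:")) b.publish("priced:" + msg.substring(6));
        });
        bus.publish("heartbeat");    // flows past the pricing service, ignored
        bus.publish("order:widget"); // triggers a new message onto the rapids
        System.out.println(bus.log); // [heartbeat, order:widget, priced:widget]
    }
}
```

Note the pricing service never knows who consumes `priced:` messages — which is exactly why new services can be added without anyone upstream changing.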

This is cool in some ways. New services can build on what’s out there, without anyone knowing to send anything to them directly. However, the dependencies still exist. And it’s slower than direct connections.

What about OSGi? (This also came up in a QCon session.) It’s a very well-developed solution to this on the JVM: anyone can look for a particular service and get connected up through a trusted broker. Once the connection is made, it’s direct, with no more overhead. Is that ideal?

Adam’s talk “The Functional Programming Concepts in Facebook’s Mobile Apps” is on InfoQ.

Limitations of Abstraction, and the Code+Coder symbiosis

Notes from #qconnewyork

I went into programming because I loved the predictability of it. Unlike physics, programs were deterministic at every scale. That’s not true anymore – and it doesn’t mean programming isn’t fun. This came out in some themes of QCon New York 2014.

In the evening keynote, Peter Wang told us we’ve been sitting pretty on a stable machine architecture for a long time, and that party is over. The days of running only on x86 architecture are done. We can keep setting up our VMs and pretending, or we can pay attention to the myriad devices cropping up faster than people can build strong abstractions on top of them. The Stable Dependencies Principle is crumbling under us.

Really we haven’t had a good, stable architecture to build on since applications moved to the web, as Gilad Bracha reminded us in the opening keynote. JavaScript has limitations, but even more, the different browsers keep programmers walking on eggshells trying not to break any of them. The responsibility of a developer is no longer just their programming language. They need to know where their code is running and all the relevant quirks of the platform. “It isn’t turtles all the way down anymore. We are the bottom turtle, or else the turtle under you eats your lunch.” @pwang

As a developer’s scope deepens, it is also widening. Dianne Marsh’s keynote and Adrian Cockcroft’s session about how services are implemented at Netflix emphasized developer responsibility through the whole lifecycle of the code: a developer’s job ends when the code is retired from production. Dianne’s mantra of “Know your service” puts the power to find a problem in the same hands that can fix it. Individual developers implement microservices, deploy them gradually to production, and monitor them. Developers understand the business context of their work, and what it means for it to be successful.

It’d be wonderful to have all the tech and business knowledge in one head. What stops us is technical indigestion: toooo much information! The Netflix solution to this is great tooling. When a developer needs to deploy, it’s their job to know what the possible problems are. It is the tool’s job to know how to talk to AWS, how to find out the status of running deployments, how to reroute between old-version and new-version deployments. The tool gives all the pertinent information to the person deploying, and the person makes the decisions. Enhanced cognition, just like Engelbart always wanted (from @pwang’s keynote).
“When you have automation plus people, that’s when you get something useful.” – Jez Humble
“Free the People. Optimize the Tools.”- Dianne Marsh

Those gradual rollouts are one of the new possibilities now that machines aren’t physical resources in data centers. We can deploy with less risk, because rollback becomes simply a routing adjustment. Lowering the impact of failure lets us take more risks, make more changes, and improve faster without impacting availability. To learn and innovate, do not prevent failure! Instead, detect it and stay on your feet.

This changed deployment process is an example of something Adrian Cockcroft emphasizes: question assumptions. What is free that used to be expensive? What can we do with that, that we couldn’t before? One answer is immutable code: every version of a service stays available until someone makes the decision to take it down. And since you’re on pager duty for all your deployed code, there’s incentive to take it down.

When developers are responsible for the code past writing it, through testing and deploy and production, this completes a feedback loop. Code quality goes up, because the consequences of bugs fall directly on the person who can prevent them. This is a learning opportunity for the developer. It’s also a learning opportunity for the code! Code doesn’t learn and grow on its own, but widen the lines. Group the program in with the programmer into one learning organism, a code+coder symbiote. Then the code in production, as its effects are revealed by monitoring, can teach the programmer how to make it better in the next deployment.

Connection between code and people was the subject of Michael Feathers’ talk. Everyone knows Conway’s Law: architecture mirrors the org chart. Or as he phrases it, communication costs drive structure in software. Why not turn it to our advantage? He proposed structuring the organization around the needs of the code. Balance maintaining an in-depth knowledge base of each application against getting new eyes on it. Boundaries in the code will always follow the communication boundaries of social structure, so divide teams where the code needs to divide, by organization and by room. Eric Evans also suggested using Conway’s Law to maintain boundaries in the code. Both of these talks also emphasized the value of legacy code, and also the need for renewal: as the people turn over, so must the code. Otherwise that code+coder symbiosis breaks down.

Eric Evans emphasized: When you have a legacy app that’s a Big Ball of Mud, and you want to work on it, the key is to establish boundaries. Use social structure to do this, and create an Anti-Corruption Layer to intermediate between the two, and consider using a whole new programming language. This discourages accidental dependencies, and (as a bonus) helps attract good programmers.

Complexity is inevitable in software; bounded contexts are part of the constant battle to keep it from eating us. “We can’t eliminate complexity any more than a physicist can eliminate gravity.” (@pwang)

In code and with people, successful relationships are all about establishing boundaries. At QCon it was a given that people are writing applications as groups of services, and probably running them in the cloud. A service forms a bounded context; each service has its internal model, as each person has a mental model of the world. Communications between services also have their own models. Groups of services may have a shared interstitial context, as people in the same culture have established protocols. (Analogy mine.) No one model covers all communications in the system. This was the larger theme of Eric Evans’ presentation: no one model, or mandate, or principle applies everywhere. The first question of any architecture direction is “When does this apply?”

As programmers and teams are going off in their own bounded contexts doing their own deployments, Jez Humble emphasized the need to come together — or at least bring the code together — daily. You can have a separate repo for every service, like at Netflix, or one humongoid Perforce repository for everything, like at Google. You can work on feature branches or straight on master. The important part is: everyone commits to trunk at the end of the day. This forces breaking big features into small ones; they may hide behind feature flags or unused APIs, but they’re in trunk. And of course that feeds into the continuous deployment pipeline. Prioritize keeping that trunk deployable over doing new work. And when the app is always deployable, a funny thing happens: marketing and developers start to collaborate. There’s no feature freeze, no negotiating over what’s going to be in the next release. As developers take responsibility for the post-coding lifecycle, they gain insight into the pre-coding part too. More learning can happen!

As developers start to follow the code more closely, organizational structure can’t hold to a controlled hierarchy. Handoffs are the enemy of innovation, according to Adrian. The result of many independent services is an architecture diagram that can only be observed from production monitoring, and it looks like the Death Star.

I wonder how long before HR diagrams catch up and look like this too?

Dianne and Jez both used “Highly aligned, loosely coupled” to describe code and organization. Leadership provides direction, and the workers figure out how to reach the target by continually trying things out. Managers enable this experimentation. If the same problem is solved in multiple ways, that’s a win: bring the results together and learn from both. No one solution applies in all contexts.

Overall, QCon New York emphasized: question what you’re used to. Question what’s under you, and question what people say you can’t do. Face up to realities of distributed computing: consistency doesn’t exist, and failure is ever present. We want to keep rolling through failure, not prevent it. We can do this by building tools that support careful decision making. If we each support our code, our code will support our work, and we can all improve.

This post draws from talks by Peter Wang, Dianne Marsh, Adrian Cockcroft, Eric Evans, Michael Feathers, Jez Humble, Ines Sombra, Richard Minerich, and Charles Humble. It also draws from my head.
Most of the talks will be available on InfoQ eventually.