Work with the business, not for it

Scientists should be on tap but not on top.

Winston Churchill

In the Cold War, political and technical considerations were no longer separable. The President got a Science Advisory Committee, but “apparently… scientists must not concern themselves with devising and proposing policies; they ought to limit themselves to answering such technical questions as they may be asked.” (Leo Szilard, physicist)

Yikes! Sounds like old-style software development, where the programmers receive the requirements from the business.

I think we’ve learned better than that. Many of the most successful companies are led by technical people. We need the business experts and software developers working together. The business doesn’t know all the questions developers can answer, and devs don’t know what questions to ask the business — until we start implementing. Then, necessary questions rise to the surface, and lead to discussions which include more useful questions.

If developers are “on tap” as a resource, we can’t create anything better than you can specify (and believe me, that list of requirements is no specification). Our collective imagination is better than either alone.

Asking useful questions is the hard part. Collaborate on it.

Layers in software: from data to value

Then

Back in the 2000s, we wrote applications in layers.

Presentation layer, client, data transfer, API, business logic, data access, database. We maintained strict separation between these layers, even though every new feature changed all of them. Teams organized around these layers. Front end, back end, DBAs.

Each layer of software is a wide box, next to its team.
They stack on top of each other: frontend stuff, backend stuff, database, each with its team.
At the top are some customers. Value flows from them to the db and back, crossing all the layers.
Business value exists only by flowing through all the layers to the DB and back.

Layers crisscrossed the flow of data.

Responsibility for any one thing to work fell across many teams.

Interfaces between teams updated with every application change.

Development was slow and painful.

Now

In 2019, we write applications in layers.

A business unit is supported by a feature team. Feature teams are supported by platforms, tooling, UI components. All teams are supported by software as a service from outside the company.

Feature teams at the top of the software are multicolored, with multiple components in their software.
Under them are platform and component teams, each different.
Under them are nice square boxes of external services.
Business value flows through the top layer (feature teams), staying close to the business people.
Developer value flows between the feature teams, through the internal teams, to external services and back.
Business value is concentrated in the feature teams; developer value flows through support teams and external services.

Back in the day, front end, back end, operations, and DBAs separated because they needed different skills. Now we accept that a software team needs all the skills. We group by responsibility instead — responsibility for business value, not for activities.

Supporting teams provide pieces in which consistency is essential: UI components and some internal libraries.

Interfaces between teams change less frequently than the software changes.

Layers crisscross the flow of value.

DevEx

Feature teams need to do everything, from the old perspective. But that’s too hard for one team — so we make it easier.

This is where Developer Experience (DevEx) teams come in. (Also known as Developer Productivity, Platform and Tools, or, inaccurately, DevOps teams.) These undergird the feature teams, making their work smoother. They provide self-service infrastructure and smooth setup of visibility and control for production software, plus tools and expertise to help developers learn and do everything necessary to fulfill each team’s purpose.

Internal services are supported by external services. Managed services like Kubernetes, databases, queueing, observability, logging: we have outsourced the deep expertise of operating these components. Meanwhile, internal service teams like DevEx have enough understanding of the details, plus enough company-specific context, to mediate between what the outside world provides and what feature teams need.

This makes development smoother, and therefore faster and safer.

We once layered by serving data to software. Now we layer by serving value to people.

Morning Stance

It is 7:09. One child is out, and I have returned to bed. Alexa will wake me at 7:15.

Six minutes: I could make my bed or do tiny morning yoga. Six minutes of rest is useless; I’ll feel worse afterward. What am I likely to do?

I picture the probability space in front of me. Intention, habit, and a better start to the day push me toward yoga. Yet there’s a boundary there, a blockage: it is my current stance.

At 7:09, if I were standing, I’d likely do yoga. But at 7:09 and horizontal, I’m gonna stay horizontal. Only a change in surrounding conditions (beep, beep, beep!) will trigger motion.

Cat Swetel talks about stances. By changing your stance, you change your inclinations.

It is 7:10. I choose to change my stance. I stand up.

I make my bed.

One deliberate change of stance, and positive habits and intentions take it from there.

Developer aesthetic: a command line

Today I typed psql to start a database session. That put me in the wrong place, so I typed \connect org_viz to get into the database I wanted.

But then I stopped myself, quit psql, and typed psql -d org_viz at the command prompt.

Why?

It smooths my work. I knew I would exit and re-enter that database session several times today, and this way pressing up-arrow would bring back the right command. No more “oh, right, I have to \connect” for today.

It makes my work more reproducible. As a dev, every command I type at a shell or REPL is either an experiment or an action. If it’s an experiment, I’ll do different things as fast as I can. If it’s an action, I want it to express my intention.

What I’m not doing is meandering around a toilsome path to complete some task that I know perfectly well how to do. Once known, all those steps belong in one repeatable, intention-expressing automation.

Correcting the command I typed is a tiny thing. It expresses a development aesthetic: repeatability. If I’m not exploring, I’m executing, and I execute in a repeatable fashion. I executed that tiny command to open the database I wanted. Then I re-used it a dozen times. Frustration saved, check. Developer aesthetic satisfied, check.

Don’t build systems. Build subsystems.

Always consider your design a subsystem.

Jabe Bloom

When we build software, we aren’t building it in nowhere. We aren’t building a closed system that doesn’t interact with its environment. We aren’t building it for our own computer (unless we are; personal automation is fun). We are building it for a purpose. Chances are, we build it for a unique purpose — because why else would they pay us to do it?

Understanding that surrounding system, the “why” of our product and each feature, makes a big difference in making good design decisions within the system.

It’s like, the system we’re building is our own house. We build on a floor of infrastructure other people have created (language, runtime, dependency manager, platform), making use of materials that we find in the world (libraries, services, tools). We want to understand how those work, and how our own software works. This is all inside our house.

To do that well, keep the windows open. Look outside, ask questions of the world. What purpose is our system serving? What effects does it have, and what effects from other subsystems does it strengthen?

Whenever you’re designing something, the first step is: What is the system my system lives in? I need to understand that system to understand what my system does.

Jabe Bloom

It is a big world out there, and these are questions we can never answer completely. It’s tempting to stay indoors where it’s warm. We can’t know everything, but we gotta try for more.

Nested learning loops at Netflix

Today in a keynote at Spring One, Tom Gianos from Netflix talked about their internal data platform. He listed several components, ending with a quick mention of the “Insights Services” team, which studies how the platform is used inside Netflix. A team of people that learns about how internal teams use an internal platform to learn about whatever they’re doing. This is some higher-order learning going on.

It’s like, a bunch of teams are making shows for customers. They want to get better at that, so they need data about how the shows are being watched.

So, Netflix builds a data platform, and some teams work on that. The data platform helps the shows teams (and whatever other teams, I’m making this up) complete a feedback loop, so they can get better at making shows.

diagram: customers get shows from the show team; that interaction sends something to the data platform, which sends something to the shows team. That interaction (between the shows team and the data platform) sends something to the Insights Services team, which sends info to the data platform team.

Then the data platform teams want to make a better data platform, so an Insights Services team collects data about how the data platform itself is used. I’m betting they use the data platform for that. I also bet they talk to people on the shows teams. Then Insights Services closes that feedback loop with the data platform team, so that Netflix can get better at getting better at making shows.

Essential links in these loops include telemetry in all these platforms. The software that delivers shows to customers is emitting events. The data platform jobs are emitting events about what they’re doing and for whom.

When a human does a job, reporting what they’re doing is extra work for them. (Usually flight attendants write drink orders on paper, or keep them in memory. The other day I saw them entering orders into iPads. Guess which was faster.) In any human system, gathering information costs money, time, and customer service. In a software system, it’s a little extra network traffic. Woo.
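To make "emitting events about what they’re doing and for whom" concrete, here is a minimal sketch. It is not Netflix's code; the function name, the event fields, and the `emit` callback are all made up for illustration. The point is only that the reporting rides along with the work at almost no cost.

```python
import time
from typing import Any, Callable, Dict

# Hypothetical event shape: a plain dictionary. A real platform would have a schema.
Event = Dict[str, Any]


def run_job(name: str, requested_by: str, work: Callable[[], Any],
            emit: Callable[[Event], None]) -> Any:
    """Run `work`, emitting events about what the job is doing and for whom."""
    started = time.monotonic()
    emit({"event": "job_started", "job": name, "requested_by": requested_by})
    try:
        result = work()
    except Exception as error:
        # Failures are telemetry too; report, then let the error propagate.
        emit({"event": "job_failed", "job": name, "requested_by": requested_by,
              "error": type(error).__name__})
        raise
    emit({"event": "job_finished", "job": name, "requested_by": requested_by,
          "seconds": time.monotonic() - started})
    return result
```

In this sketch `emit` could be as simple as `events.append` in a test, or something that sends the event over the network in production. The job itself does not care.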

Software systems give us the ability to study them. To really find out what’s going on, what was working, and what wasn’t. The Insights Services team, as part of the data platform organization, can form hypotheses and then test them, adding telemetry as needed. As a team with internal customers, they can talk to the humans to find out what they’re missing. They can get both the data they think they need, and a glimpse into everything else.

Software organizations are a beautiful opportunity for learning about systems. We can do science here: a kind of science where we don’t try to find universal laws, and instead try to find the forces at work in our local situation, learn them and then sometimes change them.

When we get better at getting better — wow. That adds up to some serious acceleration over time. With learning loops about learning loops, Netflix has impressive and growing advantages over competitors.

Don’t just keep trying; report your limits

The other day we had problems with a service dying. It ran out of memory and crashed, leaving every open request unanswered. That service was running analyses of repositories, digging through their files to report on the condition of the code.

It ran out of memory trying to analyze a particular large repository with hundreds of projects within it. This is a monorepo, and Atomist is built to help with microservices — code scattered across many repositories.

This particular day, Rod Johnson and I paired on this problem, and we found a solution that neither of us would have found alone. His instinct was to work on the program, tweaking the way it caches data, until it could handle this particular repository. My reaction was to widen the problem: we’re never going to handle every repository, so how do we fail more gracefully?

The infrastructure of the software delivery machine (the program that runs the analyses) can limit the number of concurrent analyses, but it can’t know how big a particular one will be.

However, the particular analysis can get an idea of how big it is going to be. In this case, one Aspect finds interior projects within the repository under scrutiny. My idea was: make that one special, run it first, and if there are too many projects, decline to do the analysis.

Rod, as a master of abstraction, saw a cleaner way to do it. He added a veto functionality, so that any Aspect can declare itself smart enough to know whether analysis should continue. We could add one that looks at the total number of files, or the size of the files.

We added a step to the analysis that runs these Vetoing Aspects first. We made them return not only “please stop,” but a reason for that stop. Then we put that into the returned analysis.

The result is: for too-large repositories, we can give back a shorter analysis that communicates: “There are too many projects inside this repository, and here is the list of them.” That’s the only information you get, but at least you know why that’s all you got.
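Here is roughly what that looks like, as a minimal sketch. This is not Atomist's actual code or API: the names `Aspect`, `Veto`, `too_many_projects`, and `analyze_repo` are invented for illustration, and a repository is modeled as nothing more than a dictionary from interior project name to file count.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

# Assumption for this sketch: a repository is {interior project name: file count}.
Repo = Dict[str, int]


@dataclass
class Veto:
    """A "please stop" that carries its reason."""
    reason: str
    details: List[str] = field(default_factory=list)


@dataclass
class Aspect:
    name: str
    analyze: Callable[[Repo], Any]
    # Only aspects smart enough to know when to stop supply a veto check.
    veto: Optional[Callable[[Repo], Optional[Veto]]] = None


def too_many_projects(limit: int) -> Callable[[Repo], Optional[Veto]]:
    """Build a veto check that declines repositories with more than `limit` projects."""
    def check(repo: Repo) -> Optional[Veto]:
        if len(repo) > limit:
            return Veto(
                reason=f"There are too many projects inside this repository "
                       f"({len(repo)}, limit {limit})",
                details=sorted(repo),
            )
        return None
    return check


def analyze_repo(repo: Repo, aspects: List[Aspect]) -> Dict[str, Any]:
    # Run the vetoing aspects first, before any expensive work.
    for aspect in aspects:
        if aspect.veto is not None:
            veto = aspect.veto(repo)
            if veto is not None:
                # Shorter analysis: why we stopped, and what we saw.
                return {"vetoed_by": aspect.name,
                        "reason": veto.reason,
                        "details": veto.details}
    return {"results": {aspect.name: aspect.analyze(repo) for aspect in aspects}}
```

Another vetoing aspect, say one that checks the total number of files, would plug into the same loop without any change to `analyze_repo`.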

And nothing else dies. The service doesn’t crash.

When a program identifies a case it can’t handle and stops, then it doesn’t take out a bunch of innocent-bystander requests. It gives a useful message to humans, who can then make the work easier, or optimize the program until it can handle this case, or add a way to override the precaution. This is a collaborative automation.

When you can’t solve a problem completely, step back and ask instead: can I know when to stop? “FYI, I can’t do this because…” is more useful than OutOfMemoryError.

Stick with “good enough,” until it isn’t

In business, we want to focus on our core domain, and let everything else be “good enough.” We need accounting, payroll, travel. But we don’t need those to be special if our core business is software for hospitals.

As developers, we want to focus on changing our software, because that is our core work. We want other stuff, such as video conferencing, email, and blog platforms to be “good enough.” It should just work, and get out of our way.

The thing is: “good enough” doesn’t stay good enough. Who wants to use Concur for booking travel? No one. It’s incredibly painful and way behind the modern web applications we use for personal travel. Forcing your people into an outdated travel booking system holds them back and makes recruiting a little harder.

When we rent software as a service, then it can keep improving. I shuddered the last time I got invited to a WebEx, but it’s better than it used to be. WebEx is not as slick as Zoom, but it was fine.

There is a lot of value in continuing with the same product that your other systems and people integrate with, and having it improve underneath you. Switching is expensive, especially in the focus it takes. But it beats keeping the anachronism.

DevOps says, “If it hurts, do it more.” This drives you to improve processes that are no longer good enough. Now and then you can turn a drag into a competitive advantage. Now and then, like with deployment, you find out that what you thought was your core business (writing code) is not core after all. (Operating useful software is.)

Limiting what you focus on is important. Let everything else be “good enough,” but check it every once in a while to make sure it still is. Ask the new employee, “What around here seems out of date compared to other places you’ve worked?” Or try a full week of mob programming, and notice when it gets embarrassing to have six people in the same drudgery.

You might learn something important.

Library vs service: who controls change?

When you have a common piece of functionality to share between two apps, do you make a library for them to share, or break out a service?

The biggest difference between publishing a library or operating a service is: who controls the pace of change.

If you publish a library for people to use, you can put out new versions, but it is up to each application’s team to incorporate that new version. Upgrades are gradual and may never fully happen.

If you operate a service, you control when upgrades happen. You put the new code in production, and poof, everyone is using it. You can upgrade multiple times per day, and you control when each is complete.

If you need certain logic to be consistent between applications, consider making a service. Control the pace of change.
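A toy model of the difference, as a sketch only. The `App` and `Service` classes and the integer version numbers are invented for illustration; the only thing modeled is who gets to decide when a new version is in use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class App:
    name: str
    library_version: int  # pinned; changes only when this app's team upgrades

    def upgrade_library(self, version: int) -> None:
        self.library_version = version


class Service:
    def __init__(self, version: int) -> None:
        self.deployed_version = version

    def deploy(self, version: int) -> None:
        # One deploy by the service's owners changes the logic for every caller.
        self.deployed_version = version

    def version_serving(self, app: App) -> int:
        # Every app gets whatever is in production right now.
        return self.deployed_version


def library_versions_in_use(apps: List[App]) -> List[int]:
    """Distinct library versions across all apps, oldest first."""
    return sorted({app.library_version for app in apps})
```

Publish version 2 of the library and have only one of two apps upgrade: `library_versions_in_use` reports both 1 and 2, for as long as the other team takes. Call `deploy(2)` on the service and every app is served by version 2 immediately.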

(For one flipside, see: From Complicated to Complex)

From complicated to complex

Avdi Grimm describes how the book Vehicles shows that simple parts can compose a very complex system.

Another example is Conway’s Game of Life: a nice organized grid, uniform time steps, and four tiny rules. These simple pieces combine to make all kinds of wild patterns, and even look alive!
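The whole thing fits in a few lines. This is a minimal sketch of one time step under the standard rules, with one assumption of mine: the grid is unbounded and represented as the set of live cell coordinates.

```python
from collections import Counter
from typing import Set, Tuple

Cell = Tuple[int, int]


def step(live: Set[Cell]) -> Set[Cell]:
    """Advance the Game of Life by one uniform time step."""
    # Count, for every cell next to at least one live cell, its live neighbors.
    neighbor_counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # The four rules, compressed:
    #   a live cell with fewer than 2 live neighbors dies,
    #   a live cell with 2 or 3 live neighbors survives,
    #   a live cell with more than 3 live neighbors dies,
    #   a dead cell with exactly 3 live neighbors comes alive.
    return {
        cell
        for cell, count in neighbor_counts.items()
        if count == 3 or (count == 2 and cell in live)
    }
```

Feed it a horizontal row of three cells and it flips to a vertical row, then back again. Nothing in the code says "oscillate"; that behavior comes from the interactions.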

These systems are complex: hard to predict; forces interact to produce surprising behavior.

They are not complicated: they do not have a bunch of different parts, nor a lot of little details.

In software, a monolith is complicated. It has many interconnected parts, each full of details. Conditionals abound, forming intricate rules that may surprise us when we read them.

But we can read them.

As Avdi points out, microservices are different. Break that monolith into tiny pieces, and each piece can be simple. Put those pieces together in a distributed system, and … crap. Distributed systems are inherently complex. Interactions can’t be predicted. Surprising behaviors occur.

Complicated programs can be understood; it’s just a lot to hold in your head. Build a good enough model of the relevant pieces, and you can reason out the consequences of change.

Complex systems can’t be fully understood. We can notice patterns, but we can’t guarantee what will happen when we change them.

Be careful in your aim for simplicity: don’t trade a little complication for a lot of complexity.