In the Cold War, political and technical considerations were no longer separable. The President got a Science Advisory Committee, but “apparently… scientists must not concern themselves with devising and proposing policies; they ought to limit themselves to answering such technical questions as they may be asked.” (Leo Szilard, physicist)
Yikes! Sounds like old style software development, where the programmers receive the requirements from the business.
I think we’ve learned better than that. Many of the most successful companies are led by technical people. We need the business experts and software developers working together. The business doesn’t know all the questions developers can answer, and devs don’t know what questions to ask the business — until we start implementing. Then, necessary questions rise to the surface, and lead to discussions which include more useful questions.
If developers are “on tap” as a resource, we can’t create anything better than the business can specify (and believe me, that list of requirements is no specification). Our collective imagination is better than either alone.
Asking useful questions is the hard part. Collaborate on it.
Back in the 2000s, we wrote applications in layers.
Presentation layer, client, data transfer, API, business logic, data access, database. We maintained strict separation between these layers, even though every new feature changed all of them. Teams organized around these layers. Front end, back end, DBAs.
Layers crisscrossed the flow of data.
Responsibility for making any one thing work fell across many teams.
Interfaces between teams updated with every application change.
Development was slow and painful.
In 2019, we write applications in layers.
A business unit is supported by a feature team. Feature teams are supported by platforms, tooling, UI components. All teams are supported by software as a service from outside the company.
Back in the day, front end, back end, operations, and DBAs separated because they needed different skills. Now we accept that a software team needs all the skills. We group by responsibility instead — responsibility for business value, not for activities.
Supporting teams provide pieces in which consistency is essential: UI components and some internal libraries.
Interfaces between teams change less frequently than the software changes.
Layers crisscross the flow of value.
Feature teams need to do everything, from the old perspective. But that’s too hard for one team — so we make it easier.
This is where Developer Experience (DevEx) teams come in. (a.k.a. Developer Productivity, Platform and Tools, or inaccurately DevOps Teams.) These undergird the feature teams, making their work smoother. Self-service infrastructure, smooth setup of visibility and control for production software. Tools and expertise to help developers learn and do everything necessary to fulfill each team’s purpose.
Internal services are supported by external services. Managed services like Kubernetes, databases, queueing, observability, logging: we have outsourced the deep expertise of operating these components. Meanwhile, internal service teams like DevEx have enough understanding of the details, plus enough company-specific context, to mediate between what the outside world provides and what feature teams need.
This makes development smoother, and therefore faster and safer.
We once layered by serving data to software. Now we layer by serving value to people.
Today I typed psql to start a database session. That put me in the wrong place, so I typed \connect org_viz to get into the database I wanted.
But then I stopped myself, quit psql, and typed psql -d org_viz at the command prompt.
It smooths my work. I knew I would exit and re-enter that database session several times today, and this way pushing up-arrow to get to the last command would get me to the right command. No more “oh, right, I have to \connect” for today.
It makes my work more reproducible. As a dev, every command I type at a shell or REPL is either an experiment or an action. If it’s an experiment, I’ll do different things as fast as I can. If it’s an action, I want it to express my intention.
What I’m not doing is meandering around a toilsome path to complete some task that I know perfectly well how to do. Once known, all those steps belong in one repeatable, intention-expressing automation.
Correcting the command I typed is a tiny thing. It expresses a development aesthetic: repeatability. If I’m not exploring, I’m executing, and I execute in a repeatable fashion. I executed that tiny command to open the database I wanted. Then I re-used it a dozen times. Frustration saved, check. Developer aesthetic satisfied, check.
When we build software, we aren’t building it in nowhere. We aren’t building a closed system that doesn’t interact with its environment. We aren’t building it for our own computer (unless we are; personal automation is fun). We are building it for a purpose. Chances are, we build it for a unique purpose — because why else would they pay us to do it?
Understanding that surrounding system, the “why” of our product and each feature, makes a big difference in making good design decisions within the system.
It’s like, the system we’re building is our own house. We build on a floor of infrastructure other people have created (language, runtime, dependency manager, platform), making use of materials that we find in the world (libraries, services, tools). We want to understand how those work, and how our own software works. This is all inside our house.
To do that well, keep the windows open. Look outside, ask questions of the world. What purpose is our system serving? What effects does it have, and what effects from other subsystems does it strengthen?
Whenever you’re designing something, the first step is: What is the system my system lives in? I need to understand that system to understand what my system does.
It is a big world out there, and these are questions we can never answer completely. It’s tempting to stay indoors where it’s warm. We can’t know everything, but we gotta try for more.
Today in a keynote at Spring One, Tom Gianos from Netflix talked about their internal data platform. He listed several components, ending with a quick mention of the “Insights Services” team, which studies how the platform is used inside Netflix. A team of people that learns about how internal teams use an internal platform to learn about whatever they’re doing. This is some higher-order learning going on.
It’s like, a bunch of teams are making shows for customers. They want to get better at that, so they need data about how the shows are being watched.
So, Netflix builds a data platform, and some teams work on that. The data platform helps the shows teams (and whatever other teams, I’m making this up) complete a feedback loop, so they can get better at making shows.
Then the data platform teams want to make a better data platform, so an Insights Services team collects data about how the data platform itself is used. I’m betting they use the data platform for that. I also bet they talk to people on the shows teams. Then Insights Services closes that feedback loop with the data platform team, so that Netflix can get better at getting better at making shows.
Essential links in these loops include telemetry in all these platforms. The software that delivers shows to customers is emitting events. The data platform jobs are emitting events about what they’re doing and for whom.
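To make that concrete, here is a minimal sketch of the kind of telemetry emission described above. All the names here are invented for illustration — this is not Netflix’s actual event schema, and a real system would publish to a queue or log stream rather than an in-memory array.

```typescript
// Hypothetical sketch: emitting an event alongside the work itself.
// In software, "reporting what you're doing" is a few lines, not extra labor.
interface JobEvent {
  job: string;          // which data-platform job ran
  requestedBy: string;  // for whom it ran
  durationMs: number;
  succeeded: boolean;
}

const events: JobEvent[] = []; // stand-in for a real event pipeline

function emit(event: JobEvent): void {
  events.push(event); // in reality: publish to a queue or log stream
}

// Wrap any job so it reports what it did and for whom, even on failure.
function runJob(job: string, requestedBy: string, work: () => void): void {
  const start = Date.now();
  let succeeded = true;
  try {
    work();
  } catch (e) {
    succeeded = false;
    throw e;
  } finally {
    emit({ job, requestedBy, durationMs: Date.now() - start, succeeded });
  }
}
```

The point of the wrapper: the reporting happens in `finally`, so the feedback loop gets data whether the job worked or not.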
When a human does a job, reporting what they’re doing is extra work for them. (Usually flight attendants write drink orders on paper, or keep them in memory. The other day I saw them entering orders into iPads. Guess which was faster.) In any human system, gathering information costs money, time, and customer service. In a software system, it’s a little extra network traffic. Woo.
Software systems give us the ability to study them. To really find out what’s going on, what was working, and what wasn’t. The Insights Services team, as part of the data platform organization, can form hypotheses and then test them, adding telemetry as needed. As a team with internal customers, they can talk to the humans to find out what they’re missing. They can get both the data they think they need, and a glimpse into everything else.
Software organizations are a beautiful opportunity for learning about systems. We can do science here: a kind of science where we don’t try to find universal laws, and instead try to find the forces at work in our local situation, learn them and then sometimes change them.
When we get better at getting better — wow. That adds up to some serious acceleration over time. With learning loops about learning loops, Netflix has impressive and growing advantages over competitors.
The other day we had problems with a service dying. It ran out of memory, crashing and failing to respond to all open requests. That service was running analyses of repositories, digging through their files to report on the condition of the code.
It ran out of memory trying to analyze a particular large repository with hundreds of projects within it. This is a monorepo, and Atomist is built to help with microservices — code scattered across many repositories.
This particular day, Rod Johnson and I paired on this problem, and we found a solution that neither of us would have found alone. His instinct was to work on the program, tweaking the way it caches data, until it could handle this particular repository. My reaction was to widen the problem: we’re never going to handle every repository, so how do we fail more gracefully?
The infrastructure of the software delivery machine (the program that runs the analyses) can limit the number of concurrent analyses, but it can’t know how big a particular one will be.
However, the particular analysis can get an idea of how big it’s going to be. In this case, one Aspect finds interior projects within the repository under scrutiny. My idea was: make that one special, run it first, and if there are too many projects, decline to do the analysis.
Rod, as a master of abstraction, saw a cleaner way to do it. He added a veto functionality, so that any Aspect can declare itself smart enough to know whether analysis should continue. We could add one that looks at the total number of files, or the size of the files.
We added a step to the analysis that runs these Vetoing Aspects first. We made them return not only “please stop,” but a reason for that stop. Then we put that into the returned analysis.
The result is: for too-large repositories, we can give back a shorter analysis that communicates: “There are too many projects inside this repository, and here is the list of them.” That’s the only information you get, but at least you know why that’s all you got.
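Here is a minimal sketch of the veto pattern described above. The type and function names are invented for illustration — this is not the real Atomist API, just the shape of the idea: run the vetoing aspects first, and carry the reason for stopping back in the result.

```typescript
// Hypothetical veto pattern: any aspect can declare itself smart enough
// to know whether analysis should continue.
interface VetoResult {
  stop: boolean;
  reason?: string; // returned in the analysis so humans know why it stopped
}

interface Repo {
  projectPaths: string[];
}

type VetoingAspect = (repo: Repo) => VetoResult;

// Example veto: too many interior projects means the analysis won't fit.
const tooManyProjects: VetoingAspect = (repo) =>
  repo.projectPaths.length > 100
    ? {
        stop: true,
        reason: `There are too many projects inside this repository: ${repo.projectPaths.length}`,
      }
    : { stop: false };

// Run the vetoing aspects before the expensive work. If any one says stop,
// return a short analysis carrying the reason, instead of crashing later.
function analyze(
  repo: Repo,
  vetoes: VetoingAspect[]
): { complete: boolean; reason?: string } {
  for (const veto of vetoes) {
    const result = veto(repo);
    if (result.stop) {
      return { complete: false, reason: result.reason };
    }
  }
  return { complete: true }; // ...the full analysis would run here
}
```

The design choice: the infrastructure stays generic (it just runs vetoes first), while the domain knowledge about what “too big” means lives in individual aspects.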
And nothing else dies. The service doesn’t crash.
When a program identifies a case it can’t handle and stops, then it doesn’t take out a bunch of innocent-bystander requests. It gives a useful message to humans, who can then make the work easier, or optimize the program until it can handle this case, or add a way to override the precaution. This is a collaborative automation.
When you can’t solve a problem completely, step back and ask instead: can I know when to stop? “FYI, I can’t do this because…” is more useful than OutOfMemoryError.
In business, we want to focus on our core domain, and let everything else be “good enough.” We need accounting, payroll, travel. But we don’t need those to be special if our core business is software for hospitals.
As developers, we want to focus on changing our software, because that is our core work. We want other stuff, such as video conferencing, email, and blog platforms to be “good enough.” It should just work, and get out of our way.
The thing is: “good enough” doesn’t stay good enough. Who wants to use Concur for booking travel? No one. It’s incredibly painful and way behind the modern web applications we use for personal travel. Forcing your people into an outdated travel booking system holds them back and makes recruiting a little harder.
When we rent software as a service, it can keep improving. I shuddered the last time I got invited to a WebEx, but it’s better than it used to be. WebEx is not as slick as Zoom, but it was fine.
There is a lot of value in continuing with the same product that your other systems and people integrate with, and having it improve underneath you. Switching is expensive, especially in the focus it takes. But it beats keeping the anachronism.
DevOps says, “If it hurts, do it more.” This drives you to improve processes that are no longer good enough. Now and then you can turn a drag into a competitive advantage. Now and then, like with deployment, you find out that what you thought was your core business (writing code) is not core after all. (Operating useful software is.)
Limiting what you focus on is important. Let everything else be “good enough,” but check it every once in a while to make sure it still is. Ask the new employee, “What around here seems out of date compared to other places you’ve worked?” Or try a full week of mob programming, and notice when it gets embarrassing to have six people in the same drudgery.
When you have a common piece of functionality to share between two apps, do you make a library for them to share, or break out a service?
The biggest difference between publishing a library and operating a service is: who controls the pace of change.
If you publish a library for people to use, you can put out new versions, but it is up to each application’s team to incorporate that new version. Upgrades are gradual and may never fully happen.
If you operate a service, you control when upgrades happen. You put the new code in production, and poof, everyone is using it. You can upgrade multiple times per day, and you control when each is complete.
If you need certain logic to be consistent between applications, consider making a service. Control the pace of change.
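The difference can be sketched in a few lines. All names here are invented for illustration; the “service” is just a function behind a stable interface, standing in for an HTTP endpoint.

```typescript
// Library world: each app compiles against the version it chose.
// Publishing v2 doesn't change anyone's behavior until they upgrade.
const shippingV1 = (items: number): number => items * 5; // flat 5 per item
const shippingV2 = (items: number): number => items * 4; // new, cheaper logic

// Service world: one deployed implementation behind a stable interface.
// Every caller goes through the boundary, so the operator controls the version.
let deployed = shippingV1; // what's in production right now

const shippingService = (items: number): number => deployed(items);

// "Deploying" the new code upgrades every consumer at once.
function deployNewVersion(): void {
  deployed = shippingV2;
}
```

With the library, consistency depends on every team upgrading; with the service, one deploy and everyone computes shipping the same way.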
Avdi Grimm describes how the book Vehicles illustrates how simple parts can compose a very complex system.
Another example is Conway’s Game of Life: a nice organized grid, uniform time steps, and four tiny rules. These simple pieces combine to make all kinds of wild patterns, and even look alive!
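Those four rules fit in a few lines. Here is a straightforward sketch on a bounded grid (the classic Game of Life is an infinite grid; the edges here are treated as dead, which is a simplification):

```typescript
// Conway's Game of Life: simple pieces -- a grid, uniform steps, four rules.
type Grid = boolean[][];

function step(grid: Grid): Grid {
  const rows = grid.length;
  const cols = grid[0].length;

  // Count the live neighbors of cell (r, c); off-grid cells count as dead.
  const neighbors = (r: number, c: number): number => {
    let n = 0;
    for (let dr = -1; dr <= 1; dr++) {
      for (let dc = -1; dc <= 1; dc++) {
        if (dr === 0 && dc === 0) continue;
        const rr = r + dr;
        const cc = c + dc;
        if (rr >= 0 && rr < rows && cc >= 0 && cc < cols && grid[rr][cc]) n++;
      }
    }
    return n;
  };

  return grid.map((row, r) =>
    row.map((alive, c) => {
      const n = neighbors(r, c);
      // The four rules collapse to: a live cell with 2 or 3 neighbors
      // survives; a dead cell with exactly 3 neighbors is born;
      // everything else dies or stays dead.
      return alive ? n === 2 || n === 3 : n === 3;
    })
  );
}
```

From rules this small, gliders, oscillators, and self-sustaining patterns emerge — the complexity is in the interactions, not the parts.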
These systems are complex: hard to predict; forces interact to produce surprising behavior.
They are not complicated: they do not have a bunch of different parts, nor a lot of little details.
In software, a monolith is complicated. It has many interconnected parts, each full of details. Conditionals abound, forming intricate rules that may surprise us when we read them.
But we can read them.
As Avdi points out, microservices are different. Break that monolith into tiny pieces, and each piece can be simple. Put those pieces together in a distributed system, and … crap. Distributed systems are inherently complex. Interactions can’t be predicted. Surprising behaviors occur.
Complicated programs can be understood; it’s just a lot to hold in your head. Build a good enough model of the relevant pieces, and you can reason out the consequences of change.
Complex systems can’t be fully understood. We can notice patterns, but we can’t guarantee what will happen when we change it.
Be careful in your aim for simplicity: don’t trade a little complication for a lot of complexity.