Saturday, May 7, 2016

Tradeoffs in Coordination Among Teams

The other day in Budapest, Jez Humble and I wondered, what is the CAP theorem for teams? In distributed database systems, the CAP theorem says: choose two of Consistency, Availability, and Partitioning — and you must choose Partitioning.
Consider a system for building software together. Unless the software is built by exactly one person, we have to choose Partitioning. We can’t meld minds, and talking is slow.
In databases we choose between Consistency (the data is the same everywhere) and Availability (we can always get the data). As teams grow, we choose between Consensus (doing things for the same reasons in the same way) and Actually-getting-things-done.
Or, letting go of the CAP acronym: we balance Moving Together against Moving Forward.

Moving Together


A group of 1 is the trivial case. Decision-making is the same as consensus. All work is forward work, but output is very limited, and when one person is sick everything stops.

A group of 2-7 is ideal: the communication comes with interplay of ideas, and whole new outputs of dialogue make up for the time cost of talking to each other. It is still feasible for everyone in the group to have a mental model of each other person, to know what that person needs to know. Consensus is easy to reach when every stakeholder is friends with every other stakeholder.

Beyond one team, the tradeoffs begin. Take one team of 2-7 people working closely together. Represent their potential output with this tall, hollow arrow pointing up.


This team is building software to run an antique store. Look at them go, full forward motion. (picture: tall, filled arrow.)

Next we add more to the web site while continuing development on the register point-of-sale tools. We break into two teams. We’re still working with the same database of items, and building the same brand, so we coordinate closely. We leverage each others' tools. More people means more coordination overhead, but we all like each other, so it’s not much burden. We are a community, after all.
A green arrow and a red arrow, each connected by many lines of communication, are filled about halfway up with work.

Now the store is doing well. The web site attracts more retail business, the neighboring antique stores want to advertise their items on our site, everything is succeeding and we add more people. A team for partnerships, which means we need externally-facing reports, which means we need a data pipeline.
A purple arrow and a blue arrow join the red and green ones. Lines crisscross between them, a snarly web. The arrows are filled only a little because of these coordination costs. The purple arrow is less connected, and a bit more full, but it's pointed thirty degrees to the left.

The same level of consensus and coordination isn’t practical anymore. Coordination costs weigh heavily. New people coming in don’t get to build a mental model of everyone who already works there. They don’t know what other people know, or which other people need to know something. If the partnerships team touches the database, it might break point of sale or the web site, so they are hamstrung. Everyone needs to check everything, so the slowest-to-release team sets the pace. The purple team here is spending less time on coordination, so the data pipeline is getting built, but without any ties to the green team, it’s going in a direction that won’t work for point of sale.

This mess scales up in the direction of mess. How do we scale forward progress instead?

Moving Forward


The other extreme is decoupling. Boundaries. A very clear API between the data pipeline, point of sale, and web. Separate databases, duplicating data when necessary. This is a different kind of overhead: more technical, less personal. Break the back-end coupling at the database; break the front-end (API) coupling with backwards compatibility. Teams operate on their own schedules, releases are not coordinated. This is represented by wider arrows, because backwards compatibility and graceful degradation are expensive. 
Four arrows, each wide and short. A few lines connect them. They're filled, but the work went to width (solidness) rather than height (forward progress).

These teams are getting about as far as the communication-burdened teams. The difference is: this does scale out. We can add more teams before coordination becomes a limitation again.

Amazon is an extreme example of this: backwards compatible all the things. Each team Moving Forward in full armor. Everything fully separate, so no team can predict what other teams depend on. This made the AWS products possible. However, this is a ton of technical overhead, and maybe also not the kindest culture to work in.

Google takes another extreme. Their monorepo allows more coupling between teams. Libraries are shared. They make up for this with extreme tooling. Tests, refactoring tools, custom version control and build systems — even whole programming languages. Thousands of engineers work on infrastructure at Google, so that they can Move Together using technical overhead.

Balance


For the rest of us, in companies with 7-1000 engineers, we can’t afford one extreme or the other. We have to ask: where is consensus important? and where is consensus holding us back?

Consensus is crucial in objectives and direction. We are building the same business. The business results we are aiming for had better be the same. We all need to agree on “Which way is up?"

Consensus is crippling at the back end. When we require any coordination of releases. When I can’t upgrade a library without impacting other teams in way I can't predict. When my database change could break a system more production-critical than mine. This is when we are paralyzed. Don't make teams share databases or libraries.

What about leveraging shared tools and expertise? if every team runs its own database, those arrows get really wide really fast, unless they skimp on monitoring and redundancy — so they will skimp and the system will be fragile. We don't want to reinvent everything in every team.

The answer is to have a few wide arrows. Shared tools are great when they’re maintained as internal services, by teams with internal customers. Make the data pipeline serve the partnership and reporting teams. Make a database team to supply well-supported database instances to the other teams. (They’re still separate databases, but now we have shared tools to work with them, and hey look, a data pipeline for syncing between them.)


The green, red, and blue arrows are narrow and tall, and mostly full of work, with some lines connecting them. The purple arrow and a new black arrow are wide and short and full of work. The wide arrows (internal services) are connected to the tall arrows (product teams) through their tips.

Re-use helps only when there is a solid API, when there is no coupling of schedules, and when the providing team focuses on customer service.

Conclusions


Avoid shared code libraries, unless you’re Google and have perfect test coverage everywhere, or you’re Amazon and have whole teams supporting those libraries with backwards compatibility.
Avoid shared database instances, but build internal teams around supporting common database tools.

Encourage shared ideas. Random communication among people across an organization has huge potential. Find out what other teams are doing, and that can refine your own direction and speed your development — as long as everything you hear is information, not obligation.

Reach consensus on what we want to achieve, why we are doing it, and how (at a high level) we might achieve it. Point in the same direction, move independently.

Every organization is a distributed system, even when we sit right next to each other. Coordination makes joint activity possible, but not free. Be conscious of the tradeoffs as your organization grows, as consensus becomes less useful and more expensive. Recognize that new people entering the organization experience higher coordination costs than you do. Let teams move forward in their own way, as long as we move together in a common direction. Distributed systems are hard, and we can do it.







Bonus material 

Here is a picture of Jez in Budapest.



And here is a paper about coordination costs:
Common Ground and Coordination in Joint Activity