Let’s reason about behavior

When we learn math, geometry, and logic in school, we’re always talking about things that are holding still. The element is in the set or it isn’t. The angle is acute or obtuse or right or we can’t know. Things are related or not. A thing has a property, or doesn’t.

Code can have properties. A function has side effects, or doesn’t. Data is mutated, or never. An API call can be idempotent. Whole programs can have properties and we like to reason about them.
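To make that concrete, here is a minimal sketch (the function names are mine, purely for illustration) of the kinds of properties we can claim about code in a single process:

```python
# A tiny sketch of properties we can state about code in one process.
# These function names are hypothetical, made up for illustration.

def total_price(items):
    """Pure: no side effects; the same inputs always give the same output."""
    return sum(item["price"] for item in items)

def set_shipping_speed(order, speed):
    """Idempotent: applying it twice leaves the order in the same state as once."""
    order["shipping_speed"] = speed  # mutates its input, so it does have a side effect
    return order
```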

Yesterday I was talking with Will Larson on >Code (episode 142), and he pointed out that once we move beyond a single process, especially when we go to microservices and have oodles of processes running around, we don’t get to talk about properties anymore. Instead, we have behaviors. He said:

Properties, you can statically analyze.
Behaviors, you can verify they happen.

A call to one program, or even a series of calls, can be transactional. Once you’re in a distributed system, that’s not a thing anymore. You can only talk about how the system behaves.

Rein pointed out that in distributed systems, properties are incredibly expensive. Guarantees like all-or-nothing transactions, exactly-once delivery, and consistency are never perfect in the real world, and the closer you try to get to them, the more you pay in money and latency. Coordination is expensive.

So can we get better at verifying and reasoning about behaviors instead?

Will pointed out that fault injection is a way to verify behaviors. That makes sense: psychologists learn a lot about the way we think from a few people with localized brain injuries. Emitting and querying events is another way.
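As a rough sketch of how fault injection verifies a behavior (nothing here comes from a real tool; the names are made up), wrap a dependency so it sometimes fails, then assert on what the system does rather than on a static property:

```python
import random

def flaky(call, failure_rate=0.2):
    """Wrap a dependency so it fails some of the time (the injected fault)."""
    def wrapped(*args, **kwargs):
        if random.random() < failure_rate:
            raise ConnectionError("injected fault")
        return call(*args, **kwargs)
    return wrapped

def fetch_price(item_id):
    return 42  # stand-in for a real network call

fetch_price_with_faults = flaky(fetch_price)

# In a test, wire fetch_price_with_faults into the system and verify the
# behavior you care about: the order still completes, the user sees a
# graceful error, or a retry happens within a time budget.
```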

Then how do we reason about behaviors? Systems thinking helps. Will recommends Donella Meadows’s Primer as a start. (I loved that book too.) Also, the social sciences have been studying behaviors forever. Maybe their methods, like Grounded Theory, can help us.

We’re people, right? We have behaviors. If we can get better at naming and reasoning about them, maybe we can get better at being people. It could happen.

Predictability is hard, surprises are everywhere

Today at J on the Beach, Helena Edelson talked about a lot of things that Akka can do for event streaming, microservices, and generally handling a lot of data while keeping it (eventually) consistent across great distances. With opt-in causal consistency!

I am amazed by how hard it is to have consistent causality and data across geographic distance. It is counterintuitive, because our culture emphasizes logical deduction and universality. Science, as we learned it, is straight-line deterministic causality. The difficulty of making systems that work like that across geographic distance illustrates how little of our human-scale world works that way.

the “hard” sciences are narrow.

It is great that people are working on software that can have these properties of consistency, and that they’re studying the limitations. Meanwhile it is also fun to work on the other side of the problem: letting go of perfect predictability, and building (relative) certainty on top of uncertainty. Right now I’m in a talk by Philip Brisk about the 4th Industrial Revolution of IoT sensors and machine learning sensemaking. It turns out: now that we can synthesize DNA in a little machine, we can also find out what DNA was synthesized using only a microphone placed near the machine (and a lot of domain knowledge). Humans are fascinating.

predictability is hard, and surprises are everywhere, and that is beautiful.

Reuse

Developers have a love-hate relationship with code reuse. As in, we used to love it. We love our code and we want it to run everywhere and help everyone. We want to get faster with time by harnessing the work of our former selves.
And yet, we come to hate it. Reuse means dependencies. It means couplings. It means surprises, when changing code impacts something we did not expect, or else it means don’t touch it, it’s too scary. It means trusting code we don’t understand, because it’s code we didn’t write.

Here’s the thing: sharing code is dangerous. Do it sparingly.

When reuse is bad

Let’s talk about sharing code. Take a business, developing software for its employees or its customers. Let’s talk about code within an organization that is referenced in more than one service, or by multiple flows in a monolith. (Monolith is defined as “one deployable unit maintained by more than one small team.”)

Let’s see some pictures. Purple Service here has some classes or functions that it finds useful, and the team thinks these would be useful elsewhere. Purple team breaks this code out into a library, the peachy circle.

purple circle, peach circle inside

Then someone from Purple team joins Blue team, and uses that library in Blue Service. You think it looks like this:

peach circle under blue and purple circles

Nah, it’s really more like this:

purple circle with peach circle inside. Blue circle has a line to peach circle

This is called coupling. When Purple team changes their library, Blue team is affected. (If it’s a monolith, their code changed underneath them. I hope they have good tests.)
Now, you could say: Blue team doesn’t have to update their version. The level of reuse is the release; we broke out the library, so this is fine.

picture of purple with orange circle, blue with peach circle.

At that point you’ve basically forked, the code isn’t shared anymore. When Blue team needs to make their own changes, they first must upgrade, so they get surprised some unpredictable time later. (This happened to us at Outpace all the time with our shared “util” libraries and it was the worst. So painful. Those “timesavers” cost us a lot of time and frustration.)

This shared code is a coupling between two services that otherwise have nothing to do with each other. The whole point of microservices was to decouple! To make it so our changes impact only code that our team operates! That advantage is dead now. And for what?

To answer that, consider the nature of the shared code. Why is it shared?
Perhaps it is unrelated to the business: it is general utilities that would otherwise be duplicated, but we’re being DRY and avoiding the extra work of writing and testing and debugging them a second time. In this case, I propose: cut and paste. Or fork. Or best of all, try a more formalized reuse-without-sharing procedure [link to my next post].

What if this is business-related code? What if we had good reason to DRY it out, because it would be wrong for this code to be different in Purple Service and Blue Service? Well sorry, it’s gonna be different. Purple and Blue do not have the same deployment schedules, that’s the point of decoupling into services. In this case, either you’ve made yourself a distributed monolith (requiring coordinated deployments), or you’re ignoring reality. If the business requires exactly one version of this code, then make it its own service.

picture with yellow, purple, and blue circles separate, dotty lines from yellow to purple and to blue.

Now you’re not sharing code anymore. You’re sharing a service. Changes to Peachy can impact Purple and Blue at the same time, because that’s inherent in this must-be-consistent business logic.

It’s easier with a monolith; that shared code stays consistent in production, because there is only one deployment. Any surprises happen immediately, hopefully in testing. In a monolith, if Peachy is utility classes or functions, and Purple (or Blue) team wants to change them, the safest strategy is: make a copy, use the copy, and change that copy. Over time, this results in less shared code.

This crucial observation is #2 in Modern Software Over-engineering Mistakes by RMX.

“Shared logic and abstractions tend to stabilise over time in natural systems. They either stay flat or relatively go down as functionality gets broader.”

Business software is an expanding problem. It will always grow, and not with more of the same: it will grow in ways you didn’t plan for. This kind of code must optimize for change. Reuse is the enemy of change. (I’m talking about reuse of internal code.)

Back in the beginning, Blue team reused the peach library and saved time writing code. But writing code isn’t the expensive part, compared to changing code. We don’t add features faster as our systems get larger and we have more code hypothetically available for re-use. We add features more slowly, because every change has more impacts and is less safe. Shared code makes change less safe. The only code safe to share is code that doesn’t change. Which means no versioning. Heck, you might as well have cut and pasted it.

When reuse is good

We didn’t advance as an industry by rewriting, or cut and pasting, everything we need over and over. We build on libraries published by developers and companies all over the globe. They release them, we reuse them. Yes, we get into dependency hell, but it beats writing your own web framework. We get reuse not only of the code, but of understanding: Rails knowledge transfers between employers.

There is a tipping point where reuse is magical.

I argue that this point is well past a release, past a separate jar.
It is past a stable API
past a coherent abstraction
past automated tests
past solid documentation…

All these might be achieved within the organization if responsibility for the shared utilities lives in a separate team; you can try to use Conway’s Law to enforce architectural boundaries, but within an org, those boundaries are soft. And this code isn’t your business, and you don’t have incentives to spend the time on these. Why have backwards compatibility when you can perform human coordination instead? It isn’t worth it. In my past organizations, shared code has instead been the responsibility of no one. What starts out as “leverage” becomes baggage, as all the Ruby code is tied to an old version of Sinatra. Some switch to Go to get a clean slate.
Break those chains! Copy the pieces you need out of that internal library and make them yours.

At the level of winning reuse, that code has its own marketing department
its own sales team
its own office manager
its own stock price.

The level of reuse is the company.

(Pay for software.)

When the responsible organization succeeds by making its code stable and backwards-compatible and easy to work with and well-documented and extensively tested, that is code I want to reuse!

In addition to SaaS companies and vendors, there are organizations built around open-source software. This is why we look for packages and frameworks with a broad community around them. Or better, a foundation for keeping shared projects healthy. (Contribute to them.)

Conclusion

Reuse is dangerous because it introduces coupling. Share business code only when that coupling is inherent to the business domain. Share library and utility code only when it is maintained by an organization dedicated to publishing that code. (Same with services. If you can pay for infrastructure-level tools, you’ll get better tools without distracting your organization.)

Why did we want to reuse internal code anyway?
For speed, but speed of change is more important.
For consistency, but that means coupling. Don’t hold your teams back with it.
For propagation of bug fixes, which I’ve not seen happen.

All three of these can be automated [LINK to my next post] without dependencies.

Next time you consider making your code reusable, ask “who will I sell this to?”
Next time someone (including you) suggests you reuse their code, ask “who publishes that?” and if they say “me,” copy it instead.

Provenance and causality in distributed systems

Can you take a piece of data in your system and say what version of code put it in there, based on what messages from other systems? and what information a human viewed before triggering an action?

Me neither.

Why is this acceptable? (because we’re used to it.)
We could make this possible. We could trace the provenance of data. And at the same time, mostly-solve one of the challenges of distributed systems.

Speaking of distributed systems…

In a distributed system (such as a web app), we can’t say for sure what events happened before others. We get into special-relativity complications even at short distances, because information travels through networks at unpredictable speeds. This means there is no single, universal time, no one sequence of events that says what happened before what. There is time-at-each-point, and inventing a convenient fiction to reconcile them is a pain in the butt.

We usually deal with this by funneling every event through a single point: a transactional database. Transactions prevent simultaneity. Transactions are a crutch.

Some systems choose to apply an ordering after the fact, so that no clients have to wait their turn in order to write events into the system. We can construct a total ordering, like the one that the transactional database is constructing in realtime, as a batch process. Then we have one timeline, and we can use this to think about what events might have caused which others. Still: putting all events in one single ordering is a crutch. Sometimes, simultaneity is legit.

When two different customers purchase two different items from two different warehouses, it does not matter which happened first. When they purchase the same item, it still doesn’t matter – unless we only find one in inventory. And even then: what matters more, that Justyna pushed “Buy” ten seconds before Edith did, or that Edith upgraded to 1-day shipping? Edith is in a bigger hurry. Prioritizing these orders is a business decision. If we raise the time-ordering operation to the business level, we can optimize that decision. At the same time, we stop requiring the underlying system to order every event with respect to every other event.
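Here is a sketch of what raising that decision to the business level could look like (the fields and rules are hypothetical): when two orders contend for the last item, priority comes from business rules, not raw arrival time.

```python
# Hypothetical sketch: decide priority with business rules, not arrival order.
def prioritize(orders):
    urgency = {"1-day": 0, "2-day": 1, "standard": 2}
    return sorted(orders, key=lambda o: (urgency[o["shipping"]], o["received_at"]))

orders = [
    {"customer": "Justyna", "shipping": "standard", "received_at": "2018-06-14T10:00:00"},
    {"customer": "Edith", "shipping": "1-day", "received_at": "2018-06-14T10:00:10"},
]

prioritize(orders)  # Edith first, even though Justyna clicked "Buy" earlier
```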

On the other hand, there are events that we definitely care happened in a specific sequence. If Justyna cancels her purchase, that was predicated on her making it. Don’t mix those up. Each customer saw a specific set of prices, a tax amount, and an estimated ship date. These decisions made by the system caused (in part) the customer’s purchase. They must be recorded either as part of the purchase event, or as events that happened before the purchase.

Traditionally we record prices and estimated ship date as displayed to the customer inside the purchase. What if instead, we thought of the pricing decision and the ship date decision as events that happened before the purchase? and the purchase recorded that those events definitely happened before the purchase event?
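One way to picture it (the event shapes and field names here are mine, not any standard): the purchase event carries references to the decision events that came before it.

```python
# Hypothetical event shapes: decisions are events in their own right, and the
# purchase records that they happened before it.
pricing_decision = {
    "event_id": "price-123",
    "type": "PriceQuoted",
    "item": "teak sideboard",
    "price": 1200,
}
ship_date_decision = {
    "event_id": "ship-456",
    "type": "ShipDateEstimated",
    "estimate": "today",
}
purchase = {
    "event_id": "purchase-789",
    "type": "PurchaseConfirmed",
    "customer": "Justyna",
    "happened_after": ["price-123", "ship-456"],  # explicit causal links
}
```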

We would be working toward establishing a different kind of event ordering. Did Justyna’s purchase happen before Edith’s? We can’t really say; they were at different locations, and neither influenced the other. That pricing decision though, that did influence Justyna’s purchase, so the price decision happened before the purchase.

This allows us to construct a more flexible ordering, something wider than a line.

Causal ordering

Consider a git history. By default, git log prints a line of commits as if they happened in that order — a total ordering.

But that’s not reality. Some commits happen before others: each commit I make is based on its parent, and every parent of that parent commit, transitively. So the parent happened before mine. Meanwhile, you might commit to a different branch. Whether my commit happened before yours is irrelevant. The merge commit brings them together; both my commit and yours happen before the merge commit, and after the parent commit. There’s no need for a total ordering here. The graph expresses that.

This is a causal ordering. It doesn’t care so much about clock time. It cares what commits I worked from when I made mine. I knew about the parent commit, I started from there, so it’s causal. Whatever you were doing on your branch, I didn’t know about it, it wasn’t causal, so there is no “before” or “after” relationship to yours and mine.

We can see the causal ordering clearly, because git tracks it: each commit knows its parents. The cause of each commit is part of the data in the commit.
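As a toy model of that idea (not git’s actual implementation): if each event records its parents, then “happened before” is just reachability in the graph.

```python
# Toy model of git-style causal ordering: each commit knows its parents.
parents = {
    "merge": ["mine", "yours"],
    "mine": ["base"],
    "yours": ["base"],
    "base": [],
}

def happened_before(a, b):
    """True if a is an ancestor of b, i.e. a is causal for b."""
    frontier = list(parents[b])
    while frontier:
        c = frontier.pop()
        if c == a:
            return True
        frontier.extend(parents[c])
    return False

happened_before("base", "mine")   # True: mine was built on top of base
happened_before("mine", "yours")  # False: no causal relationship either way
```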

Back to our retail example. If we record each event along with the events that caused it, then we can make a graph with enough of a causal ordering.

There are two reasons we want an ordering here: external consistency and internal legibility.

External Consistency

External consistency means that Justyna’s experience remains true. Some events are messages from our software system to Justyna (the price is $), and others are messages coming in (Confirm Purchase, Cancel Purchase). The sequence of these external interactions constrains any event ordering we choose. Messages crossing the system boundary must remain valid.

Here’s a more constricting example of external consistency: when someone runs a report and sees a list of transactions for the day, that’s an external message. That message is caused by all the transactions reported in it. If another transaction comes in late, it must be reported later as an amendment to that original report — whereas, if no one had run the report for that day yet, it could be lumped in with the other ones. No one needs to know that it was slow, if no one had looked.

Have you ever run a report, sent the results up the chain, and then had the central office accuse you of fudging the numbers because they run the same report (weeks later) and see different totals? This happens in some organizations, and it’s a violation of external consistency.

Internal Legibility

Other causal events are internal messages: we displayed this price because the pricing system sent us a particular message. The value of retaining causal information here is troubleshooting, and figuring out how our system works.

I’m using the word “legibility”[1] in the sense of “understandability”: as people, we have visibility into the system’s workings; we can follow along with what it’s doing, distinguish its features, locate problems, and change it.

 If Justyna’s purchase event is caused by a ship date decision, and the ship date decision (“today”) tracked its causes (“the inventory system says we have one, with more arriving today”), then we can construct a causal ordering of events. If Edith’s purchase event tracked a ship date decision (“today”) which tracked its causes (“the inventory system says we have zero, with more arriving today”), then we can track a problem to its source. If in reality we only send one today, then it looks like the inventory system’s shipment forecasts were inaccurate.

How would we even track all this?

The global solution to causal ordering is: for every message sent by a component in the system, record every message received before that. Causality at a point-in-time-at-a-point-in-space is limited to information received before that point in time, at that point in space. We can pass this causal chain along with the message.

“Every message received” is a lot of messages. Before Justyna confirmed that purchase, the client component received oodles of messages, from search results, from the catalog, from the ad optimizer, from the review system, from the similar-purchases system, from the categorizer, many more. The client received and displayed information about all kinds of items Justyna did not purchase. Generically saying “this happened before, therefore it can be causal, so we must record it ALL” is prohibitive.

This is where business logic comes in. We know which of these are definitely causal. Let’s pass only those along with the message.
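A sketch of what that might mean in practice (the message shapes are invented): the client received many messages, but business logic declares only a few of them causal, and only those ride along with the purchase.

```python
# Hypothetical sketch: keep only the causes the business cares about.
received = [
    {"id": "price-123", "type": "PriceQuoted"},
    {"id": "ship-456", "type": "ShipDateEstimated"},
    {"id": "ad-971", "type": "AdDisplayed"},
    {"id": "rev-204", "type": "ReviewsShown"},
]
CAUSAL_TYPES = {"PriceQuoted", "ShipDateEstimated"}

purchase_message = {
    "type": "PurchaseConfirmed",
    "customer": "Justyna",
    "causes": [m for m in received if m["type"] in CAUSAL_TYPES],
}
```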

There are others that might be causal. The ad optimizer team probably does want to know which ads Justyna saw before her purchase. We can choose whether to include that with the purchase message, or to reconstruct an approximate timeline afterward based on clocks in the client or in the components that persist these events. For something as aggregated as ad optimization, approximate is probably good enough. This is a business tradeoff between accuracy and decoupling.

Transitive causality

How deep is the causal chain passed along with a message?

We would like to track backward along this chain. When we don’t like the result of Justyna and Edith’s purchase fulfillment, we trace it back. Why did the inventory system say the ship date would be today in both cases? This decision is an event, with causes of “The current inventory is 1” and “Normal turnover for this item is less than 1 per day”; or “The current inventory is 0” and “a shipment is expected today” and “these shipments usually arrive in time to be picked the same day.” From there we can ask whether the decision was valid, and trace further to learn whether each of these inputs was correct.

If every message comes with its causal events, then all of this data is part of the “Estimated ship date today” sent from the inventory system to the client. Then the client packs all of that into its “Justyna confirmed this purchase” event. Even with slimmed-down, business-logic-aware causal listings, messages get big fast.

Alternatively, the inventory system could record its decision, and pass a key with the message to the client; then the client only needs to retain that key. Recording every decision means a bunch of persistent storage, but it doesn’t need to be fast-access. It’d be there for troubleshooting, and for aggregate analysis of system performance. Recording decisions along with the information available at the time lets us evaluate those decisions later, when outcomes are known.
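Here is a sketch of that second option (all the names are invented): the inventory system persists the full decision record, and only the key travels in the message.

```python
# Hypothetical sketch: store the full decision, send only its key.
decision_store = {}  # stands in for durable, not-necessarily-fast storage

def record_decision(decision_id, decision, inputs, code_version):
    decision_store[decision_id] = {
        "decision": decision,
        "inputs": inputs,
        "code_version": code_version,
    }
    return decision_id

key = record_decision(
    "ship-456",
    decision="ship today",
    inputs={"inventory": 0, "shipment_expected": "today"},
    code_version="inventory-service 3.4.1",
)

message_to_client = {"estimated_ship_date": "today", "decision_key": key}
# Later, troubleshooting looks up decision_store[key] to see exactly what the
# inventory system knew when it made the call.
```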

Incrementalness

A system component that chooses to retain causality in its events has two options: repeat causal inputs in the messages it sends outward; or record the causal inputs and pass a key in the messages it sends outward.

Not every system component has to participate. This is an idea that can be rolled out gradually. The client can include in the purchase event as much as it knows: the messages it received, decisions it made, and relevant messages sent outward before this incoming “Confirm Purchase” message was received from Justyna. That’s useful by itself, even when the inventory system isn’t yet retaining its causalities.

Or the inventory system could record its decisions, the code version that made them, and the inputs that contributed to them, even though the client doesn’t retain the key it sends in the message. It isn’t as easy to find the decision of interest without the key, but it could still be possible. And some aggregate decision evaluation can still happen. Then as other system components move toward the same architecture, more benefits are realized.

Conscious Causal Ordering

The benefits of a single, linear ordering of events are consistency, legibility, and visibility into what might be causal. A nonlinear causal ordering gives us more flexibility, consistency, a more accurate but less simplified legibility, and clearer visibility into what might be causal. Constructing causal ordering at the generic level of “all messages received cause all future messages sent” is expensive and also less meaningful than a business-logic-aware, conscious causal ordering. This conscious causal ordering gives us external consistency, accurate legibility, and visibility into what we know to be causal.

At the same time, we can have provenance for data displayed to the users or recorded in our databases. We can know why each piece of information is there, and we can figure out what went wrong, and we can trace all the data impacted by an incorrect past event.

I think this is something we could do, it’s within our ability today. I haven’t seen a system that does it, yet. Is it because we don’t care enough — that we’re willing to say “yeah, I don’t know why it did that, can’t reproduce, won’t fix”? Is it because we’ve never had it before — if we once worked in a system with this kind of traceability, would we refuse to ever go back?


[1] This concept of “legibility” comes from the book Seeing Like a State.

Tradeoffs in Coordination Among Teams

The other day in Budapest, Jez Humble and I wondered, what is the CAP theorem for teams? In distributed database systems, the CAP theorem says: you get at most two of Consistency, Availability, and Partition tolerance, and you don’t really get to choose, because partitions will happen.
Consider a system for building software together. Unless the software is built by exactly one person, we have to accept Partitions. We can’t meld minds, and talking is slow.
In databases we choose between Consistency (the data is the same everywhere) and Availability (we can always get the data). As teams grow, we choose between Consensus (doing things for the same reasons in the same way) and Actually-getting-things-done.
Or, letting go of the CAP acronym: we balance Moving Together against Moving Forward.

Moving Together

A group of 1 is the trivial case. Decision-making is the same as consensus. All work is forward work, but output is very limited, and when one person is sick everything stops.
A group of 2-7 is ideal: communication comes with an interplay of ideas, and the new outputs of dialogue make up for the time cost of talking to each other. It is still feasible for everyone in the group to have a mental model of each other person, to know what that person needs to know. Consensus is easy to reach when every stakeholder is friends with every other stakeholder.
Beyond one team, the tradeoffs begin. Take one team of 2-7 people working closely together. Represent their potential output with this tall, hollow arrow pointing up.
This team is building software to run an antique store. Look at them go, full forward motion. (picture: tall, filled arrow.)
Next we add more to the web site while continuing development on the register point-of-sale tools. We break into two teams. We’re still working with the same database of items, and building the same brand, so we coordinate closely. We leverage each others’ tools. More people means more coordination overhead, but we all like each other, so it’s not much burden. We are a community, after all.
A green arrow and a red arrow, each connected by many lines of communication, are filled about halfway up with work.
Now the store is doing well. The web site attracts more retail business, the neighboring antique stores want to advertise their items on our site, everything is succeeding and we add more people. A team for partnerships, which means we need externally-facing reports, which means we need a data pipeline.
A purple arrow and a blue arrow join the red and green ones. Lines crisscross between them, a snarly web. The arrows are filled only a little because of these coordination costs. The purple arrow is less connected, and a bit more full, but it’s pointed thirty degrees to the left.
The same level of consensus and coordination isn’t practical anymore. Coordination costs weigh heavily. New people coming in don’t get to build a mental model of everyone who already works there. They don’t know what other people know, or which other people need to know something. If the partnerships team touches the database, it might break point of sale or the web site, so they are hamstrung. Everyone needs to check everything, so the slowest-to-release team sets the pace. The purple team here is spending less time on coordination, so the data pipeline is getting built, but without any ties to the green team, it’s going in a direction that won’t work for point of sale.
This mess scales up in the direction of mess. How do we scale forward progress instead?

Moving Forward

The other extreme is decoupling. Boundaries. A very clear API between the data pipeline, point of sale, and web. Separate databases, duplicating data when necessary. This is a different kind of overhead: more technical, less personal. Break the back-end coupling at the database; break the front-end (API) coupling with backwards compatibility. Teams operate on their own schedules, releases are not coordinated. This is represented by wider arrows, because backwards compatibility and graceful degradation are expensive. 
Four arrows, each wide and short. A few lines connect them. They’re filled, but the work went to width (solidness) rather than height (forward progress).
These teams are getting about as far as the communication-burdened teams. The difference is: this does scale out. We can add more teams before coordination becomes a limitation again.
Amazon is an extreme example of this: backwards compatible all the things. Each team Moving Forward in full armor. Everything fully separate, so no team can predict what other teams depend on. This made the AWS products possible. However, this is a ton of technical overhead, and maybe also not the kindest culture to work in.
Google takes another extreme. Their monorepo allows more coupling between teams. Libraries are shared. They make up for this with extreme tooling. Tests, refactoring tools, custom version control and build systems — even whole programming languages. Thousands of engineers work on infrastructure at Google, so that they can Move Together using technical overhead.

Balance

For the rest of us, in companies with 7-1000 engineers, we can’t afford one extreme or the other. We have to ask: where is consensus important? and where is consensus holding us back?
Consensus is crucial in objectives and direction. We are building the same business. The business results we are aiming for had better be the same. We all need to agree on “Which way is up?”
Consensus is crippling at the back end. When we require any coordination of releases. When I can’t upgrade a library without impacting other teams in ways I can’t predict. When my database change could break a system more production-critical than mine. This is when we are paralyzed. Don’t make teams share databases or libraries.
What about leveraging shared tools and expertise? If every team runs its own database, those arrows get really wide really fast, unless they skimp on monitoring and redundancy. So they will skimp, and the system will be fragile. We don’t want to reinvent everything in every team.
The answer is to have a few wide arrows. Shared tools are great when they’re maintained as internal services, by teams with internal customers. Make the data pipeline serve the partnership and reporting teams. Make a database team to supply well-supported database instances to the other teams. (They’re still separate databases, but now we have shared tools to work with them, and hey look, a data pipeline for syncing between them.)
The green, red, and blue arrows are narrow and tall, and mostly full of work, with some lines connecting them. The purple arrow and a new black arrow are wide and short and full of work. The wide arrows (internal services) are connected to the tall arrows (product teams) through their tips.
Reuse helps only when there is a solid API, when there is no coupling of schedules, and when the providing team focuses on customer service.

Conclusions

Avoid shared code libraries, unless you’re Google and have perfect test coverage everywhere, or you’re Amazon and have whole teams supporting those libraries with backwards compatibility.
Avoid shared database instances, but build internal teams around supporting common database tools.
Encourage shared ideas. Random communication among people across an organization has huge potential. Find out what other teams are doing, and that can refine your own direction and speed your development — as long as everything you hear is information, not obligation.
Reach consensus on what we want to achieve, why we are doing it, and how (at a high level) we might achieve it. Point in the same direction, move independently.
Every organization is a distributed system, even when we sit right next to each other. Coordination makes joint activity possible, but not free. Be conscious of the tradeoffs as your organization grows, as consensus becomes less useful and more expensive. Recognize that new people entering the organization experience higher coordination costs than you do. Let teams move forward in their own way, as long as we move together in a common direction. Distributed systems are hard, and we can do it.


Bonus material 

Here is a picture of Jez in Budapest.

And here is a paper about coordination costs:
Common Ground and Coordination in Joint Activity