Capturing the World in Software

TL;DR – we can get a complete, consistent model of a small piece of the world using Event Sourcing. This is powerful but expensive.

Today on Twitter, Jimmy Bogard on the tradeoffs of Event Sourcing:

If event sourcing is not scalable, faster, or simpler, why use it?

Event Sourcing gives you a complete, consistent model of the slice of the world modeled by your software. That’s pretty attractive.

We want to model the real world in software.

You can think about the present world as a sum of everything that happened before. Looking around my room, I can say that my bookshelf is the sum of various purchases, some moving around, a set of decisions about what to read and what to keep.

my bookshelf has philosophy, math, visualization, and a hippo

I can think of myself as the sum of everything that has happened, plus the stories I told myself about that. My next action is an outcome of this, plus my present surroundings, plus incoming events. That action itself is an event in the world.

In life, in biology, we don’t get to see all these inputs. We don’t get to change the response algorithm and try again. But in software, we can!

Of course we want perfect modeling and traceability of decisions! This way we can always answer “why,” and we can improve our understanding and decision-making strategies as we learn.

This is what Event Sourcing offers.

We want our model to be complete and consistent.

It’s impossible to model the entire world. Completeness and consistency are in conflict, sadly. Still, if we limit “complete” to a business domain, and to the boundaries of our company, this is possible. Theoretically.

Event Sourcing offers a way to do that.

In event sourcing, every piece of input is an event. Someone requests a counseling appointment, event. Provider signs up for available hours, event. Appointment scheduled, event. Customer notified, event. Customer shows up, event. Session report filed, event.

We can sum past events to get the current state

Skim the timeline of all events for the relevant ones. Sum these up (there are other definitions of “sum” besides adding numbers). From this we calculate the state of the world.

From appointment-was-scheduled events, we construct a provider’s calendar for the day.
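As a minimal sketch (the event shapes here are hypothetical, not from any particular system), that “sum” is a fold over the stream:

```typescript
// Hypothetical event shapes for the counseling-center example.
type AppointmentEvent =
  | { type: "AppointmentScheduled"; providerId: string; customerId: string; at: string }
  | { type: "AppointmentCancelled"; providerId: string; customerId: string; at: string };

// "Summing" the events: fold the stream into a current state,
// here a provider's calendar for one day.
function calendarFor(providerId: string, day: string, events: AppointmentEvent[]): string[] {
  const slots = new Set<string>();
  for (const e of events) {
    if (e.providerId !== providerId || !e.at.startsWith(day)) continue;
    if (e.type === "AppointmentScheduled") slots.add(`${e.at} ${e.customerId}`);
    if (e.type === "AppointmentCancelled") slots.delete(`${e.at} ${e.customerId}`);
  }
  return [...slots].sort();
}
```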

At the end of the month, we construct reports on customers served and provider utilization. Based on that, we might seek more providers or have a talk with the less active ones. Headquarters ranks the performance of our office compared with others.

We need to allow corrections

To accurately model the real world, we need to allow for all the stuff that happens in the real world.

Appointments are cancelled. Customers don’t show up. Session reports are filed late. (“Where’s that session report from last week?” “Oh right, they were too late, because the gate to the parking lot malfunctioned. Don’t charge them for it.”)

Data is late or lost. If you insist that this doesn’t happen (“Every provider must enter the session reports by the end of the day”) then your model is blind to reality. The weather turns bad, people go home. There’s a bomb threat, or an active shooter. Reality intrudes.

Events outside your careful model will happen. Accommodate corrections, incorporate events that arrive late, accept partial data. The more of reality you allow into your model, the more accurate it can be.

We can evaluate past decisions based on the information available at the time

When data arrives late, reports change after they are printed. An event sourced system handles this.

As new data comes in about past days, it gets summed in with the data about those days. Reports get more accurate.

A friend of mine works at a counseling center, and he gets calls from headquarters like “Why is your utilization so low for December?” and he’s like “What? It was fine” and then he runs the report again and sure enough, it’s different. After he ran the report, more data about December came in, and now the totals are different. He can’t reproduce the reports he saw, which makes it hard to explain his actions to HQ.

If their software used event sourcing, he could say, “Please run the report as of January 2, and you’ll see why I didn’t take any action.”

Each event records a received timestamp, for when we learned about it, and an effective timestamp, for the real-world happening it represents. Then the software can sum only the events received before January 2 to reproduce the report as it was seen that day.
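A small sketch of that filter (field names and example dates are made up for illustration): keep only the events received before the as-of date, then sum as usual.

```typescript
// Two timestamps per event: when it happened, and when we learned about it.
// Field names are illustrative.
interface SessionEvent {
  effectiveAt: string; // the real-world happening, e.g. "2023-12-14"
  receivedAt: string;  // when the system recorded it, e.g. "2024-01-05"
}

// Reproduce the December report exactly as it looked on January 2:
// sum only what we had received by then.
function sessionsInMonth(events: SessionEvent[], month: string, asOf: string): number {
  return events.filter(
    (e) => e.effectiveAt.startsWith(month) && e.receivedAt <= asOf
  ).length;
}

// "Run the report as of January 2": sessionsInMonth(allEvents, "2023-12", "2024-01-02")
```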

We can re-evaluate the world with new logic

Not only can an event-sourced system reproduce the same report as on an earlier day; we can also ask: what if we changed the report logic? Then what would it look like?

Maybe we want to report unreported appointments as “possibly cancelled” to reflect uncertainty. We can run the new logic against the same events and compare it to the old results.

This means we can run tests against the event stream and detect behavior changes.
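For instance (a sketch, with placeholder report types): replay the same event stream through the old and the new report logic and diff the outputs.

```typescript
// Replay one event stream through two versions of the report logic and compare.
// `Report` and the logic functions are placeholders for whatever the domain defines.
type Report = Record<string, number>;

function behaviorChanges<E>(
  events: E[],
  oldLogic: (events: E[]) => Report,
  newLogic: (events: E[]) => Report
): string[] {
  const before = oldLogic(events);
  const after = newLogic(events);
  const keys = new Set([...Object.keys(before), ...Object.keys(after)]);
  return [...keys]
    .filter((k) => before[k] !== after[k])
    .map((k) => `${k}: ${before[k] ?? "absent"} -> ${after[k] ?? "absent"}`);
}
```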

We need to record externally-visible decisions for consistency

When we change the software, we endanger consistency.

If we update the report logic in February, then when HQ runs the report “as of January 2” they’ll see something different than my friend saw when he ran it on that date. For consistency, both the data and code need to match what existed on January 2.

Or, we can model the report itself as an event. “On January 2, I said this about December.” Then we can incorporate that into the reporting logic.

Anything our system does that is visible to the outside world is itself an event, because it changes the way external people and software act. To reproduce our behavior consistently, our system can either record its own behavior, or retain all the data and the code that went into choosing it.
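One way to record that behavior, sketched with a made-up event shape: the report we sent becomes an event in its own right.

```typescript
// "On January 2, I said this about December." The report itself is an event.
// Shape and field names are illustrative.
interface ReportProduced {
  type: "ReportProduced";
  producedAt: string;             // when we said it
  period: string;                 // what it was about
  totals: Record<string, number>; // what we actually told the outside world
}
```

With that on record, HQ can be shown exactly what was said on January 2, instead of regenerating the report with February’s code and whatever data has arrived since.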

So far, this is nice and deterministic. But the real world isn’t.

Reproducing behavior is possible in an event-sourced system, if that behavior is deterministic. In human behavior, we don’t get that luxury. Our choices come from many influences, some of them contradictory. One tweet inspired me to write this article. Thousands of other tweets distract me from it.

Conflicting information comes in from real life.

Event sourcing gets tricky when the real world we are modeling is inconsistent, according to the events that come in.

Now say we’re a shipping company. We model the movement of goods in containers as they move across the world. It is an event when a container is loaded on a ship, and an event when it is unloaded. An event when a ship’s itinerary is scheduled, and when it arrives at each port.

One event says that container 1418 was loaded onto the vessel Enceladus in Auckland. Another event says that Enceladus is scheduled for its next stop in Beijing. Another event says that container 1418 was unloaded in San Francisco. Another says that container 1418 was emptied in Beijing. Which do you believe?

This example comes from a real story. Weird things happen. Does your system let people report reality? Is there a fallback for “Ask a person to go look for that container. Is it really 1418?”

Decisions made in ambiguity are events

Whatever decision the system makes, it needs to record that as an event. Perhaps that shows up as a footnote in reports about Enceladus, Beijing, and San Francisco. Does anybody hear about it in Auckland?

We can see the provenance of each report and decision

If some report comes out uneven, and that feeds back to the development team as a bug, then event sourcing gives us excellent tools for tracking it down.

Each “I made this decision” or “I produced this report” event can record the set of events that were input, and the version of code that ran to produce the output. You can have complete provenance.
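A sketch of such a decision event (field names are hypothetical):

```typescript
// A decision that carries its own provenance: which events it was based on,
// and which version of the code produced it. Field names are illustrative.
interface DecisionMade {
  type: "DecisionMade";
  decisionId: string;
  madeAt: string;
  codeVersion: string;     // e.g. a git SHA or build number of the logic that ran
  inputEventIds: string[]; // the events summed into the state the decision used
  outcome: string;         // what was decided or reported
}
```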

This kind of software is accountable. It can tell the story of its decisions, what it did and why. What its world was like at that time.

This is a beautiful property. With full provenance, we can understand what happened. We can tell the story to each other. With replayability, we can change the code and see whether we can improve it for next time.

Recording everything gets ridiculous

Yet, data about provenance gets big very quickly. Each report consumed thousands of events. Each decision that was based on a current-state sum of events now has a dependency on all of those past events, plus the code that defines the current state, plus all the other states it took input from, plus their code and set of events.

Meanwhile some of those events are old, and no longer fit the format expected by the latest code. Meanwhile, we’re still ignoring everything that happened outside the system, so we’re completely blind to a lot of causality. “A person clicked this button.” Why? What information did they see on the screen as input to their decision to click “Container 1418 is in San Francisco”?

In real life, most information is lost. History will never be fully written; the writing is itself history. We’re always generating new actions. The system could theoretically report on all the reports it has reported. It never ends.

Completeness is limited to very small systems. Be careful where you invest this effort. Consciously select the boundaries, outside of which you don’t know what happened. You don’t know what really happened in the shipyard, or in a person’s head, or in the software that another company runs. The slice of the world we see is tiny.

Provenance is precious but difficult. Then again, it is at least as hard to do well in designs other than event sourcing. The painful realities that make event sourcing hard are painful in other models, too.

There are reasons we don’t model the whole world.

Event sourcing makes a best effort to model the world in its fullness. We try to remember everything significant that happens, sum that up into our own current-state world in the software, make decisions and act.

But events come in out of order. Events are lost. Events contradict each other. Events have partial data, or old data formats. Logic changes. We can’t remember everything.

Sometimes it pays to think about what you would do in an event-sourced system, and then implement just enough of that. Keep copies of produced reports, so that people can retrieve them without re-generating them. Record difficult decisions in a place that lives longer than logs.

Event sourcing is powerful. But it is not easy. Expect to think really hard about edge cases you didn’t want to handle. Expect to deal with storage and speed and up-to-dateness tradeoffs. Allow a human to enter corrections, because the real world will always surprise you.

In the real world, we don’t have all the information, and that’s OK. We can’t model everything in our heads, because our heads are inside everything. This keeps it interesting.

Distance outside of maps

Distance is seriously strange.

Yesterday on the southern coast of Spain, at an Italian restaurant run by a German, I had tiramisu, because I’ve never had tiramisu this close to Italy. People laughed, because Spain is farther from Italy than Germany or Poland (geographically) – but for food purposes it’s closer, right?

Geographic distance is so nice, on a map, so clear and measurable.
And it’s almost never relevant.

Sydney is farther from SF than SF is from Sydney, by 2 hours of flying, because of wind.
St Louis is farther than San Francisco from Europe, because there are direct flights to SF.

Today in Frankfurt I went from A gates to Z gates. Sounds far! … except on the map, Z is right on top of A. Which does not make it close, because the path from A to Z goes through passport control.


Forget maps. They’re satisfying, fun, and deceptive, because they give us the feeling we understand distance.

Distance is fluid, inconstant. Gates are closer when the sidewalk is moving, farther when people are bunched up and slow.

In software systems, distance is all kinds of inconsistent. Networks get slow and computers get farther apart. WiFi goes down and suddenly they’re on another planet.

And here’s the thing about distance: it’s crucial to our understanding of time.
One thing distributed systems can teach us about the real world: there is no time outside of place. There is no ordering of events across space. There is only what we see in a particular spot.

Roland Kuhn spoke at J on the Beach about building reliable systems in the face of fluctuating distance like this. The hardest part is coming up with a consistent (necessarily fictional) ordering of events, so programs can make decisions based on those events.

Humans deal all the time with ambiguity, with “yeah this person said stop over here, and also this machine kept going over there.” We don’t expect the world to have perfect consistency. Yet we wish it did, so we create facsimiles of certainty in our software.

Distributed systems teach us how expensive that is. How limiting.

Martin Thompson’s talk about protocols had real-life advice for collaborating over fluctuating distance. Think carefully about how we will interact, make decisions locally, deal with feedback and recover.

Distance is a thing, and it is not simple or constant. Time is not universal, it is always located in space. Humans are good at putting ambiguous situations together, at sensemaking. This is really hard to do in a computer.
Software, in its difficulty, teaches us to appreciate our own skills.

Provenance and causality in distributed systems

Can you take a piece of data in your system and say what version of code put it in there, based on what messages from other systems? and what information a human viewed before triggering an action?

Me neither.

Why is this acceptable? (Because we’re used to it.)
We could make this possible. We could trace the provenance of data. And at the same time, mostly-solve one of the challenges of distributed systems.

Speaking of distributed systems…

In a distributed system (such as a web app), we can’t say for sure what events happened before others. We get into relativity-of-simultaneity complications even at short distances, because information travels through networks at unpredictable speeds. This means there is no one such thing as time, no single sequence of events that says what happened before what. There is time-at-each-point, and inventing a convenient fiction to reconcile them is a pain in the butt.

We usually deal with this by funneling every event through a single point: a transactional database. Transactions prevent simultaneity. Transactions are a crutch.

Some systems choose to apply an ordering after the fact, so that no clients have to wait their turn in order to write events into the system. We can construct a total ordering, like the one that the transactional database is constructing in realtime, as a batch process. Then we have one timeline, and we can use this to think about what events might have caused which others. Still: putting all events in one single ordering is a crutch. Sometimes, simultaneity is legit.

When two different customers purchase two different items from two different warehouses, it does not matter which happened first. When they purchase the same item, it still doesn’t matter – unless we only find one in inventory. And even then: what matters more, that Justyna pushed “Buy” ten seconds before Edith did, or that Edith upgraded to 1-day shipping? Edith is in a bigger hurry. Prioritizing these orders is a business decision. If we raise the time-ordering operation to the business level, we can optimize that decision. At the same time, we stop requiring the underlying system to order every event with respect to every other event.
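As a sketch of raising that ordering to the business level (order fields are invented for the example): when only one item is left, the fulfillment logic picks the winner by business rules, not by whichever write happened to land first.

```typescript
// Choose whom to ship the last item to by business priority, not arrival order.
// Fields are illustrative.
interface Order {
  customer: string;
  placedAt: string;               // when the order arrived, roughly
  shipping: "standard" | "1-day";
}

function whoGetsTheLastOne(a: Order, b: Order): Order {
  if (a.shipping !== b.shipping) return a.shipping === "1-day" ? a : b; // urgency wins
  return a.placedAt <= b.placedAt ? a : b;                              // then arrival time
}
```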

On the other hand, there are events that we definitely care happened in a specific sequence. If Justyna cancels her purchase, that was predicated on her making it. Don’t mix those up. Each customer saw a specific set of prices, a tax amount, and an estimated ship date. These decisions made by the system caused (in part) the customer’s purchase. They must be recorded either as part of the purchase event, or as events that happened before the purchase.

Traditionally we record prices and estimated ship date as displayed to the customer inside the purchase. What if instead, we thought of the pricing decision and the ship date decision as events that happened before the purchase? and the purchase recorded that those events definitely happened before the purchase event?
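Sketching that idea (shapes and identifiers invented for illustration): the pricing and ship-date decisions are events of their own, and the purchase records their ids as things that definitely happened before it.

```typescript
// Decisions the system made are events; the purchase names them as its causes.
interface PriceQuoted { type: "PriceQuoted"; eventId: string; sku: string; price: number }
interface ShipDateQuoted { type: "ShipDateQuoted"; eventId: string; sku: string; date: string }

interface PurchaseConfirmed {
  type: "PurchaseConfirmed";
  eventId: string;
  customer: string;
  sku: string;
  happenedAfter: string[]; // eventIds of the quotes the customer saw before buying
}

const purchase: PurchaseConfirmed = {
  type: "PurchaseConfirmed",
  eventId: "evt-903",      // hypothetical identifiers throughout
  customer: "Justyna",
  sku: "item-42",
  happenedAfter: ["evt-871" /* PriceQuoted */, "evt-872" /* ShipDateQuoted */],
};
```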

We would be working toward establishing a different kind of event ordering. Did Justyna’s purchase happen before Edith’s? We can’t really say; they were at different locations, and neither influenced the other. That pricing decision though, that did influence Justyna’s purchase, so the price decision happened before the purchase.

This allows us to construct a more flexible ordering, something wider than a line.

Causal ordering

Consider a git history. By default, git log prints a line of commits as if they happened in that order — a total ordering.

But that’s not reality. Some commits happen before others: each commit I make is based on its parent, and every parent of that parent commit, transitively. So the parent happened before mine. Meanwhile, you might commit to a different branch. Whether my commit happened before yours is irrelevant. The merge commit brings them together; both my commit and yours happen before the merge commit, and after the parent commit. There’s no need for a total ordering here. The graph expresses that.

This is a causal ordering. It doesn’t care so much about clock time. It cares what commits I worked from when I made mine. I knew about the parent commit, I started from there, so it’s causal. Whatever you were doing on your branch, I didn’t know about it, it wasn’t causal, so there is no “before” or “after” relationship to yours and mine.

We can see the causal ordering clearly, because git tracks it: each commit knows its parents. The cause of each commit is part of the data in the commit.

Back to our retail example. If we record each event along with the events that caused it, then we can make a graph with enough of a causal ordering.
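A sketch of that graph, assuming each event simply stores the ids of the events that caused it, git-style. “Happened before” is then reachability through those links; events with no path between them are simply unordered.

```typescript
// Like git commits and their parents: each event knows the events it was based on.
interface CausalEvent {
  id: string;
  causes: string[]; // ids of the events that caused this one
}

// a happened before b if a is reachable by walking b's causes transitively.
function happenedBefore(a: string, b: string, byId: Map<string, CausalEvent>): boolean {
  const toVisit = [...(byId.get(b)?.causes ?? [])];
  const seen = new Set<string>();
  while (toVisit.length > 0) {
    const id = toVisit.pop()!;
    if (id === a) return true;
    if (seen.has(id)) continue;
    seen.add(id);
    toVisit.push(...(byId.get(id)?.causes ?? []));
  }
  return false; // not a cause: either later, or concurrent (no ordering at all)
}
```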

There are two reasons we want an ordering here: external consistency and internal legibility.

External Consistency

External consistency means that Justyna’s experience remains true. Some events are messages from our software system to Justyna (the price is $), and others are messages coming in (Confirm Purchase, Cancel Purchase). The sequence of these external interactions constrains any event ordering we choose. Messages crossing the system boundary must remain valid.

Here’s a more constricting example of external consistency: when someone runs a report and sees a list of transactions for the day, that’s an external message. That message is caused by all the transactions reported in it. If another transaction comes in late, it must be reported later as an amendment to that original report — whereas, if no one had run the report for that day yet, it could be lumped in with the other ones. No one needs to know that it was slow, if no one had looked.
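A sketch of that rule, with invented event names: a late transaction becomes an amendment only if a report for that day has already gone out.

```typescript
// If the day's report has already been seen, the late transaction must arrive
// as an amendment; if nobody has looked yet, it just sums into the original.
// Event names are illustrative.
type ReportEvent =
  | { type: "ReportProduced"; day: string; reportId: string }
  | { type: "ReportAmended"; amends: string; lateTransactionId: string };

function recordLateTransaction(
  day: string,
  transactionId: string,
  pastOutputs: ReportEvent[]
): ReportEvent | null {
  for (const e of pastOutputs) {
    if (e.type === "ReportProduced" && e.day === day) {
      return { type: "ReportAmended", amends: e.reportId, lateTransactionId: transactionId };
    }
  }
  return null; // no report for that day has been seen; nothing external to amend
}
```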

Have you ever run a report, sent the results up the chain, and then had the central office accuse you of fudging the numbers because they run the same report (weeks later) and see different totals? This happens in some organizations, and it’s a violation of external consistency.

Internal Legibility

Other causal events are internal messages: we displayed this price because the pricing system sent us a particular message. The value of retaining causal information here is troubleshooting, and figuring out how our system works.

I’m using the word “legibility”[1] in the sense of “understandability”: as people, we have visibility into the system’s workings; we can follow along with what it’s doing, distinguish its features, locate problems, and change it.

 If Justyna’s purchase event is caused by a ship date decision, and the ship date decision (“today”) tracked its causes (“the inventory system says we have one, with more arriving today”), then we can construct a causal ordering of events. If Edith’s purchase event tracked a ship date decision (“today”) which tracked its causes (“the inventory system says we have zero, with more arriving today”), then we can track a problem to its source. If in reality we only send one today, then it looks like the inventory system’s shipment forecasts were inaccurate.

How would we even track all this?

The global solution to causal ordering is: for every message sent by a component in the system, record every message received before that. Causality at a point-in-time-at-a-point-in-space is limited to information received before that point in time, at that point in space. We can pass this causal chain along with the message.

“Every message received” is a lot of messages. Before Justyna confirmed that purchase, the client component received oodles of messages, from search results, from the catalog, from the ad optimizer, from the review system, from the similar-purchases system, from the categorizer, many more. The client received and displayed information about all kinds of items Justyna did not purchase. Generically saying “this happened before, therefore it can be causal, so we must record it ALL” is prohibitive.

This is where business logic comes in. We know which of these are definitely causal. Let’s pass only those along with the message.

There are others that might be causal. The ad optimizer team probably does want to know which ads Justyna saw before her purchase. We can choose whether to include that with the purchase message, or to reconstruct an approximate timeline afterward based on clocks in the client or in the components that persist these events. For something as aggregated as ad optimization, approximate is probably good enough. This is a business tradeoff between accuracy and decoupling.

Transitive causality

How deep is the causal chain passed along with a message?

We would like to track backward along this chain. When we don’t like the result of Justyna and Edith’s purchase fulfillment, we trace it back. Why did the inventory system say the ship date would be today in both cases? This decision is an event, with causes of “The current inventory is 1” and “Normal turnover for this item is less than 1 per day”; or “The current inventory is 0” and “a shipment is expected today” and “these shipments usually arrive in time to be picked the same day.” From there we can ask whether the decision was valid, and trace further to learn whether each of these inputs was correct.

If every message comes with its causal events, then all of this data is part of the “Estimated ship date today” sent from the inventory system to the client. Then the client packs all of that into its “Justyna confirmed this purchase” event. Even with slimmed-down, business-logic-aware causal listings, messages get big fast.

Alternately, the inventory system could record its decision, and pass a key with the message to the client, and then the client only needs to retain that key. Recording every decision means a bunch of persistent storage, but it doesn’t need to be fast-access. It’d be there for troubleshooting, and for aggregate analysis of system performance. Recording decisions along with the information available at the time lets us evaluate those decisions later, when outcomes are known.
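A sketch of that alternative (names and version stamps invented): the inventory system persists the whole decision under a key and sends only the key along with its answer.

```typescript
// Record the full decision (inputs, code version, outcome) durably, and keep
// the outgoing message small: just the conclusion plus a key for later lookup.
// Names are illustrative.
interface ShipDateDecision {
  key: string;
  codeVersion: string;
  inputs: string[];  // e.g. "inventory=0", "shipment expected today"
  shipDate: string;
}

const decisionStore = new Map<string, ShipDateDecision>(); // stand-in for durable storage

function decideShipDate(inputs: string[], shipDate: string): { shipDate: string; decisionKey: string } {
  const decision: ShipDateDecision = {
    key: `ship-date-${decisionStore.size + 1}`,
    codeVersion: "inventory-svc@4f2a9c1", // hypothetical version stamp
    inputs,
    shipDate,
  };
  decisionStore.set(decision.key, decision);
  return { shipDate, decisionKey: decision.key }; // the message carries only the key
}
```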

Incrementalness

A system component that chooses to retain causality in its events has two options: repeat causal inputs in the messages it sends outward; or record the causal inputs and pass a key in the messages it sends outward.

Not every system component has to participate. This is an idea that can be rolled out gradually. The client can include in the purchase event as much as it knows: the messages it received, decisions it made, and relevant messages sent outward before this incoming “Confirm Purchase” message was received from Justyna. That’s useful by itself, even when the inventory system isn’t yet retaining its causalities.

Or the inventory system could record its decisions, the code version that made them, and the inputs that contributed to them, even though the client doesn’t retain the key it sends in the message. It isn’t as easy to find the decision of interest without the key, but it could still be possible. And some aggregate decision evaluation can still happen. Then as other system components move toward the same architecture, more benefits are realized.

Conscious Causal Ordering

The benefits of a single, linear ordering of events are consistency, legibility, and visibility into what might be causal. A nonlinear causal ordering gives us more flexibility, consistency, a more accurate but less simplified legibility, and clearer visibility into what might be causal. Constructing causal ordering at the generic level of “all messages received cause all future messages sent” is expensive and also less meaningful than a business-logic-aware, conscious causal ordering. This conscious causal ordering gives us external consistency, accurate legibility, and visibility into what we know to be causal.

At the same time, we can have provenance for data displayed to the users or recorded in our databases. We can know why each piece of information is there, and we can figure out what went wrong, and we can trace all the data impacted by an incorrect past event.

I think this is something we could do, it’s within our ability today. I haven’t seen a system that does it, yet. Is it because we don’t care enough — that we’re willing to say “yeah, I don’t know why it did that, can’t reproduce, won’t fix”? Is it because we’ve never had it before — if we once worked in a system with this kind of traceability, would we refuse to ever go back?


[1] This concept of “legibility” comes from the book Seeing Like a State.