Capturing the World in Software

TL;DR – we can get a complete, consistent model of a small piece of the world using Event Sourcing. This is powerful but expensive.

Today on twitter, Jimmy Bogard on the tradeoffs of Event Sourcing:

If event sourcing is not scalable, faster, or simpler, why use it?

Event Sourcing gives you a complete, consistent model of the slice of the world modeled by your software. That’s pretty attractive.

We want to model the real world in software.

You can think about the present world as a sum of everything that happened before. Looking around my room, I can say that my bookshelf is the sum of various purchases, some moving around, a set of decisions about what to read and what to keep.

my bookshelf has philosophy, math, visualization, and a hippo

I can think of myself as the sum of everything that has happened, plus the stories I told myself about that. My next action is an outcome of this, plus my present surroundings, plus incoming events. That action itself is an event in the world.

In life, in biology, we don’t get to see all these inputs. We don’t get to change the response algorithm and try again. But in software, we can!

Of course we want perfect modeling and traceability of decisions! This way we can always answer “why,” and we can improve our understanding and decision-making strategies as we learn.

This is what Event Sourcing offers.

We want our model to be complete and consistent.

It’s impossible to model the entire world. Completeness and consistency are in conflict, sadly. Still, if we limit “complete” to a business domain, and to the boundaries of our company, this is possible. Theoretically.

Event Sourcing offers a way to do that.

In event sourcing, every piece of input is an event. Someone requests a counseling appointment, event. Provider signs up for available hours, event. Appointment scheduled, event. Customer notified, event. Customer shows up, event. Session report filed, event.

We can sum past events to get the current state

Skim the timeline of all events for the relevant ones. Sum these up (there are other definitions of “sum” besides adding numbers). From this we calculate the state of the world.

From appointment-was-scheduled events, we construct a provider’s calendar for the day.
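
As a concrete sketch in TypeScript, with event shapes invented for illustration, that construction is a fold from an empty calendar over the relevant events:

```typescript
// A sketch, not a real system: hypothetical event shapes for the
// counseling domain, and a “sum” that folds them into a calendar.
type AppointmentEvent =
  | { type: "AppointmentScheduled"; appointmentId: string;
      providerId: string; date: string; time: string }
  | { type: "AppointmentCancelled"; appointmentId: string };

type Calendar = Map<string, string>; // appointmentId -> time

function calendarFor(
  providerId: string, date: string, events: AppointmentEvent[]
): Calendar {
  const calendar: Calendar = new Map();
  for (const event of events) {
    if (event.type === "AppointmentScheduled" &&
        event.providerId === providerId && event.date === date) {
      calendar.set(event.appointmentId, event.time);
    } else if (event.type === "AppointmentCancelled") {
      calendar.delete(event.appointmentId); // a correction is just another event
    }
  }
  return calendar;
}
```

Note that a cancellation doesn’t rewrite history; it is appended to the stream, and the fold accounts for it.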

At the end of the month, we construct reports on customers served and provider utilization. Based on that, we might seek more providers or have a talk with the less active ones. Headquarters ranks the performance of our office compared with others.

We need to allow corrections

To accurately model the real world, we need to allow for all the stuff that happens in the real world.

Appointments are cancelled. Customers don’t show up. Session reports are filed late. (“Where’s that session report from last week?” “Oh right, they were too late, because the gate to the parking lot malfunctioned. Don’t charge them for it.”)

Data is late or lost. If you insist that this doesn’t happen (“Every provider must enter the session reports by the end of the day”) then your model is blind to reality. The weather turns bad, people go home. There’s a bomb threat, or an active shooter. Reality intrudes.

Events outside your careful model will happen. Accommodate corrections, incorporate events that arrive late, accept partial data. The more of reality you allow into your model, the more accurate it can be.

We can evaluate past decisions based on the information available at the time

When data arrives late, reports change after they are printed. An event sourced system handles this.

As new data comes in about past days, it gets summed in with the data about those days. Reports get more accurate.

A friend of mine works at a counseling center, and he gets calls from headquarters like “Why is your utilization so low for December?” and he’s like “What? It was fine” and then he runs the report again and sure enough, it’s different. After he ran the report, more data about December came in, and now the totals are different. He can’t reproduce the reports he saw, which makes it hard to explain his actions to HQ.

If their software used event sourcing, he could say, “Please run the report as of January 2, and you’ll see why I didn’t take any action.”

Each event records a received timestamp, for when we learned about it, and an effective timestamp, for the real-world happening it represents. Then the software can sum only the events received before January 2 to reproduce the report as it was seen that day.
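
A sketch of that filter, with invented field names:

```typescript
// Two timestamps per event let us rebuild a report exactly as it
// appeared on an earlier day. Field names are illustrative.
interface TimestampedEvent {
  receivedAt: Date;  // when we learned about it
  effectiveAt: Date; // when it happened in the real world
}

declare const allEvents: TimestampedEvent[]; // the full event stream

// The December report, as it looked on January 2: only events about
// December that had already arrived by that day.
const asSeenOnJan2 = allEvents.filter(
  (e) =>
    e.receivedAt <= new Date("2020-01-02") &&
    e.effectiveAt >= new Date("2019-12-01") &&
    e.effectiveAt < new Date("2020-01-01")
);
```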

We can re-evaluate the world with new logic

Not only can an event-sourced system reproduce the same report as on an earlier day; we can also ask: what if we changed the report logic? Then what would it look like?

Maybe we want to report unreported appointments as “possibly cancelled” to reflect uncertainty. We can run the new logic against the same events and compare it to the old results.

This means we can run tests against the event stream and detect behavior changes.
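
Such a test might replay recorded events through both versions of the logic and diff the results; every name in this sketch is hypothetical:

```typescript
// Hypothetical behavior-change test: same events, old and new logic.
type Report = Record<string, number>; // e.g. utilization per provider

declare const decemberEvents: unknown[]; // replayed from the event store
declare function oldReportLogic(events: unknown[]): Report;
declare function newReportLogic(events: unknown[]): Report;

const before = oldReportLogic(decemberEvents);
const after = newReportLogic(decemberEvents);

// Report every value the new logic would change.
for (const key of new Set([...Object.keys(before), ...Object.keys(after)])) {
  if (before[key] !== after[key]) {
    console.log(`${key}: ${before[key]} -> ${after[key]}`);
  }
}
```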

We need to record externally-visible decisions for consistency

When we change the software, we endanger consistency.

If we update the report logic in February, then when HQ runs the report “as of January 2” they’ll see something different from what my friend saw when he ran it on that date. For consistency, both the data and the code need to match what existed on January 2.

Or, we can model the report itself as an event. “On January 2, I said this about December.” Then we can incorporate that into the reporting logic.

Anything our system does that is visible to the outside world is itself an event, because it changes the way external people and software act. To reproduce our behavior consistently, our system can either record its own behavior, or retain all the data and the code that went into choosing it.
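
As a sketch, with invented fields, such an event captures the behavior itself:

```typescript
// Invented shape: our externally visible behavior, recorded as an event.
interface ReportProduced {
  type: "ReportProduced";
  producedOn: Date;       // “On January 2...”
  coveringPeriod: string; // “...I said this about December.”
  output: unknown;        // exactly what the outside world saw
}
```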

So far, this is nice and deterministic. But the real world isn’t.

Reproducing behavior is possible in an event-sourced system, if that behavior is deterministic. In human behavior, we don’t get that luxury. Our choices come from many influences, some of them contradictory. One tweet inspired me to write this article. Thousands of other tweets distract me from it.

Conflicting information comes in from real life.

Event sourcing gets tricky when the real world we are modeling is inconsistent, according to the events that come in.

Now say we’re a shipping company. We model the movement of goods in containers as they travel across the world. It is an event when a container is loaded on a ship, and an event when it is unloaded. An event when a ship’s itinerary is scheduled, and when it arrives at each port.

One event says that container 1418 was loaded onto the vessel Enceladus in Auckland. Another event says that Enceladus is scheduled for its next stop in Beijing. Another event says that container 1418 was unloaded in San Francisco. Another says that container 1418 was emptied in Beijing. Which do you believe?

This example comes from a real story. Weird things happen. Does your system let people report reality? Is there a fallback for “Ask a person to go look for that container. Is it really 1418?”

Decisions made in ambiguity are events

Whatever decision the system makes, it needs to record that as an event. Perhaps that shows up as a footnote in reports about Enceladus, Beijing, and San Francisco. Does anybody hear about it in Auckland?

We can see the provenance of each report and decision

If some report comes out uneven, and that feeds back to the development team as a bug, then event sourcing gives us excellent tools for tracking it down.

Each “I made this decision” or “I produced this report” event can record the set of events that were input, and the version of code that ran to produce the output. You can have complete provenance.
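
A sketch of a decision event that carries its provenance, with all field names assumed:

```typescript
// Assumed shape: a decision event that records where it came from.
interface DecisionMade {
  type: "DecisionMade";
  decisionId: string;
  inputEventIds: string[]; // every event summed into this decision
  codeVersion: string;     // e.g. the git SHA of the logic that ran
  outcome: unknown;        // what we decided or reported
}
```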

This kind of software is accountable. It can tell the story of its decisions, what it did and why. What its world was like at that time.

This is a beautiful property. With full provenance, we can understand what happened. We can tell the story to each other. With replayability, we can change the code and see whether we can improve it for next time.

Recording everything gets ridiculous

Yet, data about provenance gets big very quickly. Each report consumed thousands of events. Each decision that was based on a current-state sum of events now has a dependency on all of those past events, plus the code that defines the current state, plus all the other states it took input from, plus their code and set of events.

Meanwhile some of those events are old, and no longer fit the format expected by the latest code. Meanwhile, we’re still ignoring everything that happened outside the system, so we’re completely blind to a lot of causality. “A person clicked this button.” Why? What information did they see on the screen as input to their decision to click “Container 1418 is in San Francisco”?

In real life, most information is lost. History will never be fully written; the writing is itself history. We’re always generating new actions. The system could theoretically report on all the reports it has reported. It never ends.

Completeness is limited to very small systems. Be careful where you invest this effort. Consciously select the boundaries, outside of which you don’t know what happened. You don’t know what really happened in the shipyard, or in a person’s head, or in the software that another company runs. The slice of the world we see is tiny.

Provenance is precious but difficult. Then again, it is at least as hard to do well in designs other than event sourcing. The painful realities that make event sourcing hard are painful in other models, too.

There are reasons we don’t model the whole world.

Event sourcing makes a best effort to model the world in its fullness. We try to remember everything significant that happens, sum that up into our own current-state world in the software, make decisions and act.

But events come in out of order. Events are lost. Events contradict each other. Events have partial data, or old data formats. Logic changes. We can’t remember everything.

Sometimes it pays to think about what you would do in an event-sourced system, and then implement just enough of that. Keep copies of produced reports, so that people can retrieve them without re-generating them. Record difficult decisions in a place that lives longer than logs.

Event sourcing is powerful. But it is not easy. Expect to think really hard about edge cases you didn’t want to handle. Expect to deal with storage and speed and up-to-dateness tradeoffs. Allow a human to enter corrections, because the real world will always surprise you.

In the real world, we don’t have all the information, and that’s OK. We can’t model everything in our heads, because our heads are inside everything. This keeps it interesting.

Definition of DevOps

Step 1: On one team, put the people with the knowledge and control necessary to change the software, see the results, change it, see the results.

Step 2: Use automation to take extrinsic cognitive load off this team, so that it needs fewer people.

That’s it, that’s DevOps.

Step 1 describes the cultural change that leads to flow. Delivering change requires no handoffs or approvals from outside the team; the impact of change flows back to the team. Act and learn.

Step 2 is where tools come in. If all you do is improve your tooling, well, that helps a little, but it doesn’t get you the qualitative change in flow. That comes from Step 1. The serious value of automation is that it enables Step 1, a single team with all the relevant knowledge.

Our job as developers is making decisions. DevOps gets us the knowledge we need to make good decisions, the authority to implement them, and the feedback to make better ones in the future.

Taking care of code … more and more code

(This is a shorter version of my talk for DeliveryConf, January 2020. Slides)

Good software is still alive.

The other day, I asked my twelve year old daughter for recommendations of drawing programs. She told me about one (FireAlpaca?): “It’s free, and it updates pretty often.” She contrasted that with one that cost money “but isn’t as good. It never updates.”

The next generation appreciates that good software is updated regularly. Anything that doesn’t update is eventually terrible.

Software that doesn’t change falls behind. People’s standards rise, their needs change. At best, old software looks dumb. At worst, it doesn’t run on modern devices.

Software that doesn’t change is dead. You might say, if it still runs, it is not dead. Oh, sure, it’s moving around — but it’s a zombie. If it isn’t learning, it’ll eventually fall over, and it might eat your face.

I want to use software that’s alive. And when I make software, I want it to stay alive as long as it’s in use. I want it to be “done” when it’s out of production.

Software is like people. The only “done” is death.

Alive software belongs to a team.

What’s the alternative? Keep learning to keep living. Software needs to keep improving, at least in small ways, for as long as it is running.

We have to be able to change it, easily. If Customer Service says, “Hey, this text is unclear, can you change it to this?” then pushing that out should be as easy as updating some text. It should not be harder than when the software was in constant iteration.

This requires automated delivery, of course. And you have to know that delivery works. So you have to have run it recently.

But it takes more than that. Someone has to know — or find out quickly — where that text lives. They have to know how to trigger the deployment and how to check whether it worked.

More than that, someone has to know what that text means. A developer needs to understand that application. Probably, this is a developer who was part of its implementation, or the last major set of changes.

For the software to be alive, it has to be alive in someone’s head.

And one head will never do; the unit of delivery is the team. That’s more resilient.

Alive software is owned and cared for by an active team. Some people keep learning, keep teaching the software, and the shared sociotechnical system keeps living. The team and software form a symmathesy.

How do we keep all our software alive, while still growing more?

Okay, but what if the software is good enough right now? How do we keep it alive when there’s no big initiative to change it?

Hmm. We can ask, what kind of code is easy to change?

Code needs to be clean and modern.

Well, it’s consistent. It is up-to-date with the language versions and frameworks and libraries that we currently use for development.

It is “readable” by our current selves. It uses familiar styles and idioms.

What you don’t want is to come at the “simple” (from an outside perspective) task of updating some text, and find you need to install a bunch of old tools. Oh wait, there are security patches that need to happen before this will pass pre-deployment checks. Oh, now we have to upgrade more stuff to work with the modern versions of those libraries. You don’t want to have to resuscitate the software before you can breathe new life into it.

If changing the software isn’t easy enough, we won’t do it. And then it gets terrible.

So all those tool upgrades, security patches, library updates gotta have been done already, in the regular course of business.

Keeping those up to date gives us an excuse to change the code, trigger a release, and then notice any problems in the deployment pipeline. We keep confidence that we can deploy it, because we deploy it every week whether we need to or not.

People need to be on stable teams with customer contact.

More interesting than properties of the code: what are some properties of people who can keep code alive?

The team is stable. There’s continuity of knowledge.

The team understands the reason the software exists. The business significance of that text and everything else.

And we still care. We have contact with people who use this software, so we can check in on whether this text change works for them. We continue to learn.

Code belongs to one team.

More interesting still: what kind of relationship does the alive-keeping team have with the still-alive code?

Ownership. The code is under the care of a single team.

Good communication. We can teach the code (by changing it), so we have good deployment automation and we understand the programming language, etc. And the code can teach us — it has good tests, so we know when we broke something. It is accountable to us, in the sense that it can tell us the story of what happens. This means observability. With this, we can learn (or re-learn) how it works while it’s running. Keep the learning moving, keep the system living.

The team is a learning system, within a learning system.

Finally: what kind of environment can hold such a relationship?

(diagram of code, people, relationship, environment)

It’s connected; the teams are in touch with the people who use software, or with customer support. The culture accepts continued iteration as good, it doesn’t fear change. Learning flows into and out of the symmathesy.

It supports learning. Software is funded as a profit center, as operational costs, not as capital expenditure, where a project is “done” and gets depreciated over years. How the accounting works around development teams is a good indication of whether a company is powered by software, or subject to software.

Then there’s the tricky one: the team doesn’t have too much else on their plate.

How do we keep adding code to our responsibilities?

The team that owns this code also owns other code. We don’t want to update libraries all day across various systems we’ve written before. We want to do new work.

It’s like a garden; we want to keep the flowers we planted years ago healthy, and we also want to plant new flowers. How do we increase the number of plants we can care for?

And, at a higher level — how can we, as people who think about DevOps, make every team in our organization able to keep code alive?

Teams are limited by cognitive load.

This is not: how do we increase the amount of work that we do. If all we did was type the same stuff all the time, we know what to do — we automate it.

Our work is not typing; it’s making decisions. Our limitation is not what we can do, it is what we can know.

In Team Topologies, Manuel Pais and Matthew Skelton emphasize: the unit of delivery is the team, and the limitation of a team is cognitive load.

We have to know what that software is about, and what the next software we’re working on is about, and the programming languages they’re in, and how to deploy them, and how to fill out our timesheets, and which kitchen has the best bubbly water selection, and who just had a baby, and — it takes a lot of knowledge to do our work well.

Team Topologies lists three categories of cognitive load.

The germane cognitive load, we want that.

Germane cognitive load is the business domain. It is why our software exists. We want complexity here, because the more complex work our software does, the less the people who use it have to bother with. Maximize the percentage of our cognitive load taken up by this category.

So which software systems a team owns matters; group by business domain.

Intrinsic cognitive load increases if we let code get out of date.

Intrinsic cognitive load is essential to the task. This is our programming language and frameworks and libraries. It is the quirks of the systems we integrate with. How to write a healthy database query. How the runtime works: browser behavior, or sometimes the garbage collector.

The fewer languages we have to know, the better. I used to be all about “the best language for the problem.” Now I recommend “the language your team knows best, as long as it’s good enough.”

And “fewer” includes versions of the language, so again, consistency in the code matters.

Extrinsic cognitive load is a property of the work environment. Work on this one.

Finally, extrinsic cognitive load is everything else. It’s the timesheet system. The health insurance forms. It’s our build tools. It’s Kubernetes. It’s how to get credentials to the database to test those queries. It’s who has to review a pull request, and when it’s OK to merge.

This is not the stuff we want to spend our brain on. The less extrinsic cognitive load on the team, the more we have room for the business and systems knowledge, the more responsibility we can take on.

And this is a place where carefully integrated tools can help.

DevOps is about moving system boundaries to work better. How can we do that?

We can move knowledge within the team, and we can move knowledge out to a different team.

We can move work below the line.

Within the team, we can move knowledge from the social side to the technical side of the symmathesy. We can package up our personal knowledge into code that can be shared.

Automations encapsulate knowledge of how to do something

Automate bits of our work. I do this with scripts.

The trick is, can we make sharing it with the team almost as easy as writing it for ourselves?

Especially automate anything we want to remain consistent.

For instance, when I worked on the docs at Atomist, I wrote the deployment automation for them. I made a glossary, and I wanted it in alphabetical order. I didn’t want to put it in alphabetical order once; I wanted it to constantly be alphabetical. This is a case for automation.

I wrote a function to alphabetize the markdown sections, and told it to run with every build and push the changes back to the repository.
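
A minimal sketch of such a function, assuming the glossary is one markdown file where each term is a “## Heading” section (this is not the original implementation):

```typescript
// Sketch: sort the "## " sections of a markdown file alphabetically.
function alphabetizeSections(markdown: string): string {
  // Split into chunks, each beginning at a "## " heading.
  const chunks = markdown.split(/^(?=## )/m);
  // Anything before the first heading stays at the top.
  const preamble = chunks[0].startsWith("## ") ? "" : chunks.shift()!;
  chunks.sort((a, b) => a.localeCompare(b, undefined, { sensitivity: "base" }));
  return preamble + chunks.join("");
}
```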

Autofixes like this also keep the third-party licenses up to date (all the npm dependencies and their licenses). This is a legal requirement that no human is going to keep up with by hand. Another one puts the standard license header on any code that’s committed without it. So I never copied the headers; I just let the automation do that. Formatting and linting, same thing.

If you care about consistency, put it in code. Please don’t nag a human.

Some of that knowledge can help with keeping code alive

Then there’s all that drudgery of updating versions and code styles, etc. — weeding the section of the garden we planted last year and earlier. How much of that can we automate?

We can write code to do some of our coding for us. To find the inconsistencies, and then fix some of them.

Encapsulate knowledge about -when- to do something

Often the work is more than knowledge of -how- to do something. It is also -when-, and that requires attentiveness. Very expensive for humans. When my pull request has been approved, then I need to hit merge. Then I need to wait for a build, and then I need to use that new artifact in some other repository.

Can we make a computer wait, instead of a person?

This is where you need an event stream to run automations in response to.
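
A hand-rolled sketch of that idea: a handler reacts to events from the development-event stream, so no person has to poll. The event shapes and helper functions here are hypothetical, not Atomist’s or anyone else’s real API:

```typescript
// Hypothetical dev events and helpers: the shape of a computer that waits.
type DevEvent =
  | { type: "PullRequestApproved"; repo: string; prNumber: number }
  | { type: "BuildSucceeded"; repo: string; artifact: string };

declare function mergePullRequest(repo: string, prNumber: number): Promise<void>;
declare function bumpDependency(repo: string, artifact: string): Promise<void>;

async function onDevEvent(event: DevEvent): Promise<void> {
  switch (event.type) {
    case "PullRequestApproved":
      // No human watching for the approval.
      await mergePullRequest(event.repo, event.prNumber);
      break;
    case "BuildSucceeded":
      // No human waiting on the build to use the new artifact.
      await bumpDependency("downstream-repo", event.artifact);
      break;
  }
}
```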

Galo Navarro has an excellent description of how this helped smooth the development experience at Adevinta. They created an event hub for software development and operations related activities, called Devhose. (This is what Atomist works to let everyone do, without implementing the event hub themselves.)

We can move some of that to a platform team.

Yet, every automation we build is code that we need to keep alive.

We can move knowledge across team boundaries, with a platform team. I want my team’s breadth of responsibility to increase, as we keep more software alive, so I want its depth to be reduced.

Team Topologies describes this structure. The business software teams are called “stream aligned” because they’re working in a particular value stream, keeping software alive for someone else. We want to thin out their extrinsic cognitive load.

Move some of it to a platform team. That team can take responsibility for a lot of those automations, and for deep knowledge of delivery and operational tooling. Keep the human judgement of what to deploy when in the stream-aligned teams, and a lot of the “how” and “some common things to watch out for” in the platform team.

Some things a platform team can do:

  • onboarding
  • onboarding of code (delivery setup)
  • delivery
  • checks every team needs, like licenses

And then, all of this needs to stay alive, too. Your delivery process needs to keep updating for every repository. If delivery is event-based, and the latest delivery logic responds to every push (instead of what the repo was last configured for), then this keeps happening.

But keep thinning our platforms.

Platforms are not business value, though. We don’t really want more and more software there, in the platform.

We do want to keep adding services and automation that helps the team. But growing the platform team is not a goal. Instead, we need to make our platforms thinner.

There is such a thing as “done”

The best way to thin our software is outsourcing to another company. Not the development work, not the decisions. But software as a service, IaaS, logging, tooling of all sorts — hire a professional. Software someone else runs is tech debt you don’t have.

So maybe Galo could move Devhose on top of Atomist and retire some code.

Because any code that isn’t describing business complexity, we do want it to die. As soon as we can move onto someone else’s service, win. Kill it, take it out of production. Then, finally, it’s done.

So yeah. There is such a thing as done. “Done” is death. You don’t want it for your value-producing code. You do want it for all other code you run.

Don’t do boring work.

If keeping software alive sounds boring, then let’s change that. Go up a level of abstraction and ask, how much of this can we automate?

Writing code to change code is hard. Automating is hard.

That will challenge your knowledge of your own job, as you try to encode it into a computer. Best case, you get the computer doing the boring bits for you. Worst case, you learn that your job really is hard, and you feel smart.

Keep learning to keep living. Works for software, and it works for us.

Fun with Docker: "Release file… is not valid yet"

Today my Docker build failed on Windows because apt-get update failed because some release files were not valid yet. It said they’d be valid in about 3.5 hours. WAT.

I don’t care about your release files! Do not exit with code 100! This is not what I want to think about right now!

Spoiler: restarting my computer fixed it. 😤

This turned out to be a problem with the system time. The Ubuntu docker containers thought it was 19:30 UTC, which is like 8 hours ago. Probably five hours ago, someone updated the release files wherever apt-get calls home to. My Docker container considered that time THE FUTURE. The scary future.

Windows had the time right, 21:30 CST (which is 6 hours earlier than UTC). Ubuntu in WSL was closer; it thought it was 19:30 CST. But Docker containers were way off. This included Docker on Windows and Docker on Ubuntu.

Entertainingly, the Docker build worked on Ubuntu in WSL. I’m pretty sure that’s because I ran this same build there long ago, and Docker had the layers cached. Each line in the Dockerfile results in a layer, so Docker starts the build operation at the first line that has changed. So it didn’t even run the apt-get update.

This is one of the ways that Docker builds are not reproducible. apt-get calls out to the world, so it doesn’t do the same thing every time. When files were updated matters, and (now I know) what time your computer thinks it is matters.

Something on the internet suggested restarting the VM that Docker uses. It seems likely that Docker on WSL and Docker on Windows (in linux-container mode) are using the same VM under the hood somewhere. I don’t know how to restart that explicitly, so I restarted the computer. Now all the clocks are right (Windows, Ubuntu in WSL, and Ubuntu containers from both Docker daemons). Now the build works fine.

I’m not currently worried about running containers in production. (I just want to develop this website without installing python’s package manager locally. This is our world.) Still working in Docker challenges me to understand more about operating systems, their package managers, networking, system clocks, etc.

Docker: it puts the Ops in DevOps. That’s my day.

I don't want the best

“Which database is the best?”

Superlatives are dangerous. They pressure you to line up the alternatives in a row, judge them all, make the perfect decision.

This implies that databases can be compared based on internal qualities. Performance, reliability, scalability, integrity, what else is there?

We know better than that. We recognize that relational databases, graph databases, document stores etc serve different use cases.

“What are you going to do with it?”

This is progress; we are now considering the larger problem. Can we define the best database per use case?

In 7 Rules of Effective Change, Esther Derby recommends balance among (1) inner needs and capabilities, (2) the needs and capabilities of those near you, and (3) the larger context and problem. (This is called “congruence.”)

“What is the best?” considers only (1) internal qualities. Asking “What are you doing with it?” adds (3) the problem we’re solving. What about (2) the people and systems nearby?

Maybe you want to store JSON blobs and the high-scale solution is Mongo. But is that the “best” for your situation?

Consider:

  • does anyone on the team know how to use and administer it?
  • are there good drivers and client libraries for your language?
  • does it run well in the production environment you already run?
  • do the frameworks you use integrate with it?
  • do you have diagnostic tools and utilities for it?

Maybe you’ve never used Mongo before. Maybe you already know PostgreSQL, and it already runs at your company. Can you store your JSON blobs there instead? Maybe we don’t need anything “better.”
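
If so, a sketch with the node-postgres client might be all you need; the table and column names here are invented:

```typescript
// Store JSON blobs in the PostgreSQL you already run.
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from PG* env variables

// One JSONB column gives schemaless storage plus real queries:
//   CREATE TABLE blobs (id serial PRIMARY KEY, body jsonb NOT NULL);

async function save(blob: object): Promise<void> {
  await pool.query("INSERT INTO blobs (body) VALUES ($1::jsonb)", [
    JSON.stringify(blob),
  ]);
}

async function findByType(type: string): Promise<unknown[]> {
  // @> is PostgreSQL's JSONB containment operator.
  const result = await pool.query(
    "SELECT body FROM blobs WHERE body @> $1::jsonb",
    [JSON.stringify({ type })]
  );
  return result.rows.map((row) => row.body);
}
```

PostgreSQL’s jsonb supports containment queries and GIN indexes, so “known and already running” can also be plenty fast.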

The people we are and systems we have matter. If a component works smoothly with them, and is good enough to do the task at hand, great. What works for you is better than anyone else’s “best.”

Avoid Specificity in Expectations

Today my daughter overslept and I had to take her to school. Usually when that happens, I get all grouchy and resentful. Dammit, I thought I’d get a nice quiet coffee but NO, I’m scraping ice off the car.

Today I didn’t mind. It was fine. It’s the first day back after winter break, so I expected her to oversleep. I expected to take her to school.

I expected to scrape ice off the car… OK no, I forgot about that part and did get a little grouchy.

Our feelings surface from the difference between expectations and reality. Not from reality by itself.

In our career, people may ask us what we want in our next role.
“Where do you want to be in five years?” Answering this question hurts in several ways.

The more specific our vision of our future selves, the more we set ourselves up for disappointment.

We hide from ourselves all the other possibilities that we didn’t know about.

We tie our future self to the imagined desires of our present self.

In the next five years, I will gain new interests, new friends, and new ideas. Let those, along with the new situations I find myself in, guide my future steps.

Expectations are dangerous. We need some of them, especially in the near future, to take action. The less specific they can be, the more potential remains open to us, and the happier we can be.

Tomorrow, I will have a quiet coffee or take my daughter to school; I don’t have to know which until it happens. Five years from now, I have no idea what I’ll be doing — probably something cooler than my present self can imagine.

Go toward uncertainty

It’s difficult for an executive to criticize a budget when most line items are for mysterious high technology activities. It’s easier to tackle the more understandable portions, like postage, janitorial services, and consulting.

Jerry Weinberg

We want to help. We want to do stuff. We look for matches between what needs doing and what we know how to do, and then we do it.

But does that always help?

Today in his newsletter, Marcus Blankenship told the story of the three bricklayers. What are you doing? “Laying bricks.” “Making a wall.” “Building a cathedral.”

We want to do something. We want to lay some bricks. As programmers, we want to write some tests, make a class, throw out an API.

After we break down a feature implementation into tasks, it is tempting to get started on the parts we know how to do. Knock those out, advance that progress bar. Make the wall taller.

Those are the least helpful parts to start with! In software development, our job is making decisions. What we need most is knowledge.

The pieces of the task that are most amorphous, kinda vague, the ones our brains want to slide past with hand-waves — those are the tasks that will give us more than apparent progress.

Integrate with that new service. Get authorization working. Pick the database and get familiar with it.

Digging into the uncertain tasks gives us information. We will learn how the API needs to be different than we thought. The data we didn’t know we needed, the unhappy-paths we didn’t know we needed to pave.

Lay the bricks slowly. Consider, what is going to hold this wall up? and what will this wall hold up?

In the opening quote, an executive tries to manage costs. They are drawn to the little nitpicky items that look approachable. But the easy ones are already pretty good! The giant Megatechnology items with big dollars next to them, these provoke handwaves. What might the executive gain by digging in and learning more about these? Maybe not cost-cutting, but definitely better decision making.

To help with the cathedral: be okay with uncertainty, sit in the unknown. Feel confused, be uncomfortable. Observe things that don’t yet make sense.

It’s tempting to start with what you most know how to do. Start instead with what you least know how to do.

Code (especially unreleased code) is a fast material to change, far faster than bricks. The hard part is deciding what to change it to. Aim for understanding, and progress will come to you.

Reducing WIP in laundry and code

Today I brought up a load of laundry. When doing chores, I practice keeping WIP (work in progress) to a minimum. Finish each thing, then do another one. This is good training for code.

For instance, on the way up the stairs with the basket, I saw a tumbleweed of cat hair. I didn’t pick it up. Right now I’m doing laundry.

I put the basket on the bed, pulled out a pair of pants, folded it, and then stopped.

Do I put the pants on the bed and fold the rest? Or do I put the pants away right now, then start the next piece of clothing?

It depends which one thing I’m doing: folding this load of laundry? or putting a piece of clothing in its place?

It’s like in software development. Do we slice work by functionality or by layer?

Feature slicing, where we do all components (front end, back end, database, etc) of one small change and release that before moving on: this is like folding the pants and putting them away before picking up another item.

Layered work, where we first make the whole database schema, then develop the back end, then create the front end: this is like folding all the clothes and then putting them all away.

Pants on the bed are WIP. When clothes are on the bed, the cat sits on them and I can’t put them away. Then when I want to nap, my bed still has clothes on it. WIP is liability, not value. I can’t access my bed, and no one has clean pants.

cat on towels

Yet, folding the laundry and then putting it away is more efficient. I might fold three pairs of pants, and then put them away all at once. Four towels, one trip to the bathroom closet. The process as a whole is faster and smoother (excluding the cats).

Is layered work more efficient in software? NO. It always takes far longer, with worse results. A lot of rework happens, and then we settle for something that isn’t super slick.

Why is laundry different?

Because I’ve folded and put away this same pair of pants many times before. On the same shelf. Household chores are rarely a discovery process.

If I hadn’t done this before, then I might fold all of Evelyn’s pants in fourths. That is standard practice, and my pants fit nicely in my cabinet that way. When I go to put Evelyn’s pants away, I’d find that her shelf is deeper. It’s just right for pants folded in thirds. Folded in fourths, they don’t all fit; I run out of height.

Now it’s time for rework: fold all her pants again, in thirds this time.

With feature slicing, I would fold one pair of pants in fourths, put it on the shelf, notice that it doesn’t fit well, refold it in thirds, and find that it fits perfectly. Every subsequent pair of her pants, I’d fold in thirds to begin with.

Completing a thin feature slice brings learning back to all future feature slices.

For repetitive chores, we can choose efficiency. For new work, such as all software development, aim for learning. That will make us fast.

Games or gaming, Work or working

Games aren’t much “fun” when rules, rather than relationships, dominate the activity, when there is no attention to “flow,” “fairness,” “respect” and “nice.”

Dr. Linda Hughes, “Beyond the Rules of the Game: Why Are Rooie Rules Nice?”

At a past job, we played Hearts every day at lunch. Out of a core group of 6-8, there were always at least four to participate. I worked there for about five years; we played over a thousand games together. We wore out dozens of decks of cards.

On top of the core rules of Hearts, we accrued a whole culture. There was the “Slone shooter” (the worst possible score, named for a former member of the group). We said “Panda Panda” (winning a trick with the highest card) and “Where’s the Jerboa?” (the two of clubs comes out to start the game) — both originated in our favorite deck, which had a different animal on each card. Long after that deck retired, the cards retained their animal names.

We had rules of etiquette. The player in the lead was the target, and everyone else worked together to damage their score. Everyone was expected to make a logically justifiable play, except when succumbing to other players chanting “Panda Panda!” to summon the Ace of Hearts.

I’ve never since had that much fun at cards. The rules of the game are only the beginning.

See, you can think about games, or you can observe gaming. There are the rules as written, and then there’s the experience of the players.

It’s the same with work: you can talk about work-as-imagined, or you can look at work-as-done.

Work-as-imagined is the official process. It is how you’re supposed to get your work done. Work-as-done is real life.

This is why tabletop board games are more fun than their electronic equivalents. Everyone sees how the rules work, because we execute them ourselves. House rules evolve. When there’s ambiguity, the group decides what’s fair. Lovely traditions grow, jokes get funnier with repetition, and the game becomes richer than its rules.

It’s the same at work. Post-its on the wall give shared physical context, and they’re more flexible than any ticket-tracking software. (Software usually limits reality to what was imagined by its developers.)

Each collaborating team eventually develops its own small culture. Vocabulary, jokes, etiquette. These exist on top of (sometimes in spite of) decreed processes of work.

These interactions make work “fun.” They care about “fairness,” “respect,” and “nice.” They also lead to “flow” — flow of work through the team. Communication is smooth, collaboration is joyful and productive. This is how we win.

Waste, fast and slow

The problem isn’t the problem; coping is.

Virginia Satir

Throwing food in the trash feels wasteful. Sometimes I feel compelled to eat the food instead. Then I feel lethargic, I gain weight, and everything I do is slower.

Sometimes waste isn’t a problem. The world is not better for that food passing through my digestive system. Sometimes it’s preventing waste that hurts us.

Inefficient code uses compute and costs money. Even when the response time and throughput don’t impact the user, that’s still wasteful, right?

Waste that speeds you up

Say we optimize all our code. But then it’s harder to read and to change. We slow ourselves down with our fear of waste.

Duplicate code takes time to write and test. Maybe many teams use some handy functions for formatting strings and manipulating lists. It’s a waste to maintain this more than once!

Say we put that in a shared utility library. Now every change I make impacts a bunch of other teams, and every change they make surprises me. To save a bit of typing, we’ve taken on coupling and mushed up the code ownership. Everything we do is slower.

Waste that slows you down

On the other hand, duplicated business logic means every future change takes more work and coordination. That is some waste that will bite you.

In Office Space, there’s that one person who takes the requirements from one floor to another. His salary is a waste. Much worse: he wants to preserve his meager responsibilities, so he’ll prevent the people who use the software from talking to the developers. Everything is slower forever.

When you spot a distasteful waste, ask: does this waste speed me up, or does it slow me down forever?

I can be wasteful, and it’s okay sometimes. I’ll waste compute for faster and safer code changes. I’ll spend time on duplicate utility code to avoid tripping up other teams.

Some waste is just waste: time spent once, or money spent ongoing, and that’s it. Some waste makes us more effective; it saves us the cognitive load of thinking about one more thing.

Other waste is sticky. It drags you into spending more time in the future. It pulls you into handoffs and queues and coupling.

Fight the sticky waste that spirals into more drag in the future. Let the other waste be. Throw the extra food in the trash; your future self will move lightly for it.