Some people work in a system, and some people work on a system.
Like, you can be the person who washes the dishes, or the person who installs and maintains the dishwasher.
You can be the person who assembles the reports every week, or the person who automates that report assembly. (Jacob Stoebel told this story on >Code #148 today. That’s how he got into software development.)
You can conform to a system, or you can participate fully — part of serving the system is changing the system to better serve you.
Developers are inherently system changers. That’s what we do. No wonder we’re hard to manage!
No wonder software communities are full of turmoil and rabble-rousing and shifting technologies: we are a whole industry full of system changers.
Also on >Code today, we talked about personal automation. Chante Thurmond remarked on the tools that exist today to let people (not just developers!) customize, tweak, and automate their work. We can all craft the systems we operate in. More and more system changers.
This is the real change software makes in the world.
The other day, I heard a story about a leadership retreat where the goal was to agree upon shared values. They held a vote, and lo, there was an even split between all the values. The group could not agree on which ones represented the company.
This makes sense. Our values are part of our identity. You can’t compromise on your values! That would make you a bad person!
How can we come together, then? Must we fire half the leadership and replace them with people willing to espouse the same values as the remaining half? or can we settle for mutual purpose, and go with a mission statement?
Corporate values are mostly generic. Honesty, integrity, innovation, teamwork, quality, respect, empathy. What do those even mean in the course of a workday?
To communicate that, you need to give examples of how a value is expressed in the workplace. For example:
Integrity: we do what we say we will do. When this becomes impossible, we communicate clearly and promptly with the people affected.
Honesty: we contribute our knowledge, especially about the limitations of our knowledge. We thank other people when they bring us surprising information, especially when it is bad news.
Innovation: we spend 80% of our time working within the system we have, and 20% of our work time improving the system we work in.
Collegiality: when we have disagreements, we resolve them peer-to-peer, not by appealing to higher management.
Once we’ve expressed what we mean with implementation strategies, let’s strike out the names of the value and call the list “agreed-upon methods.”
See, people can’t compromise on values. But we can compromise on methods.
“When they become a team, they do not necessarily expect everyone to adopt the same values and beliefs. In fact, they grow to value and respect the values and beliefs of the other people in the team.”
When we already share a common purpose — success in the company’s mission — we need to agree on how we will accomplish that. We need protocols to coordinate work and encourage dialog. But we don’t need to all emphasize the same basic human values. In fact, balance can help.
I’ve long preferred to work on teams that base cooperation on shared values like inclusiveness, respect, and curiosity. Teams where everyone feels valued at all times, both as a coworker and as a human.
Yet, the phrase “shared values” kinda gives me the willies. It implies that our compatibility as teammates comes from deep-seated beliefs. Which implies shared culture or even religion. I want to collaborate with people who are more different from me than that.
A teammate is anyone whose success is entwined with mine. We succeed together and fail together. This means we have a shared goal: a mutual purpose. We are going to the same place, regardless of where we each came from.
Yet mutual purpose is not enough for us to cooperate: we need to agree on how to achieve it. If we want to have dinner together, and you go to the grocery for ingredients while I book a restaurant, we are defeating each other. We need to agree that we want to share dinner and we will achieve this by cooking it, and this will require gathering ingredients and cleaning the kitchen. We can pair on tasks or divide them, but we can agree on what needs done. This way we value each other’s work at all times.
In software development, I want to agree on who our customer is and what we are doing for them. Everyone in the team shares this purpose. And I want to agree closely enough on how to do that. Like in science: an “invisible college” of collaborators exhibits shared purpose and practices.
For instance: on my project, we aim to change the way people deliver software. Our methods include:
provide an open-source software delivery machine framework, so people can write delivery in code.
delivering to npm, after passing our chosen checks, through code that we also control
collaborating asynchronously, with pull requests and in chat, so that people worldwide can participate. We each follow the code changes and a few specific channels.
with mutual respect and assumptions of positive intent. When we are confused by another’s actions, hop on a video call.
plus near-weekly planning meetings where we can go deep on what needs to change.
Agreed-upon methods incorporate: what solution we’re building, what we prioritize, some degree of architecture and implementation detail, our collaboration practices, and how we evolve these. The level of detail and which parts are important will vary by team, as needed for their purpose. It is important that they do vary; otherwise this is not a learning system. In our teams as in our technical architecture, the crucial question is “how do we change it?”
Some of these methods conflict with some deep-seated beliefs. Personal values can matter. For instance, if one of our methods is: “when we communicate, every team member’s opinion is heard” and some potential team member believes “the opinion of a woman is irrelevant” then no. That person’s deep-seated beliefs are in conflict with the agreed-upon methods of our team. If a potential team member believes that ignorant customers are to blame for any mistakes they make, that is in conflict with an agreed-upon method of discoverability.
Hiring becomes, will you adopt our mutual purpose? can you join in with our methods? and will you help us reach better methods in the future?
A person’s values might conflict with our team’s purpose or methods, but I don’t want to ask someone to hold particular values in order to work on my team.
Can we please replace “shared values” with “mutual purpose and agreed-upon methods”?
Symmathecist: (sim-MATH-uh-sist) an active participant in a symmathesy.
A symmathesy (sim-MATH-uh-see, coined by Nora Bateson) is a learning system made of learning parts. Software teams are each a symmathesy, composed of the people on the team, the running software, and all their tools.
The people on the team learn from each other and from the running software (exceptions it throws, data it saves). The software learns from us, because we change it. Our tools learn from us as we implement them or build in them (queries, dashboards, scripts, automations).
This flow of mutual learning means the system is never the same. It is always changing, and its participants are always changing.
An aggregate is the sum of its parts. A system is also a product of its relationships. A symmathesy is also powered by every past interaction.
I aim to be conscious of these interactions. I work to maximize the flow of learning within the system, and between the system and its environment (the rest of the organization, and the people or systems who benefit from our software). Software is not the point: it is a means, a material that I manipulate for the betterment of the world and for the future of my team.
I am a symmathecist, in the medium of software.
my “Origins of Opera” keynote (video or summary) introduces this term
Lately I’m working on our documentation. We write it in markdown, turn it into a web site, and then serve it from s3. To turn it into a web site, we use mkdocs and the material theme for mkdocs. Mkdocs is written in Python. Then we test it with HtmlProofer, which is in Ruby. Okay.
A week ago, I set out to add an “Edit on GitHub” link to each of our pages.
That’s built-in functionality in mkdocs; define a repo_url and an edit_uri in mkdocs.yml and it should just work. It didn’t work right away (although now I wonder whether I just missed the little pencil symbol because I was looking for text). Before I dug into figuring out why, I upgraded mkdocs and material because we were two breaking versions behind; the latest is 3.0.4 and we were on 1.0.4. If I’m gonna study a tool, I want to use the latest docs.
The broken links
The upgrade was no trouble as far as producing the site. HtmlProofer, though, found a bunch of broken links. To troubleshoot this, I went through pretty much all the docs on mkdocs and on material. (They’re beautiful docs; there’s a reason we use these tools.) Then dug around in the material templates and the mkdocs Python code. I created an issue on mkdocs (it’s fixed already!) and on material (the maintainer said thanks!) for the bug, and then worked around it by adding an exception to our HtmlProofer invocation, after looking at its docs to find out how to do it. Which required breaking our HtmlProofer call into its own script because we call it in three places.
After that, there was still a broken link. I diagnosed this one also as a bug that could be fixed either in mkdocs or the template, but didn’t have the heart to make another issue report. I worked around it instead, by overriding that page in the template. (Just now, having noticed how nice the maintainers were, I made the effort to create another issue. I even tried to make a PR but the build steps didn’t work on my computer. This is not a surprise.)
A brief interval of work that I wanted to do
Now the tests and build work, and the upgrade is done. After this, I had an adventure getting the edit link working without having a GitHub repo link in the upper-right corner (it was useless and animated, yuck). To have the “Edit on GitHub” I need repo_url defined in the config, but that always results in a repo link as well, so I had to override the entire header.html to remove that link. When we update the theme, that override will be out of date. I considered various ways to use Atomist to make sure I remember to do that, then settled for a detailed commit message.
By the end of the day, I had a PR in to our docs repository with the upgrade. Over the weekend, I got that reviewed, modified and merged into master.
In which I break the entire site
The master branch of this repository gets published to GitHub pages for a final review. It looked fine there, so I deployed it to s3. A few hours later, I stopped by docs.atomist.com, and oh no!
That is not what our site should look like! None of the styles are loading!
Is it me? is it everyone? is it my browser? I tried clearing some caches, I tried a few browsers, then I rolled the site back. Then forward, and looked at it some more, then back again. There were some 404s on CSS, but later there weren’t, and then everything was loading OK but still it looked like garbage. Computed styles showed nearly-empty in Firefox but had more stuff in Chrome (later, someone pointed out that Chrome has a bunch of default styles).
This was beyond my paltry web diagnostic skills. The next day I asked for help from the team, and Danny volunteered to be a second pair of eyes. We modified the build process to push to a subdirectory so that we could leave the working site up while inspecting the newer, nonworking version. Danny spotted that the CSS files were loading, but the content type in the headers was wrong. It should be text/css but is text/plain . So the browser is loading the file and then ignoring it with no error. 😠
Useful! I had already noticed that the CSS files had changed name formats after the upgrade. Instead of application-palette-f1354325.css (or something similar) we have application-palette.f231453.css. Ah-ha, what if the dot is causing something to think it is not a .css file but a .f231453.css file? I went looking for that. I searched source for application-palette and found it in the material theme.
I found application-palette.css in the src directory and application-palette-f1354325.css in the material directory, which (according to mkdocs.yml) is where the templates for the theme live. So something is adding that extra number, in some sort of build process. OK, how does it build? There’s a package.json so I check it for scripts. Sure enough, scripts.build contains a call to … make. OK, look for a Makefile. Yup, and that contains a call to … webpack. Gah! I’ve been avoiding webpack because I know it is deep. Where does it get its definition? probably that webpack.config.js file. Look in there, and it lists plugin after plugin, all of which are unfamiliar. Noooooo. But then I spot it! I found that stupid dot in the webpack.config.js … but changing that would mean rebuilding the theme, so I look for another way.
Which is good, because that wasn’t even the problem. Later I noticed that all the CSS files had the wrong content type, not just the ones with the dots. But I learned something, right? Right.
Next I searched for “s3 content type” since whatever makes a website available from s3 is sending these. That proved fruitful. The content type comes from metadata on s3, associated with each uploaded file. I opened the AWS console and looked in S3, found this bucket, found the CSS, looked at its metadata. Sure enough, it has a content-type element set to text/plain. So how does that get there?
Not pictured: at least half an hour of learning enough about the s3 command-line interfaces to be able to list metadata. (For the record, it’s aws s3api head-object --bucket my.bucket --key path/to/file ) This included some frustration of “why is it saying Forbidden when I clearly have read permissions” which resolved to “oh right, because I’m authenticated in this one terminal but not this other one.” The API is pretty complicated; there are two. The friendly one does not list metadata. But it did let me manually aws s3 sync some files up, which let me test more things. That program understands that .css files are text/css.
While I’m working on this, I post updates and notes to myself in Slack, in my jessitron-stream channel. David notices and contributes some history and some research. Our files get to s3 using s3cmd. That should be setting the content-type metadata. David remarked “There used to be complaints in the build logs that python-magic was not installed so s3cmd was going to guess the content type,” so he installed python-magic like 3 weeks ago to end that warning.
He linked to this issue: https://github.com/s3tools/s3cmd/issues/198 (and maybe there was another one?) and suggested adding --no-mime-type --no-guess-content-type to the s3 arguments. He also removed python-magic from the build. I tried those arguments. It complained about the first one being invalid (maybe because python-magic was gone?) so I removed it. The upload happened, but when I visited that version of the site, it asked me if I wanted to download this DMG! (I’m on a Mac. That would be a .exe on Windows.) The content-type of index.html was set to octet-stream. Um, no, that’s worse. Deeper in that issue thread I found a suggestion to use --guess-content-type and tried that. My commit message (on a branch) was “Wave the wand this direction” because this is spellcasting, not understanding.
Lo, it worked! Everything worked! I rebased those changes to get rid of all the intermediate things we tried, merged them to master and tagged the new version to trigger deploy. Hooray, we are able to update docs.atomist.com again!
A red herring
In Slack, we got a ping from our designer, who was having trouble building the site now. David and I were like, oh no, the upgrade broke something. Setting up the development environment for these docs, with Python and Ruby, is a pain. Ben was getting an error that resolved on Google to “wrong version of Python,” something about an exception in a loop which was a change between Python 3.6 and 3.7. Ben uninstalled and installed Python eight different ways. Both David and I joined a screenshare to help. Now nothing (including pip, Python’s package manager) can find the Python library zlib, which is a wrapper of a native libz library for compresssion (which is installed; he has xcode tools on his mac). This means pip can’t install packages, because it can’t unzip anything, including virtualenv, which we use to control the version of Python and of libraries. His machine is a giant circle of middle fingers.
I am not even gonna try to list the things we tried here. It was a mess. Homebrew was involved, and sudo rm. The worst part is, you know what the problem was? He hadn’t updated mkdocs. He hadn’t pulled the master branch. What he had done was upgrade Python, which didn’t work with the old version of mkdocs but did work with the new! This was a few hours of all of our lives we would like to have back.
Not so fast
But this story is not over! Oh, no. The next day, some people complained on our Slack that the docs site was not loading. They were seeing the unstyled garbage. Clear the browser caches, same problem. David went to our CDN, CloudFront, and told it to not cache these things, and to manually refresh the caches. But NO. Somewhere in the bowels of the internet, bad content-types are cached for these CSS files. The files haven’t changed, so the caches decline to refresh. They do not notice that the content-type has changed. Having been bad for an hour or two, those files are now bad for some unknown amount of time, only to some people. The URLs are cursed.
The only thing to do is to rename them. I write a script that renames all the .css files and changes all references to them. I kluge that into our build process, after we build the site and before we copy it up to s3. This works.
OK. Incident over, as far as we can tell.
There is no such thing as “root cause” in systems this complex. There are “conditions that allowed this to be a problem.” And crucially, there are many conditions and actions that kept it from being worse. Eliminating the former, trying to “make sure this never happens again” is Safety-I. Amplifying the latter, sharpening our vision into potential problems strengthening our ability to solve them, is Safety-II. In this analysis, I’ll remark briefly on the Safety-I sources of problems and more extensively on Safety-II sources of resilience.
How did this happen?
It seems likely that adding the python-magic library contributed. That changed the behavior of s3cmd, except that it didn’t show until new files were created. So s3cmd’s behavior of not updating the metadata on files that already exist made this problem into a sneaky lurking one, dark debt.
Ironically, that library was added to reduce debt, as a response to this line in the build:
WARNING: Module python-magic is not available. Guessing MIME types based on file extensions.
It turns out guessing MIME types based on file extensions is a great way to do it. It’s great because it’s predictable by humans, a key property of collaborative automation. This beats clever-but-unpredictable magic.
The upgrade of mkdocs and material did trigger the problem, because it met the necessary condition of adding new files. It’s tempting to avoid upgrades, because upgrades often trigger latent problems, just like this. But upgrades also remove problems, like Ben’s upgrade to Python 3.7.
How did it get so bad?
We didn’t know before deploying the live site that this wasn’t going to work. Later, we had to develop a way of deploying to s3 without overwriting the live site in order to diagnose it. I also didn’t notice immediately that it was broken, so the site was garbage for a few hours … long enough for some CDN nodes to grab and keep the evil content-types.
Gosh, there was so much.
I checked the site at all. That was deliberate.
Communication and people: I couldn’t figure this out without Danny and David. Our daily standup, Zoom, and Slack were essential collaboration tools. Our notifications from Atomist in the #docs channel of Atomist community Slack helped us see what the others were doing.
This is not a checklist. This is a set of learnings that affect our priorities for future work.
We want to see the rendering of a version of the site on s3 before we release it. This is something to build in the Atomist SDM I’m making for this site, which will replace the inflexible Travis build.
I want to check the site after each release. I can make my SDM send me a direct message whenever one is done.
We could switch from s3cmd to aws s3 sync for more predictable behavior, testable on more systems.
Setting up the right versions of Python and Ruby on a local computer is bad. I want a development process that uses Docker for isolation. Ideally, an SDM that runs locally in Docker. (That’s already been on my list, and now it seems more important.)
Software Development has moved forward a lot recently. Both on the code side and the runtime side, we’ve had huge advances.
Then on the runtime side, we don’t have to deploy to hardware anymore. Virtual machines were a step, and then the cloud, and now Kubernetes is a big deal. Kubernetes is a big deal because it’s higher level of abstraction, where developers can work in the language of the domain of software. But that’s not all: Kubernetes also offers an API, which means we can work with it using code.
We can do more with code now on both sides, thanks to expressive frameworks and to an API for hardware. But there’s something in the middle lagging behind.
The bridge from source code to running software is Delivery. Delivery has made advances: we went from Bash and make to pipelines. But pipelines have been around for a decade, and since then we’ve got … morepipelines. It’s time for the next revolutionary step.
This is Atomist’s mission: to change the way we deliver software. Not to make it incrementally better, but to rethink delivery the way we’ve rethought application architecture and runtime infrastructure.
The way forward is not more pipelines. Nor is it event more bash scripts or configuration that wishes it were code. The way forward is not updated separately for each delivered application.
The way forward is to do more with code. In a real programming language. The way forward responds to events in the domain of software delivery: to code pushes, and also tickets, pull requests, deploys, builds, and messages in chat. It responds in context: when our delivery machine decides (in code!) what to do with a particular change, it does so with awareness of what the code says, of who changed it, in response to what ticket. It responds in communication: when in doubt, contact the responsible human and ask them whether they’d like to restart the build, submit the pull request, or deploy to production.
As we succeed, our systems increase in complexity. The systems to control them need to be at least as smart. We need more power in our delivery than any GUI screen or YAML file can give us. We have that power, when we craft our delivery in code on a strong framework with a domain-specific API.
Every company is in the software delivery business now. Let’s take it seriously, the same way we do our code and our production runtime.
Yesterday I was ready to get some changes into master, so I merged in the latest and opened a PR. But NO, the build on my pull request broke. The error was:
ERROR: (jsdoc-format) /long/path/to/file.ts[52, 1]: asterisks in jsdoc must be aligned ERROR: (jsdoc-format) /long/path/to/file.ts[53, 1]: asterisks in jsdoc must be aligned ERROR: (jsdoc-format) /long/path/to/file.ts[54, 1]: asterisks in jsdoc must be aligned ERROR: (jsdoc-format) /long/path/to/file.ts[55, 1]: asterisks in jsdoc must be aligned … fifty more like that …
gah! someone added more tslint rules, and this one doesn’t have an automatic fix. Maybe someone upgraded tslint, its “recommended” ruleset changed, and bam, my build is broken.
For measley formatting trivia.
/** * oh no! This JSDoc comment is not aligned perfectly! * The stars are supposed to have one more space before them * so they all line up under the first top one * * The world will end! Break the build! * * @param likeItMatters a computer could fix this grrr * @deprecated */ whatever(likeItMatters: Things): Stuff;
Look, I’m all for consistent code formatting. But I refuse to spend my time adding a space before fifty different asterisks. Yes I know I can do this with fewer keystrokes using editor acrobatics. I refuse to even open the file.
You know who’s good at consistency? Computers, that’s who. You want consistent formatting? Make a robot do it.
So I made a robot do it. In our Software Delivery Machine, we have various autofixes: transformations that run on code after every commit. If an autofix makes a change, the robot makes a commit like a polite person would. No build errors, just fix it thanks.
I wrote this function, which does a transformation on the code. The framework takes care of triggering, cloning the repository, and committing the result.
You can do this too. You can do it on your local machine with the fully open-source Local Software Delivery Machine. Fix your code on demand, or in response to each commit. Write your own functions to keep your code looking the way you like. Never be embarrassed by the same mistake again!
To help you try it out, I added my autofix to an otherwise-empty Software Delivery Machine.
Go to that location: cd $ATOMIST_ROOT/atomist-blogs/align-stars-sdm
Now for the slow part: npm install
Start up your SDM: atomist start --local The SDM is a process that hangs out on your computer waiting to help. It swings into action when triggered by the atomist command line, or by commits inside $ATOMIST_ROOT.
Optional: in another terminal, run atomist feed. This will give you a running summary of what your SDM is up to.
Now screw up some formatting. I’ve left some nice jsdoc comments in lib/autofix/alignStars.ts; move those stars over a little.
Save and make a commit: git commit -am “Oh no, misalignment”
Check the output in your atomist feed, and you should see that an Autofix has run. (You can also type atomist align the stars to do this specific transform, in a repository not wired up to trigger you SDM.)
Check your commit history: git log. Did Atomist make a commit? Check its contents: git show and you should see the stars moved into alignment.
OK that was a lot just to format some comments. The important part is we can write functions to realize policy. If a person complained about this formatting I’d tell them to f*** off or fix it themselves — in fact I did curse extensively at tslint when it gave me these errors. Tslint didn’t care. Programs aren’t offended.
If I want my teammates’ code to conforms to my standards, it’s rude to ask them to be vigilant about details. Aligning asterisks — I mean, some people like it, and that’s fine, I kinda enjoy folding laundry — but as a developer that’s a crummy use of my attention. Computers, though! Computers are great at being vigilant. They love it. My Atomist SDM sits there eagerly awaiting commits just to dig around in those asterisks in the hope of fixing some.
Please make my star-alignment into a code transform that’s useful to you. I went with plain string parsing here, in a very functional style for my own entertainment. We also have clever libraries for working with the compiled AST and more (watch this space).
There’s more: an SDM running in the cloud listens to GitHub (GitLab, BitBucket, GHE) and applies autofixes to everyone’s commits. And code reviews. And runs or triggers build (but only when it’s worthwhile; it looks at what’s changed). And initiates deploys (except it asks us first in Slack). There’s no setting up a pipeline for a new repository or branch; our SDM knows what to do with a code push based on the code in it.
There’s more: an SDM in the cloud also listens to issues, pull requests, builds, deploys, and other events in the software domain. It can react to all of them by talking to people in Slack, or running any other program. Whatever we do that is boring, we can automate.
This is our power as software developers. We don’t need someone to write a GUI we can click in. We don’t need to configure in YAML. We can specify what needs to happen declaratively in code.
All we needed was a framework to do the common glue-y bits for us, like triggering on events, cloning repositories, passing our functions the data they need and carrying out our actions.
The autofix in this example triggers on commit, clones the repository, passes a Project object in for a code transform function to act on, commits those changes and pushes them back to where I’m working. The framework of the SDM lets me define my own policies and implement them in code. Then the machinery of the SDM keeps them running, locally (open source) or team- or organization-wide (using Atomist’s API).
The inaugural REdeployConf wrapped up yesterday (as I write this). I’m already feeling withdrawal from intense learning and conversations. I’ll attempt to summarize them in this post.
The RE in REdeploy doesn’t mean “again” (lo, it is the first of its kind). RE stands for Resilience Engineering. It is a newish field, focused on sociotechnical systems that continue to function in shifting, surprising, always-failing-somewhere conditions (aka, reality).
John Allspaw opened the conference with: resilience is in the humans. Your software might be robust, but in the end, it does what it was told. Only humans respond in new ways to new situations. People can be prepared to be unprepared.
Resilience is the antidote to complexity. Except not a full antidote: the complexity is still there. It just doesn’t kill us. Complexity is not avoidable, because success begets complexity. A successful system has impact, and impact means interdependence, and interdependence means complexity.
What is resilience? Laura Maguire enumerated some definitions. Rebound, robustness, and graceful extensibility are partial definitions that build into the real one: Resilience is sustained adaptive capacity. It’s the ability to find new abilities, to change in response to changing conditions to maintain functioning. Resilient systems are not the same moment to moment, but they keep fulfilling their purpose (even as their purpose morphs).
Resilience Engineering is not a computer science discipline. It’s broader than that. Industries like nuclear power and air traffic control have deeper roots in the study of coping with failure. This isn’t your old-school Root Cause Analysis that asked “why did this fail?” This is systems thinking, asking “how does this succeed?” How do systems constantly subject to new failures keep running anyway? (hint: people.)
Avery Regier pointed out that root cause analysis can prevent a specific failure from recurring. But we find new failures all the time. Some new service is going to run out of space. Some new query is going to be slow. Some new customer is going to call a new API a whole lot more than we expected. Prevention is never going to cut it, so don’t spend all your resources there. Grow your powers of recovery, and you mitigate whole classes of failures.
Resilience Engineering recognizes that our systems include software and humans, so half the talks were about code and half about people. Matty Stratton extended trauma therapy to organizations, and Lee Kussmann gave strategies for personal resilience to stress (notes for both). On the code side, Cici Deng spoke about making safer changes at AWS Lambda: like most things in this science, improvement isn’t having the right answers — it’s asking better questions (notes). Aaron Blohowiak talked about speeding recovery and isolating failure domains at Netflix. Then Hannah Foxwell on HumanOps: there is no failover for You. People are more difficult to work with than software, so start there. (notes for both)
Mary Thengvall and J Paul Reed organized this conference to beget conversations, to seed a community in this space. Existing communities exist around the SNAFUcatchers and the Lund program. This new one is an open, informal camerata of people who care about resilience in humans+computer systems within the software industry.
They succeeded! The conference was a conversation: speakers referred back to prior talks. Mary and Paul emceed with commentary before and after every talk, weaving them together, sharing their reactions and enthusiasm. At the end of each day, the speakers turned into a panel for Q&A. The questions drew from and among all the talks.
Liz asked, how can we move an organization toward resilience from the bottom? Matt and Cici went back and forth over “use data” and “data won’t convince some people.” Any solution must be opt-in, and then you need to collect stories. Stories move people. When every system is different, stories are what we have. We can’t do controlled experiments. What we can do is: dig into those stories to find the causes of success. This is what researchers like Laura Maguire do.
In one of the last questions, someone asked, “Where is accountability in all this?” Cici said, we have tons of talk about accountability in our culture already. I agree; every movement is relative to the culture it is moving. Other answers suggested: Accountability is assumed, not assigned. Personal theroy: maybe accountability at the individual-human level is too narrow for the larger networks that we require to work with systems of complexity larger than a personbyte. MAYBE teams need to be accountable for working safely and effectively, and people need to be accountable to their teams.
Aaron had a lovely rant during Q&A about the “sufficiently smart engineer.” This is the hypothetical engineer who would not make such mistakes. Who would understand the existing system thoroughly. This person is a myth. Our software is too complex for one person to hold in their head. You can’t hire a sufficiently smart engineer, and don’t feel bad that you aren’t one, because it’s not a thing. Instead, we need to build systems that support our own cognitive work.
Resilience Engineering is a new science. Its research does not take place in a lab, but in the field. “We refuse to simplify.” Laura Maguire closed with a description of next steps in research. In our own jobs, we can do resilience engineering by looking for who and what makes us more safe (learn from success), by keeping the messy details instead of seeking a clean story, and by maximizing for learning in our symmathesy-teams (including software, tools, and people). For instance, when you find a “root cause” of a failure, look for other situations when that trigger occurred and failure didn’t.
Other fun stuff:
We witnessed the first open-source releases from Deere and Co.
Paul Carleton told a story of Stripe’s journey from “We should restart old EC2 instances” to “Oh look, we’re chaos engineers now.” Matt Broberg told a scary story about stopping forward motion, about ⟳technical debt and social debt⟲ at Sensu, and the perils of IRC. (notes for Matt, Paul, and Laura)
Atomist sponsored — I hope we can sponsor every edition of this conference! We work on tools to help developers integrate the social and technical parts of our systems, so it’s relevant. This was our first lanyard sponsorship and they were beautiful, in my very biased opinion.
Yesterday (as I write this) we recorded a >Code episode (#95) with Heidi Waterhouse, and she and I brought up topics from REdeploy about a dozen times. Me: “This conference is going to keep coming up, over and over, for the rest of my life.”
In Why Information Grows (my review), physicist César Hidalgo explains that the difference between the ability to produce tee shirts vs rockets is a matter of accumulating knowledge and know-how inside people, and weaving those people into networks. Because no one person can know how to build a rocket from rocks. No one person understands how a laptop computer works, at all levels. There’s a limit to the amount of knowledge a single person can cram into their head in a lifetime; Hidalgo calls this limit one personbyte.
Our software systems are more complicated than one person can hold. You can deal with this by making each system smaller (hello microservices), but then someone has to understand how they fit together. And someone cannot. It takes teams of people.
You can deal with the personbyte limitation by forming teams that work together closely. The more complex the product we’re building, the more knowledge we need, so the less duplicated knowledge we want on the team. Team size is limited too, because of coordination overhead.
If you think about it this way, you might recognize that in many software teams, our limitation is not how much we can do, but how much we can know. To change a sufficiently complex system, we need more knowledge than one or two people can hold. Otherwise we are very slow, or we mess it up and the unintended effects of our change create a ton more work.
If the limitation of my team is how much we can know, then mob programming makes a ton of sense. In mob programming, the whole team applies their knowledge to the problem, and each person takes turns typing. This way we’re going to do the single most important thing in the most efficient and least dangerous way our collective knowledge and know-how can provide. In the meantime we spread knowledge among the team, making everyone more effective.
If your software system has reached the point where changing it is one step forward, two steps back — it might be a good time to try working as one unit, mob-programming style.
The other day in Iceland, a tiny conference on the Future of Software Development opened with Michael Feathers addressing a recurring theme: complexity. Software development is drowning in accidental complexity. How do we fight it? he asks. Can we embrace it? I ask.
Complexity: Fight it, or fight through it, or embrace it? Yes.
Here, find tidbits from the conference to advance each of these causes, along with photographs from a beautiful cemetery I walked through in Reykjavik.
One way we resist complexity: keep parts small, by creating strong boundaries and good abstractions. Let each part change separately. The question is, what happens outside these boundaries?
Feathers complained that developers spend about ten percent of their time coding, and the rest of it configuring and hooking together various tools. This makes sense to me; we’ve optimized programming languages and libraries so that coding takes less time, and we’ve built components and services so that we don’t have to code queuing or caching or networking or databases. Hooking these together is our job as software developers. Personally, I want to do more of that work in code instead of screens or configuration, which is part of my mission at Atomist. Josh Stella of Fugue also says we should be programming the cloud, not configuring it.
Paul Biggar at Dark has another way to attack complexity: wall it off. Cross the boundaries, so that developers don’t have to. Or as Jason Warner put it, “do a lot more below the waterline.” The Dark programming system integrates the runtime and the database and the infrastructure and the code, so that system developers can respond to what happens in the real world, and change the whole system at once. This opens backend development to a whole realm of people who don’t have time to learn the a dozen parts and their interconnections. The Future of Software Development is: more people will be doing it! People whose primary job is not coding.
In any industry, we can fight complexity through centralization. If everyone uses GitHub, then we don’t have to integrate with other source code management. Centralization is an optimization, and the tradeoffs are risk and stagnation. Barriers to entry are high, options are limited, and growth is in known dimensions (volume) not new ones (ideas).
Decentralization gives us choices, supports competing ideas, and prevents one company from have enough power to gain all the power. Blockchain epitomizes this. As Manuel Araoz put it: “Blockchain is adding intentional inefficiency to programming” in order to prevent centralization.
Centralization is also rear-facing: this thing we know how to do, let’s do it efficiently. Decentralization is forward-facing: what do we not yet know how to do, but could?
Building one thing very well, simply, is like building a stepping stone through the water we’re drowning in. But stones don’t flow. Exploration will always require living in complexity.
Fight through it
Given that complexity surrounds us, as it always will when we’re doing anything new, can we learn more ways to cope with it?
To swim forward in this complexity, we need our pieces to be discoverable, to be trouble-shoot-able, and to be experiment-with-able.
Keith Horwood from stdlib is working on the democratization of APIs, those pieces of the internet that make something happen in the real world. They’re making APIs easy to document and standardize. Stdlib aims to supplement, not replace developer workflows: the tools pile higher, and this is normal. Each tool represents a piece of detailed know-how that we can acquire without all the details.
Keith Horwood and Rishabh Singh both pointed out that humans/programmers go from seeing/reading, to speaking/writing, and then to executing/programming: we observe the world, we speak into the world, and then we change the world. (I would argue that we hop quickly to the last step, both as babies and as developers.) To learn how to change a complex system is to change it, see what happens, change it again.
We use type systems and property tests to reason about what can’t happen. Example tests and monitoring reassure us what happens in anticipated circumstances. We get to deduce what really does happen from observability and logs.
If we accept that we are part of this complex system that includes our code, perhaps we can find buoyancy: we can sail instead of drown.
Complexity is not anarchy; when it gels, a complex system is a higher form of order. It is an order that operates not in linear deductions, but in circles and spirals. These circles operate both within the system and between the system and its environment.
Feathers and I both spoke about the symbiosis of a development team with its code and its tools. I call it a symmathesy. We learn from our code, from the clues it leaves us in data and logs; and our code learns from us, as we change it. Both these forms of communication happen only through other software: observability tools to see what is happening in the software, and delivery tools to change what will happen. Once we view the system at this level, we can think about growing our whole team: people, running software, tools for visibility and control.
Rishabh Singh, Miltos Allamanis, and Eran Yahav showed machine-learning backed tooling that makes programs that offer useful suggestions to humans who are busy instructing the computer. The spiral goes higher.
Kent Beck said that nothing has higher leverage than making a programming system while using those same tools to build a system. His talk suggested that we: (1) make very small changes in our local system; (2) let those changes propagate outwards gradually; and (3) reflect on what happened, together. We learn from the system we are changing, and from each other.
McLuhan’s Law: We shape our tools, and then our tools shape us.
Our tools don’t shape our behavior violently and inflexibly, the way rules and punishment do. They shape us by changing the probability of each behavior. They change what is easy. This is part of my mission at Atomist: enable more customization of our own programming system, and more communication between our tools and the people in the symmathesy.
As developers, we are uniquely able to shape our own world and therefore ourselves by changing our tools. Meanwhile, other people are gaining some of this leverage, too.
I believe there will be a day when no professional says “I can’t code” — only “coding is not my specialty.” Everyone will write small programs, what Ben Scofield calls idiomatic software, what I call personal automation. These programs will remain entwined with their author/users; we won’t pretend that the source code has value outside of this human context. (Nathan Herald had a good story about this, about a team that tried to keep using a tool after the not-professional-developer who wrote it left.)
This is a problem in development teams, when turnover is high. “Everyone who touches it is reflected in the code.” (who said that? Rajeev maybe?) I don’t have a solution for this.
The path forward includes more collaboration between humans and computers, and between computers and each other, guided by humans. It includes building solid steps on familiar ground, swimming lessons for exploration, and teamwork in the whole sociotechnical system so that we can catch the winds of complexity and make them serve us.