Tiny dramas, tiny deploys

It is better to practice risky things often and in small chunks with a limited blast radius, rather than to avoid risky things.

Charity Majors, “Test in production? Yes

Charity is writing about deploys. Not-deploying may be safer for tonight, but in the medium term it leads to larger deploys and bigger, trickier failures.

In the long term, slow change means losing relevance and going out of business.

In relationships, the same applies. If I have some feeling or fact that my partner might not like, I can say it or not. It never feels like the right time to say it. There is no “right time,” there is only now. There is positive reinforcement for holding back, because then our evening continues pleasantly. No drama.

This leads to an accumulation of feelings and facts they don’t know about. Then when it does become urgent to talk about those, they react with feelings of betrayal: Why didn’t you tell me about this sooner?

In the long term, lack of sharing means growing apart and breaking up.

My new strategy in relationships is: tiny dramas, all the time. The more tiny dramas we have, the fewer big dramas. Also we get practice at handling drama in a way that is safe, because it’s minor. I take any mental question of “should I say this?” as a clue, an opportunity! Yes, say it. Unless it’s a really bad time, it’s the best time.

And the complementary strategy: whenever my partner tells me something scary, like something I did that they don’t like or some feeling they had that might upset me, my first response is “Thank you.” Usually it is not a drama anyway, it’s fine. When I do have feelings about it, we can talk about them. Reassurance helps a lot, especially when I recognize and appreciate the risk they took by telling me in this moment.

If a small deploy causes failure, please respond with “Thank you for not making this part of a bigger deploy.”

We have built a glass castle, where we ought to have a playground.

Charity again, on our lack of safe tooling and therefore fear of production

Code and Coders: components of the sociotechnical system

TL;DR: Study all the interactions between people, code, and our mental models; gather data and we can make real improvements instead of guessing in our retros.

Software is hard to change. Even when it’s clean, well-factored, and everyone working on it is sharp and nice. Why?

Consider a software team and its software. It’s a sociotechnical system; people create the code and the code affects the people.

a blob of code and several people, with two-way arrows between the code and the people, and among the people

When we want to optimize this system to produce more useful code, what do we do? How do we make the developer⟺code interactions more productive?

the sociotechnical system, highlight on each person

As a culture, we started by focusing on the individual people: hire those 10x developers! As the software gets more complex, that doesn’t go far. An individual can only do so much.

the sociotechnical system, highlight on the arrows between people

The Agile software movement shifted the focus to the interactions between the people. This lets us make improvements at the team level.

the sociotechnical system, highlight on the blob of code

The technical debt metaphor let us focus on how the code influences the developers. Some code is easier to change than other code.

We shape our tools, and thereafter our tools shape us. – McLuhan

the sociotechnical system, highlight on the arrows reaching the code

Test-driven development focuses on a specific aspect of the developer⟺code interaction: tightening the feedback loop on “will this work as I expected?” Continuous Integration has a similar effect: tightening the feedback loop on “will this break anything else?”

All of these focuses are useful in optimizing this system. How can we do more?

Thereʼs a component in this system that we haven’t explicitly called out yet. It lives in the heads of the coders. Itʼs the developerʼs mental model of the software.

a blob of code and two people. The people have small blobs in their heads. two-way arrows between the code and the small blobs, and between the people
Each developerʼs mental model of the software matches the code (or doesn’t)

Every controller must contain a model of the process being controlled.
Nancy Leveson, Engineering a Safer World

When you write a program, you have a model of it in your head. When you come to modify someone else’s code, you have to build a mental model of it first, through reading and experimenting. When someone else changes your code, your mental model loses accuracy. Depending on the completeness and accuracy of your mental model of the target software, adding features can be fun and productive or full of pain.

Janelle Klein models the developer⟺code interaction in her book Idea Flow.  We want to make a change, so we look around for a bit, then try something. If that works, we move forward (the Confirm loop). If it doesn’t work, we shift into troubleshooting mode: we investigate, then experiment until we figure it out (the Conflict loop). We update our mental model. When weʼre familiar with the software, we make forward progress (Confirm). When weʼre not, pain! From the book:

to make a change, start with learn; modify; validate. If the validation works, Confirm! back to learn. If the validation is negative, Conflict! on to troubleshooting; rework; validate.

That 10x developer is the one with a strong mental model of this software. Probably they wrote it, and no one else understands it. Agile (especially pairing) lets us transfer our mental model to others on the team. Readable code makes it easier for others to construct an accurate mental model. TDD makes that Confirm loop happen many more times, so that Conflict loops are smaller.

We can optimize this developer⟺code interaction by studying it further. Which parts of the code cause a lot of conflict pain? Focus refactoring there. Who has a strong mental model of each part of the system, and who needs that model? Pair them up.

Idea Flow includes tools for measuring friction, for collecting data on the developer⟺code interaction so we can address these problems directly. Recording the switch from Confirm to Conflict tells us how much of our work is forward progress and how much is troubleshooting, so we can recognize when we’re grinding.

Even better, we have data on the causes of the grinding.

We can reflect and choose actions based on what’s causing the most pain, rather than on gut feel of what we remember on the day of the retrospective.
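To make that concrete, here is a minimal sketch of the kind of tally such data allows. The log format (maps with :mode, :minutes and :area) is an assumption made up for illustration; it is not the format Idea Flow's tools use.

```clojure
;; Hypothetical log: one entry per stretch of work. Not Idea Flow's own format.
(def work-log
  [{:mode :confirm  :minutes 30 :area "billing"}
   {:mode :conflict :minutes 45 :area "billing"}
   {:mode :confirm  :minutes 20 :area "search"}
   {:mode :conflict :minutes 5  :area "search"}])

(defn minutes-by
  "Sum :minutes, grouped by the value found under key k."
  [k entries]
  (reduce (fn [totals entry]
            (update totals (k entry) (fnil + 0) (:minutes entry)))
          {}
          entries))

(defn conflict-fraction
  "Share of logged time spent troubleshooting rather than moving forward."
  [entries]
  (let [{:keys [confirm conflict] :or {confirm 0 conflict 0}} (minutes-by :mode entries)
        total (+ confirm conflict)]
    (if (zero? total) 0 (/ conflict total))))

(defn conflict-minutes-by-area
  "Where the troubleshooting time went."
  [entries]
  (minutes-by :area (filter #(= :conflict (:mode %)) entries)))

(conflict-fraction work-log)         ;; => 1/2
(conflict-minutes-by-area work-log)  ;; => {"billing" 45, "search" 5}
```

With numbers like these in hand, a retro can start from "billing ate 45 minutes of troubleshooting" instead of from whatever people happen to remember.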

Picturing those internal models as part of the sociotechnical system changes my actions in subtle ways. For instance I now:

  • observe which of my coworkers are familiar with each part of the system.
  • refactor and then throw it away, because that improves my mental model without damaging anyone else’s.
  • avoid writing flexible code if I don’t need it yet, because alternatives inflate the mental model other people have to build.
  • spend more time reviewing PRs in order to keep my model up-to-date.

We can’t do this by focusing on people or code alone. We have to optimize for learning. Well-factored code can help, but it isn’t everything. Positive personal interactions help, but they aren’t everything. Tests are only one way to minimize conflict. No individual skill or familiarity can overcome these challenges.

If we capture and optimize our conflict loops, consciously and with data, we can optimize the entire sociotechnical system. We can make collaborative decisions that let us change our software faster and faster.


How important is correctness?

This is a raging debate in our industry today. I think the answer depends strongly on the kind of problem a developer is trying to solve: is the problem contracting or expanding? A contracting problem is well-defined, or has the potential to be well-defined with enough rigorous thought. An expanding problem cannot; as soon as you’ve defined “correct,” you’re wrong, because the context has changed.

A contracting problem: the more you think about it, the clearer it becomes. This includes anything you can define with math, or a stable specification: image conversion, file compression. There are others: ones we’ve solved so many times or used so many ways that they stabilize: web servers, grep. The problem space is inherently specified, or it has become well-defined over time.
Correctness is possible here, because there is such a thing as “correct.” These programs are useful to many people, so correctness is worth effort. Use of such a program or library is freeing: it scales up the capacity of the industry as a whole, as this becomes something we don’t have to think about.

An expanding problem: the more you think about it, the more ways it can go. This includes pretty much all business software; we want our businesses to grow, so we want our software to do more and different things with time. It includes almost all software that interacts directly with humans. People change, culture changes, expectations get higher. I want my software to drive change in people, so it will need to change with us.
There is no complete specification here. No amount of thought and care can get this software perfect. It needs to be good enough, it needs to be safe enough, and it needs to be amenable to change. It needs to give us the chance to learn what the next definition of “good” might be.

I propose we change our aim for correctness to an aim for safety. Safety means, nothing terrible happens (for your business’s definition of terrible). Correctness is an extreme form of safety. Performance is a component of safety. Security is part of safety.

Tests don’t provide correctness, yet they do provide safety. They tell us that certain things aren’t broken yet. Process boundaries provide safety. Error handling, monitoring, everything we do to compensate for the inherent uncertainty of running software in production, all of these help enforce safety constraints.

In an expanding software system, business matters (like profit) determine what is “good enough.” Risk tolerance goes into what is “safe enough.” Optimizing for the future means optimizing our ability to change.

In a contracting solution, we can progress through degrees of safety toward correctness and optimal performance. Break out the formal specification, write great documentation.

Any piece of our expanding system that we can break out into a contracting problem space, win. We can solve it with rigor, even make it eligible for reuse.

For the rest of it – embrace uncertainty, keep the important parts working, and make the code readable so we can change it. In an expanding system, where tests are limited and limiting, documentation becomes more wrong every day, the code is the specification. Aim for change.

Property Testing in Elm

Elm is perfectly suited to property testing, with its delightful data-in–data-out functions. Testing in Elm should be super easy.

The tooling isn’t there yet, though. This post documents what was necessary today to get a property to run in Elm.

Step 1: elm-test

This includes an Elm library and a node module for a command-line runner. The library alone will let you create a web page of test results and look at it, but I want to run them in my build script and see results in my terminal.

Installation in an existing project:

elm package install deadfoxygrandpa/elm-test
npm install -g elm-test

The node module offers an “elm test init” functionality to put some test files in the current directory: TestRunner.elm (which is the Main module for test runs[1]) and Tests.elm, which holds actual tests. I also found it necessary to:

  • create a test directory (I don’t want tests in my project home), and move the TestRunner.elm and Tests.elm files there.
  • add that test directory to the source directories in elm-package.json

Step 2: elm-check

The first thing to know is: which elm-check to install. You need the one from NoRedInk:

elm package install NoRedInk/elm-check

The next thing is: what to import. Where do all those methods used in the README live?

Here is a full program that lets elm-test execute the properties from the elm-check readme.
TL;DR: You need to import stuff from Check and Check.Producer for all properties; and for the runner program, ElmTest, Check.Test, Signal, Console, and Task.

Name it test/Properties.elm and run it with

elm test test/Properties.elm

The output looks like

Successfully compiled test/Properties.elm
Running tests…
  1 suites run, containing 2 tests
  All tests passed

Here’s the full text just in case.

module Main (..) where
import ElmTest
import Check exposing (Evidence, Claim, that, is, for)
import Check.Test
import Check.Producer as Producer
import List
import Signal exposing (Signal)
import Console exposing (IO)
import Task

console : IO ()
console =
  ElmTest.consoleRunner (Check.Test.evidenceToTest evidence)

port runner : Signal (Task.Task x ())
port runner =
  Console.run console

myClaims : Claim
myClaims =
  Check.suite
    "List Reverse"
    [ Check.claim
        "Reversing a list twice yields the original list"
        `that` (\list -> List.reverse (List.reverse list))
        `is` identity
        `for` Producer.list Producer.int
    , Check.claim
        "Reversing a list does not modify its length"
        `that` (\list -> List.length (List.reverse list))
        `is` (\list -> List.length list)
        `for` Producer.list Producer.int
    ]

evidence : Evidence
evidence =
  Check.quickCheck myClaims

How to write properties is a post for another day. For now, at least this will get something running.

See also: a helpful post for running elm-check in phantom.js

[1] How does that even work? I thought modules needed the same name as their file name. Apparently this is not true of Main. You must name the module Main. You do not have to have a ‘main’ function in there (as of this writing). The command-line runner needs the ‘console’ function instead.

Ultratestable Coding Style

Darn side-effecting programs. Programs that change things in the outside world are so darn useful, and such a pain to test.
what's better than green? Ultra!

For every piece of code, there is another piece of code that answers the question, “How do I know that code works?” Sometimes that’s more work than the code itself — but there is hope.

The other day, I made a program to copy some code from one project to another – two file copies, with one small change to the namespace declaration at the top of each file. Sounds trivial, right?

I know better: there are going to be a lot of subtleties. And this isn’t throwaway code. I need good, repeatable tests.

Where do I start? Hmm, I’ll need a destination directory with the expected structure, an empty source directory, files with the namespace at the top… oh, and cleanup code. All of these are harder than I expected, and the one test I did manage to write is specific to my filesystem. Writing code to verify code is so much harder than just writing the code!

Testing side-effecting code is hard. This is well established. It’s also convoluted, complex, generally brittle.
The test process looks like this:

input to code under test to output, but also prep the files in the right place and clear old files out, then the code under test does read & write on the filesystem, then check that the files are correct

Before the test, create the input AND go to the filesystem, prepare the input and the spot where output is expected.
After the test, check the output AND go to the filesystem, read the files from there and check their contents.
Everything is intertwined: the prep, the implementation of the code under test, and the checks at the end. It’s specific to my filesystem. And it’s slow. No way can I run more than a few of these each build.

The usual solution to this is to mock the filesystem. Use a ports-and-adapters approach. In OO you might use dependency injection; in FP you’d pass functions in for “how to read” and “how to write.” This isolates our code from the real filesystem. Tests are faster and less tightly coupled to the environment. The test process looks like this:

Before the test, create the input AND prepare the mock read results and initialize the mock for write captures.
After the test, check the output AND interrogate the mock for write captures.
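In Clojure, that pass-the-functions-in style might look roughly like the sketch below. The function copy-with-namespace and its arguments are made up for illustration; this is not the code from my project.

```clojure
(require '[clojure.string :as str])

;; Hypothetical code under test: "how to read" and "how to write" are arguments.
(defn copy-with-namespace [read-file write-file! src dest new-ns]
  (let [[_old-ns-line & other-lines] (str/split-lines (read-file src))]
    (write-file! dest (str/join "\n" (cons (str "(ns " new-ns ")") other-lines)))))

;; Test setup: a map stands in for the reader, an atom captures the writes.
(def fake-files {"src/a.clj" "(ns old.a)\n(def x 1)"})
(def writes (atom []))

(copy-with-namespace fake-files
                     (fn [path contents] (swap! writes conj [path contents]))
                     "src/a.clj" "dest/a.clj" "new.a")

;; After the call, interrogate the mock for what it captured.
@writes  ;; => [["dest/a.clj" "(ns new.a)\n(def x 1)"]]
```

Outside of tests, the same function could be called with slurp and spit as the reader and writer.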

It’s an improvement, but we can do better. The test is still convoluted. Elaborate mocking frameworks might make it cleaner, but conceptually, all those ties are still there, with the stateful how-to-write that we pass in and then ask later, “What were your experiences during this test?”

If I move the side effects out of the code under test — gather all input beforehand, perform all writes afterward — then the decisionmaking part of my program becomes easier and clearer to test. It can look like this (code):

The input includes everything my decisions need to know from the filesystem: the destination directory and list of all files in it; the source directory and list plus contents of all files in it.
The output includes a list of instructions, for the side effects the code would like to perform. This is super easy to check at the end of a test.
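A minimal sketch of that shape, under assumed names: the decide function, the input keys, and the instruction maps here are all hypothetical, chosen for illustration rather than copied from the real program.

```clojure
(require '[clojure.string :as str])

;; Input: plain data describing what the filesystem looked like.
;; Output: plain data describing the writes we would like to happen.
(defn decide [{:keys [source-files dest-dir old-ns new-ns]}]
  (for [{:keys [filename contents]} source-files]
    {:instruction :write
     :path        (str dest-dir "/" filename)
     :contents    (str/replace-first contents old-ns new-ns)}))

(decide {:dest-dir "dest"
         :old-ns   "old.lib"
         :new-ns   "new.lib"
         :source-files [{:filename "a.clj"
                         :contents "(ns old.lib.a)\n(def x 1)"}]})
;; => ({:instruction :write, :path "dest/a.clj", :contents "(ns new.lib.a)\n(def x 1)"})
```

Checking the result is one equality assertion on data; no mocks to set up and nothing to clean up afterward.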

The real main method looks different in this design. It has to gather all the input up front[1], then call the key program logic, then carry out the instructions. In order to keep all the decisionmaking, parsing, etc in the “code under test” block, I keep the interface to that function as close as possible to that of the built-in filesystem-interaction commands. It isn’t the cleanest interface, but I want all the parts outside “code-under-test” to be trivial.

simplest possible code to gather input, to well-tested code that makes all the decisions, to simplest-possible code to carry out instructions.

With this, I answer “How do I know this code works?” in two components. For the real-filesystem interactions, the documentation plus some playing around in the REPL tell me how they work. For the decisioning part of the program, my tests tell me it works. Manual tests for the hard-to-test bits, lots of tests for the hard-to-get-right bits. Reasoning glues them together.

Of course, I’m keeping my one umbrella test that interacts with the real filesystem. The decisioning part of the program is covered by poncho tests. With an interface like this, I can write property-based tests for my program, asserting things like “I never try to write a file in a directory that doesn’t exist” and “the output filename always matches the input filename.”[2]

As a major bonus, error handling becomes more modular. If, on trying to copy the second file, it isn’t found or isn’t valid, the second write instruction is replaced with an “error” instruction. Before any instructions are carried out, the program checks for “error” anywhere in the list (code). If found, stop before carrying out any real action. This way, validations aren’t separated in code from the operations they apply to, and yet all validations happen before operations are carried out. Real stuff happens only when all instructions are possible (as far as the program can tell). It’s close to atomic.
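The check-before-acting step can be sketched like this. The instruction maps and the carry-out! function are assumptions for illustration; perform! stands for whatever does one real side effect.

```clojure
;; Hypothetical instruction format: maps with an :instruction key.
(defn errors [instructions]
  (filter #(= :error (:instruction %)) instructions))

(defn carry-out!
  "Perform every instruction, or none of them if any is an :error."
  [perform! instructions]
  (if-let [problems (seq (errors instructions))]
    {:status :aborted :errors (vec problems)}
    (do (run! perform! instructions)
        {:status :done})))

(def performed (atom []))
(def good-plan [{:instruction :write :path "dest/a.clj"}])
(def bad-plan  (conj good-plan {:instruction :error :message "b.clj not found"}))

(carry-out! #(swap! performed conj %) bad-plan)
;; => {:status :aborted, :errors [{:instruction :error, :message "b.clj not found"}]}
;; and @performed is still empty: the valid write was not attempted either.
```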

There are limitations to this straightforward approach to isolating decisions from side-effects. It works for this program because it can gather all the input, produce all the output, and hold all of it in memory at the same time. For a more general approach to this same goal, see Functional Programming in Scala.

Moving all the “what does the world around me look like?” side effects to the beginning of the program, and all the “change the world around me!” side effects to the end of the program, we achieve maximum testability of program logic. And minimum convolution. And separation of concerns: one module makes the decisions, another one carries them out. Consider this possibility the next time you find yourself in testing pain.

The code that inspired this approach is in my microlib repository.
Interesting bits:
Umbrella test (integration)
Poncho tests (around the decisioning module) (I only wrote a few. It’s still a play project right now.)
Code under test (decisioning module)
Main program
Instruction carrying-out part

Diagrams made with Monodraw. Wanted to paste them in as ASCII instead of screenshots, but that’d be crap on mobile.

[1] This is Clojure, so I put the “contents of each file” in a delay. Files whose contents are not needed are never opened.
[2] I haven’t written property tests, because time.

The Quality Wheel

“Quality software.” It means something different to everyone who hears it.

You know quality when you see it, right? Or maybe when you smell it. Like a good perfume. Perfume preferences are different for everyone, and quality means something different for every application.

In perfume, we can discover and describe our preferences using the Fragrance Wheel. This is a spectrum of scent categories, providing a vocabulary for describing each perfume, the attributes of a scent.

Floral notes (Floral, Soft Floral); Oriental notes (Floral Oriental, Soft Oriental, Woody Oriental); Woody notes (Mossy woods, dry woods); Fresh notes (citrus, green, water)

Perhaps a similar construction could help with software quality?

When a developer talks about quality, we often mean code consistency and readability, plus automated testing. A tester means lack of bugs. A designer means a great UI, a user means great experience and exactly the right features and lack of errors or waiting. An analyst means insightful reporting and the right integrations, a system administrator means low CPU usage and consistent uptime and informative logging. Our partners mean well-documented, discoverable APIs and testing tools.

Usability (Features, Discoverability, User Experience); Performance (Responsiveness, Availability, Scalability); Flexibility (Speed of Evolution, Configurability); Correctness (Visibility, Automated Tests, Accuracy)

Each of these is an attribute of quality. For any given software system and for each component, different quality attributes matter most. What’s more, some aspects of quality complement each other, each making the other easier – for instance, a good design facilitates a great user experience. Readable code facilitates lack of bugs. Consistent uptime facilitates lack of waiting. Beautiful (consistent, modular, readable) code facilitates all the externally-visible aspects of quality.

However, other aspects of quality are in conflict. Quantity of features hurts code readability. More integrations lead to more error messages. Logging can increase response time.

If we add nuance to our vocabulary, we can discuss quality with more detail, less ambiguity. We can decide which attributes are essential to our software system, and to each piece of our system. Make the tradeoffs explicit, and allocate time and attention to carefully chosen quality attributes. This gets our system closer to something even greater: usefulness.

The quality wheel pictured above is oversimplified; it’s designed to parallel the original version of the Fragrance Wheel. I have a lot more quality attributes in mind. I’d love to have definitions of each piece, along with Chinese-Zodiac-style “compatible with/poor match” analysis. If this concept seems useful to you, please contribute your opinions in the comments, and we can expand this together.

Fun with Optional Typing: cheap mocking

For unit tests, it’s handy to mock out side-effecting functions so they don’t slow down tests.[1] Clojure has an easy way to do this: use with-redefs to override function definitions, and then any code within the with-redefs block uses those definitions instead.
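A minimal sketch of the mechanism, with a made-up count-orders caller and a stand-in body for fetch-orders (both are assumptions for illustration):

```clojure
;; Stand-in for a slow, side-effecting function.
(defn fetch-orders [start end]
  (throw (ex-info "would hit the real order service" {:start start :end end})))

;; Hypothetical code under test, which calls it.
(defn count-orders [start end]
  (count (fetch-orders start end)))

;; Inside the block, every caller sees the replacement definition.
(def result
  (with-redefs [fetch-orders (fn [_start _end] [{:id 1} {:id 2}])]
    (count-orders "2014-01-01" "2014-01-02")))

result  ;; => 2
```

Once the block exits, the original definition is back, so the override can’t leak into other tests.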

To verify the input of the side-effecting function, I can override it with something that throws an exception if the input is wrong.[2] A quick way to do that is to check the input against a schema.

That turns out to be kinda pretty. For instance, if I need to override this function fetch-orders, I can enforce that it receives exactly the starting-date I expect, and a second argument that is not specified precisely, but still meets a certain condition.

(with-redefs [fetch-orders (s/fn [s :- (s/eq starting-date)
                                  e :- AtLeastAnHourAgo]
… )

Here, the s/fn macro creates a function that (when validation is activated[3]) checks its input against the schemas specified after the bird-face operator. The “equals” schema-creating function is built-in; the other I created myself with a descriptive name. The overriding function is declarative, no conditionals or explicit throwing or saving mutable state for later.

If I have a bug that switches the order of the inputs, this test fails. The exception that comes out isn’t pretty.

expected: (= expected-result (apply function-under-test input))
  actual: clojure.lang.ExceptionInfo: Input to fn3181 does not match schema: [(named (not (= # a-org.joda.time.DateTime)) s) nil]

Schema isn’t there yet on pretty errors. But hey, my test reads cleanly, it was simple to write, and I didn’t bring in a mocking framework.

See the full code (in the literate-test sort of style I’m experimenting with) on github.

[1] for the record, I much prefer writing code that’s a pipeline, so that I only have to unit-test data-in, data-out functions. Then side-effecting functions are only tested in integration tests, not mocked at all. But this was someone else’s code I was adding tests around.

[2] Another way to check the input is to have the override put its input into an atom, then check what happened during the assertion portion of the test. Sometimes that is cleaner.

[3] Don’t forget to (use-fixtures :once schema.test/validate-schemas) 

Fun with Optional Typing: narrowing errors

After moving from Scala to Clojure, I miss the types. Lately I’ve been playing with Prismatic Schema, a sort of optional typing mechanism for Clojure. It has some surprising benefits, even over Scala’s typing sometimes. I plan some posts about interesting ones of those, but first a more ordinary use of types: locating errors.

Today I got an error in a test, and struggled to figure it out. It looked like this:[1]

expected: (= [expected-conversion] result)
  actual: (not (= [{:click {:who {:uuid “aeiou”}, :when #}, :outcome {:who {:uuid “aeiou”}, :when #, :what “bought 3 things”}}] ([{:click {:who {:uuid “aeiou”}, :when #}, :outcome {:who {:uuid “aeiou”}, :when #, :what “bought 3 things”}}])))

Hideous, right? It’s super hard to see what’s different between the expected and actual there. (The colors help, but the terminal doesn’t give me those.)

It’s hard to find the difference because the difference isn’t content: it’s type. I expected a vector of a map, and got a list of a vector of a map. Joy.

I went back and added a few schemas to my functions, and the error changed to

  actual: clojure.lang.ExceptionInfo: Output of calculate-conversions-since does not match schema: [(not (map? a-clojure.lang.PersistentVector))]

This says my function output was a vector of a vector instead of a map. (This is one of Schema’s more readable error messages.)

Turns out (concat (something that returns a vector)) doesn’t do much; I needed to (apply concat to-the-vector).[2]
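A small demonstration of the difference, using made-up data shaped like the failing output above:

```clojure
;; A vector containing one vector of maps, like my function's intermediate result.
(def nested [[{:uuid "aeiou"}]])

;; concat with a single argument just turns it into a seq: still one level too deep.
(concat nested)        ;; => ([{:uuid "aeiou"}])

;; apply spreads the inner vectors out as separate arguments, so they get joined.
(apply concat nested)  ;; => ({:uuid "aeiou"})
```

The first result is exactly the “list of a vector of a map” that the test complained about.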

Clojure lets me keep the types in my head for as long as I want. Schema lets me write them down when they start to get out of hand, and uses them to narrow down where an error is. Even after I spotted the extra layer of sequence in my output, it could have been in a few places. Adding schemas pointed me directly to the function that wasn’t doing what I expected.

The real point of types is that they clarify my thinking and document it at the same time. They are a skeleton for my program. I like Clojure+Schema because it lets me start with a flexible pile of clay, and add bones as they’re needed.

[1] It would be less ugly if humane-test-output were activated, but I’m having technical difficulties with that at the moment.
[2] here’s the commit with the schemas and the fix.

A monadically built generator

Today, I wanted to write a post about code that sorts a vector of maps. But I can’t write that without a test, now can I? And not just any test — a property-based test! I want to be sure my function works all the time, for all valid input. Also, I don’t want to come up with representative examples – that’s too much work.[1]

The function under test is a custom-sort function, which accepts a bunch of rows (represented as a sequence of hashmaps) and a sequence of instructions: “sort by the value of A, descending; then the value of B, ascending.”
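For orientation, here is one way such a function could look. This is a sketch under assumptions, not the real function under test: the name custom-sort and the [key direction] instruction format are made up for illustration.

```clojure
;; Hypothetical instruction format: [key direction] pairs, most significant first.
(defn instruction->comparator [[k direction]]
  (fn [a b]
    (let [c (compare (get a k) (get b k))]
      (if (= direction :descending) (- c) c))))

(defn custom-sort [instructions rows]
  (let [comparators (map instruction->comparator instructions)]
    (sort (fn [a b]
            ;; the first instruction that tells the rows apart decides the order
            (or (first (remove zero? (map #(% a b) comparators))) 0))
          rows)))

(custom-sort [[:a :descending] [:b :ascending]]
             [{:a 1 :b 2} {:a 2 :b 9} {:a 2 :b 1}])
;; => ({:a 2, :b 1} {:a 2, :b 9} {:a 1, :b 2})
```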

To test with all valid input, I must write code to generate all valid input. I need a vector of maps. The maps should have all the same keys. Some of those keys will be sort instructions. The values in the map can be anything Comparable: strings and ints for instance. Each instruction also includes a direction, ascending or descending. That’s a lot to put together.

For property-based (or “generative”) tests in Clojure, I’ll use test.check. To test a property, I must write a generator that produces input. How do I even start to create a generator this complicated?

Bit by bit! Start with the keys for the maps. Test.check has a generator for them:

(require '[clojure.test.check.generators :as gen])
gen/keyword ;; any valid clojure keyword.

The zeroth secret: I dug around in the source to find useful generators. If it seems like I’m pulling these out of my butt, well, this is what I ate.

Next I need multiple keywords, so add in gen/vector. It’s a function that takes a generator as an argument, and uses that repeatedly to create each element, producing a vector.

(gen/vector gen/keyword) ;; between 0 and some keywords

The first secret: generator composition. Put two together, get a better one out.

Since I want a set of keys, not a vector, it’s time for gen/fmap (“functor map,” as opposed to hashmap). That takes a function to run on each produced value before giving it to me, and its source generator.

(gen/fmap set (gen/vector gen/keyword)) ;; set of 0 or more keywords

It wouldn’t do for that set to be empty; my function requires at least 1 instruction, which means at least one keyword. gen/such-that narrows the possible output of the generator. It takes a predicate and a source generator:

(gen/such-that seq (gen/fmap set (gen/vector gen/keyword)))

If you’re not a seasoned Clojure dev: seq is idiomatic for “not empty.” Historical reasons.
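A quick illustration in plain Clojure, no generators involved:

```clojure
(seq #{})    ;; => nil, which is falsey
(seq #{:a})  ;; => (:a), which is truthy

;; So seq works as a "keep the non-empty ones" predicate:
(filter seq [#{} #{:a} #{}])  ;; => (#{:a})
```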

This is enough to give me a set of keys, but it’s confusing, so I’m going to pull some of it out into a named function.

(defn non-empty-set [elem-g]
  (gen/such-that seq (gen/fmap set (gen/vector elem-g))))

Here’s the generator so far:
(def maps-and-sort-instructions
  (let [set-of-keys  (non-empty-set gen/keyword)]
    set-of-keys))

See what it gives me:
=> (gen/sample maps-and-sort-instructions)
   ;; sample makes the generator produce ten values
(#{:Os} #{:? :f_Q_:_kpY:+:518} #{:? :-kZ:9_:_?Ok:JS?F} ….)

Ew. Nasty keywords I never would have come up with. But hey, they’re sets and they’re not empty.

To get maps, I need gen/hash-map. It wants keys, plus generators that produce values; from these it produces maps with a consistent structure, just like I want. It looks like:

(gen/hash-map :one-key gen-of-value :two-key gen-of-this-other-value …)

The value for each key could be anything Comparable really; I’ll settle for strings or ints. Later I can add more to this list. There’s gen/string and gen/int for those; I can choose among them with gen/elements.

(gen/elements [gen/string gen/int]) ;; one of the values in the input vector

I have now created a generator of generators. gen/elements is good for selecting randomly among a known sequence of values. I need a quantity of these value generators, the same quantity as I have keys.

(gen/vector (gen/elements [gen/string gen/int]) (count #??#)) 
  ;; gen/vector accepts an optional length

Well, crap. Now I have a dependency on what I already generated. Test.check alone doesn’t make this easy – you can do it, with some ugly use of gen/bind. Monads to the rescue! With a little plumbing, I can bring in algo.monads, and make the value produced from each generator available to the ones declared after it.
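For the curious, here is roughly what a gen/bind-only version could look like. This is a sketch, not the code from the post’s repository: keys-and-value-gens is a name made up for illustration, and non-empty-set is repeated from above so the block stands alone.

```clojure
(require '[clojure.test.check.generators :as gen])

;; Same helper as defined earlier in the post.
(defn non-empty-set [elem-g]
  (gen/such-that seq (gen/fmap set (gen/vector elem-g))))

;; gen/bind takes a generator plus a function from its produced value
;; to the next generator, so each dependency means one more nested fn.
(def keys-and-value-gens   ;; hypothetical name, for illustration only
  (gen/bind (non-empty-set gen/keyword)
            (fn [set-of-keys]
              (gen/bind (gen/vector (gen/elements [gen/string gen/int])
                                    (count set-of-keys))
                        (fn [value-gens]
                          (gen/return [set-of-keys value-gens]))))))
```

Two dependent generators already mean two nested functions; every further dependency adds another level of nesting.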

The second secret: monads let generators depend on each other’s output.

(require '[clojure.algo.monads :as m])
(m/defmonad gen-m
    [m-bind gen/bind
     m-result gen/return])

(def maps-and-sort-instructions
 (m/domonad gen-m
   [set-of-keys (non-empty-set gen/keyword)
    set-of-value-gens (gen/vector  
                       (gen/elements [gen/string gen/int]) 
                       (count set-of-keys))]
    [set-of-keys, set-of-value-gens]))

I don’t recommend sampling this; generators don’t have nice toStrings. It’s time to put those keys and value-generators together, and pass them to gen/hash-map:

(apply gen/hash-map (mapcat vector set-of-keys set-of-value-gens))
  ;; intersperse keys and value-gens, then pass them to gen/hash-map
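If the mapcat vector trick is unfamiliar, this small sketch shows the interleaving with plain data. The strings stand in for the value generators, and core hash-map stands in for gen/hash-map, so it runs without test.check; the name interleaved is made up for the example.

```clojure
;; mapcat vector walks both collections in step,
;; emitting key, value, key, value...
(def interleaved
  (mapcat vector [:name :age] ["gen-for-name" "gen-for-age"]))
interleaved ;; => (:name "gen-for-name" :age "gen-for-age")

;; apply then spreads that flat sequence out as individual arguments,
;; which is the argument shape hash-map (and gen/hash-map) expects.
(= {:name "gen-for-name", :age "gen-for-age"}
   (apply hash-map interleaved)) ;; => true
```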

That’s a generator of maps. We need 0 or more maps, so here comes gen/vector again:

(def maps-and-sort-instructions
 (m/domonad gen-m
  [set-of-keys (non-empty-set gen/keyword)
   set-of-value-gens (gen/vector  
                      (gen/elements [gen/string gen/int]) 
                      (count set-of-keys))
   some-maps (gen/vector 
              (apply gen/hash-map 
               (mapcat vector set-of-keys 
                              set-of-value-gens)))]
  some-maps))

This is worth sampling a few times:
=> (gen/sample maps-and-sort-instructions 3) ;; produce 3 values
([] [] [{:!6!:t4 "à$", :*B 2, :K0:R*Hw:g:4!? ""}])

It randomly produced two empty vectors first, which is fine. It’s valid to sort 0 maps. If I run that sample more, I’ll see vectors with more maps in them.

Halfway there! Now for the instructions. Start with a subset of the map keys – there’s no subset generator, but I can build one using the non-empty-set defined earlier. I want a non-empty-set of elements from my set-of-keys.

(non-empty-set (gen/elements set-of-keys)) 
  ;; some-keys: 1 or more keys. 

To pair these instruction keys with directions, I’ll generate the right number of directions. Generating a direction means choosing between :ascending or :descending. This is a smaller generator that I can define outside:

(def mygen-direction-of-sort 
      (gen/elements [:ascending :descending])) 

and then to get a specific-length vector of these:

(gen/vector mygen-direction-of-sort (count some-keys)) 
   ;; some-directions

I’ll put the instruction keys with the directions together after the generation is all complete, and assemble the output:

(def maps-and-sort-instructions
 (m/domonad gen-m
  [set-of-keys (non-empty-set gen/keyword)
   set-of-value-gens (gen/vector  
                      (gen/elements [gen/string gen/int]) 
                      (count set-of-keys))
   some-maps (gen/vector 
              (apply gen/hash-map 
               (mapcat vector set-of-keys 
                              set-of-value-gens)))
   some-keys (non-empty-set (gen/elements set-of-keys)) 
   some-directions (gen/vector mygen-direction-of-sort 
                               (count some-keys))]
   (let [instructions (map vector some-keys some-directions)] 
                           ;; pair keys with directions
    [some-maps instructions]))) ;; return maps and instructions

There it is, one giant generator, built of at least 11 small ones. That’s a lot of Clojure code… way too much to trust without a test. I need a property for my generator!

What is important about the output of this generator? Every instruction is a pair, every direction is either :ascending or :descending, and every key in the sort instructions is present in every map. I could also specify that the values for each key are all Comparable with each other, but I haven’t yet. This is close enough:

(require '[clojure.test.check.properties :as prop])

(def sort-instructions-are-compatible-with-maps
  (prop/for-all
    [[rows instructions] maps-and-sort-instructions]
    (every? identity (for [[k direction] instructions]
                          ;; break instructions into parts
              (and (#{:ascending :descending} direction)
                    ;; Clojure looks so weird
                   (every? k rows))))))
                          ;; will be false if the key is absent

(require '[clojure.test.check :as tc])
(tc/quick-check 50 sort-instructions-are-compatible-with-maps)
;; {:result true, :num-tests 50, :seed 1412659276160}

Hurray, my property is true. My generator works. Now I can write a test… then maybe the code… then someday the post that I wanted to write tonight.

You might roll your eyes at me for going to these lengths to test code that’s only going to be used in a blog post. But I want code that works, not just two or three times but all the time. (Write enough concurrent code, and you notice the difference between “I saw it work” and “it works.”) Since I’m working in Clojure, I can’t lean on the compiler to test the skeleton of my program. It’s all on me. And “I saw it work once in the REPL” isn’t satisfying.

Blake Meike points out on Twitter, “Nearly the entire Internet revolution… is based on works-so-far code.” So true! It’s that way at work. Maybe my free-time coding is the only coding I get to do right. Maybe that’s why open-source software has the potential to be more correct than commercial software. Maybe it’s the late-night principles of a few hungry-for-correctness programmers that move technology forward.


But it does feel good to write a solid property-based test.

[1] Coming up with examples is “work,” as opposed to “programming.”

Code for this post: https://github.com/jessitron/sortificate/blob/generator-post/test/sortificate/core_test.clj

TDD is Dead! Long Live TDD!

Imagine that you’re writing a web service. It is implemented with a bunch of classes. Pretend this circle represents your service, and the shapes inside it are classes.

The way I learned test-driven development[1], we wrote itty-bitty tests around every itty-bitty method in each class. Then maybe a few acceptance tests around the outside. This was supposed to help us drive design, and it was supposed to give us safety in refactoring. These automated tests would give us assurance, and make changing the code easier.

It doesn’t work out that way. Tests don’t enable change. Tests prevent change! In particular, when I want to refactor the internals of my service, any class I change means umpteen test changes. And all these tests include example == actual, and I’ve gotta figure out the new magic values that should pass. No fun! These method- or class-level tests are like bars in a cage preventing refactoring.

Tests prevent change, and there’s a place I want to prevent unintentional change: it’s at the service API level. At the outside, where other systems interact with this service, where a change in behavior could be a nasty surprise for some other team. Ideally, that’s where I want to put my automated tests.

Whoa, that is an ugly cage. At the service level, there are often many possible input scenarios. Testing every single one of them is painful. We probably can’t even think of every relevant combination and all the various edge cases. Much easier to zoom in to the class level and test one edge case at a time. Besides, even if we did write the dozens of tests to cover all the possibilities, what happens when the requirements change? Then we have great big tests with long expected == actual assertions, and we have to rework all of those. Bars in a cage, indeed.

Is TDD dead? Maybe it’s time to crown a new TDD. There’s a style of testing that addresses both of the difficulties in API-level testing: it finds all the scenarios and tames the profusion of hard-coded expectations. It’s called generative testing.[2]

Generative testing says, “I’m not gonna think of all the possible scenarios. I’m gonna write code that does it for me.” We write generators, which are objects that know how to produce random valid instances of various input types. The testing framework uses these to produce a hundred different random input scenarios, and runs all of them through the test.

Generative testing says, “I’m not gonna hard-code the output. I’m gonna make sure whatever comes out is good enough.” We can’t hard-code the output when we don’t know what the input is going to be. Instead, assertions are based on the relationship between the output and input. Sometimes we can’t be perfectly specific because we refuse to duplicate the code under test. In these cases we can establish boundaries around the output. Maybe it should be between these values. It should go down as this input value goes up. It should never return more items than requested, that kind of thing.
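To make that concrete, here is a small sketch of a relationship-style property in Clojure’s test.check. The function first-page is hypothetical, invented only so there is something to test; the property name is made up too. Notice that nothing in it names a specific expected value.

```clojure
(require '[clojure.test.check :as tc]
         '[clojure.test.check.generators :as gen]
         '[clojure.test.check.properties :as prop])

;; Hypothetical function under test: return at most n of the items.
(defn first-page [n items]
  (vec (take n items)))

;; Generators describe valid inputs; the property relates output to input.
(def never-returns-more-than-requested
  (prop/for-all [n     gen/nat                ;; requested size, 0 or more
                 items (gen/vector gen/int)]  ;; any vector of ints
    (let [page (first-page n items)]
      (and (<= (count page) n)                ;; a boundary, not an exact value
           (every? (set items) page)))))      ;; everything returned was in the input

;; Run 100 random scenarios through the property.
(tc/quick-check 100 never-returns-more-than-requested)
```

The two assertions draw a boundary around acceptable output without reimplementing first-page, which is the trade described above.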

With these, a few tests can cover many scenarios. Fortify with a few hard-coded examples if needed, and now half a dozen tests at the API level cover all the combinations of all the edge cases, as well as the happy paths.

This doesn’t preclude small tests that drive our class design. Use them, and then delete them. This doesn’t preclude example tests for documentation. Example-based, expected == actual tests are stories, and people think in stories. Give them what they want, and give the computer what it wants: lots of juicy tests in one.

There are obstacles to TDD in this style. It’s way harder. There’s more thinking and less typing here: lots more thinking, to find the assertions that draw a boundary around the acceptable output. That’s the hardest part, and it’s also the best part, because the real benefit of TDD is that it stops you from coding a solution to a problem you don’t understand.

Look for more posts on this topic, to go along with my talks on it. See also my video about Property Based Testing in Scala.

[1] The TDD I learned, at the itty-bitty level with mock all the things, was wrong. It isn’t what Kent Beck espoused. But it’s the easiest.

[2] Or property-based testing, but that has NOTHING to do with properties on a class, so that name confuses people. Aside from that confusion I prefer “property-based”, which speaks about WHY we do this testing, over “generative”, which speaks about how.