Property Testing in Elm

Elm is perfectly suited to property testing, with its delightful data-in–data-out functions. Testing in Elm should be super easy.

The tooling isn’t there yet, though. This post documents what it took today to get a property test running in Elm.

Step 1: elm-test

elm-test includes an Elm library and a Node module for a command-line runner. The library alone will let you create a web page of test results and look at it, but I want to run the tests in my build script and see results in my terminal.

Installation in an existing project:

elm package install deadfoxygrandpa/elm-test
npm install -g elm-test

The node module offers an “elm test init” functionality to put some test files in the current directory: TestRunner.elm (which is the Main module for test runs[1]) and Tests.elm, which holds the actual tests. Personally, I found the following steps necessary as well.

  • create a test directory (I don’t want tests in my project home), and move the TestRunner.elm and Tests.elm files there.
  • add that test directory to the source directories in elm-package.json

Step 2: elm-check

The first thing to know is: which elm-check to install. You need the one from NoRedInk:

elm package install NoRedInk/elm-check

The next thing is: what to import. Where do all those methods used in the README live?

Here is a full program that lets elm-test execute the properties from the elm-check readme.
TL;DR: you need to import things from Check and Check.Producer for all properties; for the runner program, you also need ElmTest, Check.Test, Signal, Console, and Task.

Name it test/Properties.elm and run it with

elm test test/Properties.elm

The output looks like

Successfully compiled test/Properties.elm
Running tests…
  1 suites run, containing 2 tests
  All tests passed

Here’s the full text just in case.

module Main (..) where
import ElmTest
import Check exposing (Evidence, Claim, that, is, for)
import Check.Test
import Check.Producer as Producer
import List
import Signal exposing (Signal)
import Console exposing (IO)
import Task

console : IO ()
console =
  ElmTest.consoleRunner (Check.Test.evidenceToTest evidence)

port runner : Signal (Task.Task x ())
port runner =
  Console.run console

myClaims : Claim
myClaims =
  Check.suite
    "List Reverse"
    [ Check.claim
        "Reversing a list twice yields the original list"
        `that` (\list -> List.reverse (List.reverse list))
        `is` identity
        `for` Producer.list Producer.int
    , Check.claim
        "Reversing a list does not modify its length"
        `that` (\list -> List.length (List.reverse list))
        `is` (\list -> List.length list)
        `for` Producer.list Producer.int
    ]

evidence : Evidence
evidence =
  Check.quickCheck myClaims

How to write properties is a post for another day. For now, at least this will get something running.

See also: a helpful post for running elm-check in phantom.js

[1] How does that even work? I thought modules needed the same name as their file name. Apparently this is not true of Main. You must name the module Main. You do not have to have a ‘main’ function in there (as of this writing). The command-line runner needs the ‘console’ function instead.

Ultratestable Coding Style

Darn side-effecting programs. Programs that change things in the outside world are so darn useful, and such a pain to test.
For every piece of code, there is another piece of code that answers the question, “How do I know that code works?” Sometimes that’s more work than the code itself — but there is hope.

The other day, I made a program to copy some code from one project to another – two file copies, with one small change to the namespace declaration at the top of each file. Sounds trivial, right?

I know better: there are going to be a lot of subtleties. And this isn’t throwaway code. I need good, repeatable tests.

Where do I start? Hmm, I’ll need a destination directory with the expected structure, an empty source directory, files with the namespace at the top… oh, and cleanup code. All of these are harder than I expected, and the one test I did manage to write is specific to my filesystem. Writing code to verify code is so much harder than just writing the code!

Testing side-effecting code is hard. This is well established. The tests are also convoluted, complex, and generally brittle.
The test process looks like this:

[Diagram: input → code under test → output. But the test must also prep files in the right place and clear old ones out beforehand; the code under test reads and writes the filesystem; and afterward the test checks that the files on disk are correct.]

Before the test, create the input AND go to the filesystem, prepare the input and the spot where output is expected.
After the test, check the output AND go to the filesystem, read the files from there and check their contents.
Everything is intertwined: the prep, the implementation of the code under test, and the checks at the end. It’s specific to my filesystem. And it’s slow. No way can I run more than a few of these each build.

The usual solution to this is to mock the filesystem. Use a ports-and-adapters approach. In OO you might use dependency injection; in FP you’d pass functions in for “how to read” and “how to write.” This isolates our code from the real filesystem. Tests are faster and less tightly coupled to the environment. The test process looks like this:

Before the test, create the input AND prepare the mock read results and initialize the mock for write captures.
After the test, check the output AND interrogate the mock for write captures.

It’s an improvement, but we can do better. The test is still convoluted. Elaborate mocking frameworks might make it cleaner, but conceptually, all those ties are still there, with the stateful how-to-write that we pass in and then ask later, “What were your experiences during this test?”

If I move the side effects out of the code under test — gather all input beforehand, perform all writes afterward — then the decision-making part of my program becomes easier and clearer to test. It can look like this (code):

The input includes everything my decisions need to know from the filesystem: the destination directory and list of all files in it; the source directory and list plus contents of all files in it.
The output includes a list of instructions, for the side effects the code would like to perform. This is super easy to check at the end of a test.

The real main method looks different in this design. It has to gather all the input up front[1], then call the key program logic, then carry out the instructions. In order to keep all the decisionmaking, parsing, etc in the “code under test” block, I keep the interface to that function as close as possible to that of the built-in filesystem-interaction commands. It isn’t the cleanest interface, but I want all the parts outside “code-under-test” to be trivial.

[Diagram: simplest-possible code to gather input → well-tested code that makes all the decisions → simplest-possible code to carry out instructions.]
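
To make that concrete, here is a minimal Clojure sketch of the shape of the middle box. The names are mine and the logic is simplified (it only writes each source file into the destination); it is not the actual microlib code.

;; Sketch only: hypothetical names, not the real microlib code.
;; Pure decision function: a description of the world in, instructions out. No side effects.
(defn decide-copies [{:keys [source-files dest-dir]}]      ;; source-files: filename -> contents
  (vec (for [[filename contents] source-files]
         {:action  :write
          :path    (str dest-dir "/" filename)
          :content contents})))

A test calls decide-copies with a plain map and compares the returned instruction vector to what it expects; no filesystem required.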

With this, I answer “How do I know this code works?” in two components. For the real-filesystem interactions, the documentation plus some playing around in the REPL tell me how they work. For the decisioning part of the program, my tests tell me it works. Manual tests for the hard-to-test bits, lots of tests for the hard-to-get-right bits. Reasoning glues them together.

Of course, I’m keeping my one umbrella test that interacts with the real filesystem. The decisioning part of the program is covered by poncho tests. With an interface like this, I can write property-based tests for my program, asserting things like “I never try to write a file in a directory that doesn’t exist” and “the output filename always matches the input filename.”[2]

As a major bonus, error handling becomes more modular. If, on trying to copy the second file, it isn’t found or isn’t valid, the second write instruction is replaced with an “error” instruction. Before any instructions are carried out, the program checks for “error” anywhere in the list (code). If found, stop before carrying out any real action. This way, validations aren’t separated in code from the operations they apply to, and yet all validations happen before operations are carried out. Real stuff happens only when all instructions are possible (as far as the program can tell). It’s close to atomic.
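
A sketch of that check, again with hypothetical names and the instruction shape from the sketch above, rather than the real code:

;; Sketch: refuse to do anything real if any instruction is an error.
(defn carry-out! [instructions]
  (if-let [errors (seq (filter #(= :error (:action %)) instructions))]
    (println "not executing; errors found:" errors)
    (doseq [{:keys [path content]} instructions]
      (spit path content))))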

There are limitations to this straightforward approach to isolating decisions from side-effects. It works for this program because it can gather all the input, produce all the output, and hold all of it in memory at the same time. For a more general approach to this same goal, see Functional Programming in Scala.

By moving all the “what does the world around me look like?” side effects to the beginning of the program, and all the “change the world around me!” side effects to the end, we achieve maximum testability of the program logic. And minimum convolution. And separation of concerns: one module makes the decisions, another one carries them out. Consider this possibility the next time you find yourself in testing pain.


The code that inspired this approach is in my microlib repository.
Interesting bits:
Umbrella test (integration)
Poncho tests (around the decisioning module) (I only wrote a few. It’s still a play project right now.)
Code under test (decisioning module)
Main program
Instruction carrying-out part

Diagrams made with Monodraw. Wanted to paste them in as ASCII instead of screenshots, but that’d be crap on mobile.


[1] This is Clojure, so I put the “contents of each file” in a delay. Files whose contents are not needed are never opened.
[2] I haven’t written property tests, because time.

Property tests don’t have to be generative

Now and then, a property test can be easier than an example test. Today, Tanya and I benefited.

There’s this web service. It returns a whole tree of information, some of it useful and some of it not.

{ "category": "food",
  "children": [ { "category": "fruit",
                  "children": [ …LOTS MORE… ],
                  "updatedAt": "2014-06-30T16:22:36.440Z",
                  "createdAt": "2014-06-30T16:22:36.440Z" },
                { "category": "vegetables",
                  "children": [ …EVEN MORE… ],
                  "updatedAt": "2014-06-25T18:32:36.436Z",
                  "createdAt": "2014-06-25T18:32:36.436Z" } ],
  "updatedAt": "2014-06-15T16:32:36.550Z",
  "createdAt": "2014-03-05T08:12:46.440Z" }

The service is taking a while, mostly because it’s returning half a meg of data. Removing the useless fields will cut that in half.

Being good developers, we want to start with a failing test. An example test might start with inserting data in the database, perhaps after clearing the table out, so we can know what the service should return. That’s a lot of work, when I don’t really care what’s returned, as long as it doesn’t include those updatedAt and createdAt fields.

Currently when we test the function implementing this service, there’s some sample data lying around. If we write a property test instead of an example test, that data is good enough. As long as the service returns some data, and it’s something with children so we can check nested values, it’s good enough. I can test that: (actual code)

(deftest streamline-returned-tree
  (testing "boring fields are not returned"
    (let [result (method-under-test)]
      (is (seq (:children result))))))

This is not a generative test, because it doesn’t generate its own data, and it doesn’t run repeatedly. Yet it is a property test, because it’s asserting a property of the result. The test doesn’t say “expected equals actual.” Instead, it says “The result has children.”

This test passes, and now we can add the property of interest: that the map returned by method-under-test has no :createdAt or :updatedAt keys, at any level. We could find or write a recursive function to dig around in the maps and nested vectors of maps, but that same function would also be useful in the implementation. Duplicating that code in the test is no good.

One of the classic challenges of property testing is finding two different ways to do the same thing. Then we can make assertions without knowing the input. But… nobody said they have to be two smart ways of doing the same thing! I want to be sure there’s no “createdAt blah blah” in the output, so how about we write that nested map to a string and look for “createdAt” in that?

(deftest streamline-returned-tree
  (testing "boring fields are not returned"
    (let [result (method-under-test)
          result-string (pr-str result)]
      (is (seq (:children result)))
      (is (not (.contains result-string ":createdAt"))))))

This gives us a failing test, and it was a heck of a lot easier to implement than an example test which hard-codes expected results. This test is specific about its purpose. As a bonus, it doesn’t use any strategy we’d ever consider using in the implementation. The print-it-to-a-string idea, which sounded stupid at first, expresses the intention of “we don’t want this stuff included.”

Property tests don’t have to be generative, and they don’t have to be clever. Sometimes it’s the dumb tests that work the best.


Bonus material:
the output of this failing test is

expected: (not (.contains result-string ":createdAt"))
  actual: (not (not true))

This “not not true” actual result… yeah, not super useful. clojure.test’s “is” macro responds better to a comparison function than to nested calls. If I define a function not-contains, then I can get:

expected: (not-contains result-string ":createdAt")
  actual: (not (not-contains 
                "{:children [{:createdAt \"yesterday\", :category \"fruit\"}], :createdAt \"last week\", :category \"food\"}" 
                ":createdAt"))

That’s a little more useful, since it shows what it’s comparing.
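
A minimal sketch of such a helper (the name not-contains comes from above; this particular definition is my own):

(defn not-contains [haystack needle]
  (not (.contains haystack needle)))

;; in the test:
;; (is (not-contains result-string ":createdAt"))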

A monadically built generator

Today, I wanted to write a post about code that sorts a vector of maps. But I can’t write that without a test, now can I? And not just any test — a property-based test! I want to be sure my function works all the time, for all valid input. Also, I don’t want to come up with representative examples – that’s too much work.[1]

The function under test is a custom-sort function, which accepts a bunch of rows (represented as a sequence of hashmaps) and a sequence of instructions: “sort by the value of A, descending; then the value of B, ascending.”

To test with all valid input, I must write code to generate all valid input. I need a vector of maps. The maps should have all the same keys. Some of those keys will be sort instructions. The values in the map can be anything Comparable: strings and ints, for instance. Each instruction also includes a direction, ascending or descending. That’s a lot to put together.

For property-based (or “generative”) tests in Clojure, I’ll use test.check. To test a property, I must write a generator that produces input. How do I even start to create a generator this complicated?

Bit by bit! Start with the keys for the maps. Test.check has a generator for them:

(require '[clojure.test.check.generators :as gen])
gen/keyword ;; any valid clojure keyword.

The zeroth secret: I dug around in the source to find useful generators. If it seems like I’m pulling these out of my butt, well, this is what I ate.

Next I need multiple keywords, so add in gen/vector. It’s a function that takes a generator as an argument, and uses that repeatedly to create each element, producing a vector.

(gen/vector gen/keyword) ;; between 0 and some keywords

The first secret: generator composition. Put two together, get a better one out.

Since I want a set of keys, not a vector, it’s time for gen/fmap (“functor map,” as opposed to hashmap). That takes a function to run on each produced value before giving it to me, and its source generator.

(gen/fmap set (gen/vector gen/keyword)) ;; set of 0 or more keywords

It wouldn’t do for that set to be empty; my function requires at least 1 instruction, which means at least one keyword. gen/such-that narrows the possible output of the generator. It takes a predicate and a source generator:

(gen/such-that seq (gen/fmap set (gen/vector gen/keyword)))

If you’re not a seasoned Clojure dev: seq is idiomatic for “not empty.” Historical reasons.
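
For example, such-that keeps only the generated sets that pass the seq check:

(seq #{})    ;; => nil  (falsey, so such-that rejects the empty set)
(seq #{:a})  ;; => (:a) (truthy, so this set passes)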

This is enough to give me a set of keys, but it’s confusing, so I’m going to pull some of it out into a named function.

(defn non-empty-set [elem-g]
  (gen/such-that seq (gen/fmap set (gen/vector elem-g))))

Here’s the generator so far:
(def maps-and-sort-instructions
  (let [set-of-keys  (non-empty-set gen/keyword)]
     set-of-keys))

See what it gives me:
=> (gen/sample maps-and-sort-instructions)
   ;; sample makes the generator produce ten values
(#{:Os} #{:? :f_Q_:_kpY:+:518} #{:? :-kZ:9_:_?Ok:JS?F} …)

Ew. Nasty keywords I never would have come up with. But hey, they’re sets and they’re not empty.

To get maps, I need gen/hash-map. It wants keys, plus generators that produce values; from these it produces maps with a consistent structure, just like I want. It looks like:

(gen/hash-map :one-key gen-of-value :two-key gen-of-this-other-value …)

The value for each key could be anything Comparable really; I’ll settle for strings or ints. Later I can add more to this list. There’s gen/string and gen/int for those; I can choose among them with gen/elements.

(gen/elements [gen/string gen/int]) ;; one of the values in the input vector

I have now created a generator of generators. gen/elements is good for selecting randomly among a known sequence of values. I need a quantity of these value generators, the same quantity as I have keys.

(gen/vector (gen/elements [gen/string gen/int]) (count #??#)) 
  ;; gen/vector accepts an optional length

Well, crap. Now I have a dependency on what I already generated. Test.check alone doesn’t make this easy – you can do it, with some ugly use of gen/bind. Monads to the rescue! With a little plumbing, I can bring in algo.monads, and make the value produced from each generator available to the ones declared after it.

The second secret: monads let generators depend on each other’s output.

(require '[clojure.algo.monads :as m])
(m/defmonad gen-m
    [m-bind gen/bind
     m-result gen/return])

(def maps-and-sort-instructions
 (m/domonad gen-m
   [set-of-keys (non-empty-set gen/keyword)
    set-of-value-gens (gen/vector  
                       (gen/elements [gen/string gen/int]) 
                       (count set-of-keys))]
    [set-of-keys, set-of-value-gens]))

I don’t recommend sampling this; generators don’t have nice toStrings. It’s time to put those keys and value-generators together, and pass them to gen/hash-map:

(apply gen/hash-map (mapcat vector set-of-keys set-of-value-gens))
  ;; intersperse keys and value-gens, then pass them to gen/hash-map
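  ;; e.g. with set-of-keys #{:a :b} and value gens [gen/string gen/int], the mapcat
  ;; yields (:a gen/string :b gen/int), which gen/hash-map takes as key/generator pairs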

That’s a generator of maps. We need 0 or more maps, so here comes gen/vector again:

(def maps-and-sort-instructions
 (m/domonad gen-m
  [set-of-keys (non-empty-set gen/keyword)
   set-of-value-gens (gen/vector  
                      (gen/elements [gen/string gen/int]) 
                      (count set-of-keys))
   some-maps (gen/vector 
              (apply gen/hash-map 
               (mapcat vector set-of-keys 
                              set-of-value-gens)))]
  some-maps))

This is worth sampling a few times:
=> (gen/sample maps-and-sort-instructions 3) ;; produce 3 values
([] [] [{:!6!:t4 "à$", :*B 2, :K0:R*Hw:g:4!? ""}])

It randomly produced two empty vectors first, which is fine. It’s valid to sort 0 maps. If I run that sample more, I’ll see vectors with more maps in them.
Halfway there! Now for the instructions. Start with a subset of the map keys – there’s no subset generator, but I can build one using the non-empty-set defined earlier. I want a non-empty-set of elements from my set-of-keys.

(non-empty-set (gen/elements set-of-keys)) 
  ;; some-keys: 1 or more keys. 

To pair these instruction keys with directions, I’ll generate the right number of directions. Generating a direction means choosing between :ascending or :descending. This is a smaller generator that I can define outside:

(def mygen-direction-of-sort 
      (gen/elements [:ascending :descending])) 

and then to get a specific-length vector of these:

(gen/vector mygen-direction-of-sort (count some-keys)) 
   ;; some-directions

I’ll put the instruction keys with the directions together after the generation is all complete, and assemble the output:

(def maps-and-sort-instructions
 (m/domonad gen-m
  [set-of-keys (non-empty-set gen/keyword)
   set-of-value-gens (gen/vector  
                      (gen/elements [gen/string gen/int]) 
                      (count set-of-keys))
   some-maps (gen/vector 
              (apply gen/hash-map 
               (mapcat vector set-of-keys 
                              set-of-value-gens)))
   some-keys (non-empty-set (gen/elements set-of-keys)) 
   some-directions (gen/vector mygen-direction-of-sort 
                               (count some-keys))]
        
   (let [instructions (map vector some-keys some-directions)] 
                           ;; pair keys with directions
    [some-maps instructions]))) ;; return maps and instructions

There it is, one giant generator, built of at least 11 small ones. That’s a lot of Clojure code… way too much to trust without a test. I need a property for my generator!
What is important about the output of this generator? Every instruction is a pair, every direction is either :ascending or :descending, and every key in the sort instructions is present in every map. I could also specify that the values for each key are all Comparable with each other, but I haven’t yet. This is close enough:

(require '[clojure.test.check.properties :as prop])

(def sort-instructions-are-compatible-with-maps
  (prop/for-all
    [[rows instructions] maps-and-sort-instructions]
    (every? identity
            (for [[k direction] instructions]             ;; break instructions into parts
              (and (#{:ascending :descending} direction)  ;; Clojure looks so weird
                   (every? k rows))))))                   ;; will be false if the key is absent

(require '[clojure.test.check :as tc])
(tc/quick-check 50 sort-instructions-are-compatible-with-maps)
;; {:result true, :num-tests 50, :seed 1412659276160}

Hurray, my property is true. My generator works. Now I can write a test… then maybe the code… then someday the post that I wanted to write tonight.

You might roll your eyes at me for going to these lengths to test code that’s only going to be used in a blog post. But I want code that works, not just two or three times but all the time. (Write enough concurrent code, and you notice the difference between “I saw it work” and “it works.”) Since I’m working in Clojure, I can’t lean on the compiler to test the skeleton of my program. It’s all on me. And “I saw it work once in the REPL” isn’t satisfying.

Blake Meike points out on Twitter, “Nearly the entire Internet revolution… is based on works-so-far code.” So true! It’s that way at work. Maybe my free-time coding is the only coding I get to do right. Maybe that’s why open-source software has the potential to be more correct than commercial software. Maybe it’s the late-night principles of a few hungry-for-correctness programmers that move technology forward.

Nah.

But it does feel good to write a solid property-based test.

————-
[1] Coming up with examples is “work,” as opposed to “programming.”

Code for this post: https://github.com/jessitron/sortificate/blob/generator-post/test/sortificate/core_test.clj

TDD with generative testing: an example in Ruby

Say I’m in retail, and the marketing team has an app that helps them evaluate the sales of various items. I’m working on a web service that, given an item, tells them how many purchases were influenced by various advertising channels: the mobile app, web ads, and spam email.

The service will look up item purchase records from one system, then access the various marketing systems that know about ad clicks, mobile app usage, and email sends. It returns how many purchases were influenced by each, and uses some magic formula to calculate the relevance of each channel to this item’s sales.

My goal is to test this thoroughly at the API level. I can totally write an example-based test for this, with a nice happy-path input and hard-coded expected output. And then I need to test edge cases and error cases: when no channels have impact; when they all have the same impact; when one fails; when they time out; et cetera, et cetera.

Instead, I want to write a few generative tests. What might they look like?

When I’m using test-driven development with generative testing, I start at the outside. What can I say about the output of this service? For each channel, the number of influenced purchases can’t be bigger than the total purchases. And the relevance number should be between 0 and 100, inclusive. I can assert that in rspec.

expect(influenced_purchases).to be <= total_purchases
expect(relevance).to be >= 0
expect(relevance).to be <= 100

These kinds of assertions are called “properties”. Here, “property” has NOTHING TO DO with a field on a class. In this context, a property is something that is always true for specified circumstances. The generated input will specify the circumstances.

To test this, I’ll need to run some input through my service and then make these checks for each channel in the output. It needs some way to query the purchase and marketing services, and I’m not going to make real calls a hundred times. Therefore my service will use adapters to access the outside world, and test adapters will serve up data.

result = InfluenceService.new(TestPurchaseAdapter.new(purchases),
                make_adapters(channel_events)).investigate(item)

result.channels.each do |(channel, influence)|
  expect(influence.influenced_purchases).to be <= total_purchases
  expect(influence.relevance).to be >= 0
  expect(influence.relevance).to be <= 100
end

(relatively complete code sample here.)
To do this, I need purchases, events on each channel, and an item. My test needs to generate these 100 times, and then do the assertions 100 times. I can use rantly for this. The test looks like this:

it "returns a reasonable amount of influence" do
 property_of {
  … return an array [purchases, channel_events, item] …
 }.check do |(purchases, channel_events, item)|
  total_purchases = purchases.size
  result = InfluenceService.new(TestPurchaseAdapter.new(purchases),
                make_adapters(channel_events)).investigate(item)

  result.channels.each do |(channel, influence)|
   expect(influence.influenced_purchases).to be <= total_purchases
   expect(influence.relevance).to be >= 0
   expect(influence.relevance).to be <= 100
  end
 end
end

(Writing generators needs a post of its own.)
Rantly will call the property_of block, and pass its result into the check block, 100 times or until it finds a failure. Its objective is to disprove the property (which we assert is true for all input) by finding an input value that makes the assertions fail. If it does, it prints out that input value, so you can figure out what’s failing.

It does more than that, actually: it attempts to find the simplest input that makes the property fail. This makes finding the problem easier. It also helps me with TDD, because it boils this general test into the simplest case, the same place I might have started with traditional TDD.

In my TDD cycle, I make this test compile. Then it fails, and rantly reports the simplest case: no purchases, no events, any old item. After I make that test pass, rantly reports another simple input case that fails. Make that work. Repeat.

Once this test passes, all I have is a stub implementation. Now what? It’s time to add properties gradually. Usually at this point I sit back and think for a while. Compared to example-based TDD, generative testing is a lot more thinking and less typing. How can I shrink the boundaries?

It’s time for another post on relative properties.

TDD is Dead! Long Live TDD!

Imagine that you’re writing a web service. It is implemented with a bunch of classes. Pretend this circle represents your service, and the shapes inside it are classes.

The way I learned test-driven development[1], we wrote itty-bitty tests around every itty-bitty method in each class. Then maybe a few acceptance tests around the outside. This was supposed to help us drive design, and it was supposed to give us safety in refactoring. These automated tests would give us assurance, and make changing the code easier.

It doesn’t work out that way. Tests don’t enable change. Tests prevent change! In particular, when I want to refactor the internals of my service, any class I change means umpteen test changes. And all these tests include expected == actual, and I’ve gotta figure out the new magic values that should pass. No fun! These method- or class-level tests are like bars in a cage, preventing refactoring.

Tests prevent change, and there’s a place I want to prevent unintentional change: it’s at the service API level. At the outside, where other systems interact with this service, where a change in behavior could be a nasty surprise for some other team. Ideally, that’s where I want to put my automated tests.

Whoa, that is an ugly cage. At the service level, there are often many possible input scenarios. Testing every single one of them is painful. We probably can’t even think of every relevant combination and all the various edge cases. Much easier to zoom in to the class level and test one edge case at a time. Besides, even if we did write the dozens of tests to cover all the possibilities, what happens when the requirements change? Then we have great big tests with long expected == actual assertions, and we have to rework all of those. Bars in a cage, indeed.

Is TDD dead? Maybe it’s time to crown a new TDD. There’s a style of testing that addresses both of the difficulties in API-level testing: it finds all the scenarios and tames the profusion of hard-coded expectations. It’s called generative testing.[2]

Generative testing says, “I’m not gonna think of all the possible scenarios. I’m gonna write code that does it for me.” We write generators, which are objects that know how to produce random valid instances of various input types. The testing framework uses these to produce a hundred different random input scenarios, and runs all of them through the test.

Generative testing says, “I’m not gonna hard-code the output. I’m gonna make sure whatever comes out is good enough.” We can’t hard-code the output when we don’t know what the input is going to be. Instead, assertions are based on the relationship between the output and input. Sometimes we can’t be perfectly specific because we refuse to duplicate the code under test. In these cases we can establish boundaries around the output. Maybe, it should be between these values. It should go down as this input value goes up. It should never return more items than requested, that kind of thing.
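
For instance, in Clojure’s test.check (this is my sketch — the post itself names no library here, and `take` stands in for the code under test), “it should never return more items than requested” looks like:

(require '[clojure.test.check :as tc]
         '[clojure.test.check.generators :as gen]
         '[clojure.test.check.properties :as prop])

(def never-more-than-requested
  (prop/for-all [items (gen/vector gen/int)
                 n gen/nat]
    (<= (count (take n items)) n)))    ;; a relationship between output and input

(tc/quick-check 100 never-more-than-requested)  ;; runs 100 generated cases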

With these, a few tests can cover many scenarios. Fortify with a few hard-coded examples if needed, and now half a dozen tests at the API level cover all the combinations of all the edge cases, as well as the happy paths.

This doesn’t preclude small tests that drive our class design. Use them, and then delete them. This doesn’t preclude example tests for documentation. Example-based, expected == actual tests are stories, and people think in stories. Give them what they want, and give the computer what it wants: lots of juicy tests in one.

There are obstacles to TDD in this style. It’s way harder: more thinking, less typing. The hardest part is finding the assertions that draw a boundary around the acceptable output. It’s also the best part, because the real benefit of TDD is that it stops you from coding a solution to a problem you don’t understand.

Look for more posts on this topic, to go along with my talks on it. See also my video about Property-Based Testing in Scala.


[1] The TDD I learned, at the itty-bitty level with mock all the things, was wrong. It isn’t what Kent Beck espoused. But it’s the easiest.
[2] Or property-based testing, but that has NOTHING to do with properties on a class, so that name confuses people. Aside from that confusion, I prefer “property-based”, which speaks about WHY we do this testing, over “generative”, which speaks about how.

Quick reference: monads and test.check generators

Combine algo.monads with test.check generators to build a big generator out of smaller generators that depend on each other:

(require '[clojure.test.check.generators :as gen])
(require '[clojure.algo.monads :as m])
(m/defmonad gen-m 
  [m-bind gen/bind 
   m-result gen/return])

(def vector-and-elem
  (m/domonad gen-m
    [n (gen/choose 1 10)
     v (gen/vector gen/int n)
     e (gen/elements v)]
    [v, e]))

(gen/sample vector-and-elem)
;; ([[0 0] 0]
;;  [[0 -1 1 0 -1 0 -1 1] 0]
;;  [[1 1 3 3 3 -1 0 -2 2] 3]
;;  [[8 4] 8] …

The generator here chooses a vector length, uses that to generate a vector, uses that to pick an element inside the vector, and then returns a tuple of the vector and the element. The syntax is cleaner than a lot of gen/bind and gen/fmap calls. It looks a lot like ScalaCheck.

I suspect we could define m-zero and m-plus in the monad to get :when conditions as well.

I’m working on a longer post that explains what’s going on here and why we would do it.

Testing akka actor termination

When testing akka code, I want to make sure a particular actor gets shut down within a time limit. I used to do it like this:

 Thread.sleep(2.seconds.toMillis)
 assertTrue(actorRef.isTerminated)

That isTerminated method is deprecated since Akka 2.2, and good thing too, since my test was wasting everyone’s time. Today I’m doing this instead:

import akka.actor.Terminated
import akka.testkit.TestProbe
import scala.concurrent.duration._

val probe = new TestProbe(actorSystem)
probe.watch(actorRef)
probe.expectMsgPF(2.seconds){ case Terminated(actorRef) => true }

This says: set up a TestProbe actor, and have it watch the actorRef of interest. Wait for the TestProbe to receive notification that the actor of interest has been terminated. If actorRef has already terminated, that message will come right away. My test doesn’t have to wait the maximum allowed time.[1]

This works in any old test method with access to the actorSystem — I don’t have to extend akka.testkit.TestKit to use the TestProbe.

BONUS: In a property-based test, I don’t want to throw an exception, but rather return a result, a property with a nice label. In that case my function gets a little weirder:

import org.scalacheck.Prop
import org.scalacheck.Prop._   // for the :| label and the Boolean-to-Prop conversion

def shutsDown(actorSystem: ActorSystem, 
              actorRef: ActorRef): Prop = {
  val maxWait = 2.seconds
  val probe = new TestProbe(actorSystem)
  probe.watch(actorRef)
  try {
    probe.expectMsgPF(maxWait){ case Terminated(actorRef) => true }
  } catch { 
    case ae: AssertionError => 
      false :| s"actor not terminated within $maxWait"
  }
}

———–
[1] This is still blocking the thread until the Terminated message is received or the timeout expires. I eagerly await the day when test methods can return a Future[TestResult].

Property-based testing of higher-order functions

Property-based testing and functional programming are friends, because they’re both motivated by Reasoning About Code. Functional programming is about keeping your functions data-in, data-out so that you can reason about them. Property-based testing is about expressing the conclusions of that reasoning as properties, then showing that they’re (probably) true by testing them with hundreds of different input values.

example:
Person: “By my powers of reasoning, I can tell that this code will always produce an output collection longer than the input collection.”

Test: “For any generated collection listypoo, this function will return a collection of size > listypoo.size.”

Testing Framework: “I will generate 100 different collections, throw them at that test, and try to make it fail.”

Property-based testing gets tricky when I try to combine it with another aspect of functional programming: higher-order functions. If the function under test accepts a function as a parameter, how do I generate that? If it returns a function, how do I validate it’s the right one?

In the trivial case, we pass in the identity function or a constant function, and base our expectations on that. But that violates the spirit of generative testing.

Tests operate on values. Values go in, values come out, and we can check those. Functions are tested by operating on values. So to test functions that operate on functions, we need to generate functions that operate on values, and generate values to test the functions that came out of the function under test. Then we’ll need to state properties in terms of values going in and out of functions going in and out… ack! It’s Higher-Order-Property-Based Testing.

My sources on twitter tell me that QuickCheck for Erlang and Haskell have generators for functions. Looking at this sample of the Haskell one, I think this is what it does:

Each generated function from A => B includes a random factor that’s constant to the function, and a generator for B that can create Bs based on a seed. When an A is passed in, it’s combined with the function’s random factor to prod the generator into creating a deterministically-random B.

So, we can create random functions A=>B if we have a generator of B, a random-factor-generator, and some mapping from (A, random-factor) to the seed for the generator of B. That doesn’t sound so bad.
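
Here is a rough sketch of that idea in Clojure’s test.check (my own construction, simplified so that B is just a small int rather than coming from an arbitrary generator of B):

(require '[clojure.test.check.generators :as gen])

;; each generated "function" closes over a random salt; its output is derived
;; deterministically from (salt, input), so it varies across functions but is repeatable
(def gen-fn-to-small-int
  (gen/fmap (fn [salt]
              (fn [a] (mod (hash [salt a]) 100)))
            gen/int))

;; (def f (first (gen/sample gen-fn-to-small-int 1)))
;; (f :x) returns the same value every time, but a different generated function maps :x elsewhere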

There are two other functionalities in good generators. When a test fails, we need enough information to figure out why and fix the bug. One tool for this is shrinking: make the generated input simpler and simpler until the test doesn’t fail anymore, and then report the simplest failing case. I haven’t even started thinking about that in function-generation. (Does Haskell or Erlang do it?)
The second tool for fixing failed tests is simpler: print out the input that revealed the bug. That’s also hard with generated functions. Since the generated functions described above are random mappings from input to output, printing the actual input-output mapping used in the failing property evaluation should be sufficient. This is implemented in Haskell, with no state, so it must be possible in Scala and Erlang. And it’s in F#.

Like everything in property-based testing, generic input generation is the easy part (even when it isn’t easy). Describing properties expected of output functions based on input functions – now that sounds hard. When I write a test that does it, that’ll be worth another post.

Thanks to @reiddraper, @ericnormand, @silentbicycle, @natpryce, @deech, @kot_2010 and everyone who chimed in. Please continue.