Provenance and causality in distributed systems

Can you take a piece of data in your system and say what version of code put it in there, based on what messages from other systems? and what information a human viewed before triggering an action?

Me neither.

Why is this acceptable? (because we’re used to it.)
We could make this possible. We could trace the provenance of data. And at the same time, mostly-solve one of the challenges of distributed systems.

Speaking of distributed systems…

In a distributed system (such as a web app), we can’t say for sure what events happened before others. We get into general relativity complications even at short distances, because information travels through networks at unpredictable speeds. This means there is no one such thing as time, no single sequence of events that says what happened before what. There is time-at-each-point, and inventing a convenient fiction to reconcile them is a pain in the butt.

We usually deal with this by funneling every event through a single point: a transactional database. Transactions prevent simultaneity. Transactions are a crutch.

Some systems choose to apply an ordering after the fact, so that no clients have to wait their turn in order to write events into the system. We can construct a total ordering, like the one that the transactional database is constructing in realtime, as a batch process. Then we have one timeline, and we can use this to think about what events might have caused which others. Still: putting all events in one single ordering is a crutch. Sometimes, simultaneity is legit.

When two different customers purchase two different items from two different warehouses, it does not matter which happened first. When they purchase the same item, it still doesn’t matter – unless we only find one in inventory. And even then: what matters more, that Justyna pushed “Buy” ten seconds before Edith did, or that Edith upgraded to 1-day shipping? Edith is in a bigger hurry. Prioritizing these orders is a business decision. If we raise the time-ordering operation to the business level, we can optimize that decision. At the same time, we stop requiring the underlying system to order every event with respect to every other event.

On the other hand, there are events that we definitely care happened in a specific sequence. If Justyna cancels her purchase, that was predicated on her making it. Don’t mix those up. Each customer saw a specific set of prices, a tax amount, and an estimated ship date. These decisions made by the system caused (in part) the customer’s purchase. They must be recorded either as part of the purchase event, or as events that happened before the purchase.

Traditionally we record prices and estimated ship date as displayed to the customer inside the purchase. What if instead, we thought of the pricing decision and the ship date decision as events that happened before the purchase? and the purchase recorded that those events definitely happened before the purchase event?

We would be working toward establishing a different kind of event ordering. Did Justyna’s purchase happen before Edith’s? We can’t really say; they were at different locations, and neither influenced the other. That pricing decision though, that did influence Justyna’s purchase, so the price decision happened before the purchase.

This allows us to construct a more flexible ordering, something wider than a line.

Causal ordering

Consider a git history. By default, git log prints a line of commits as if they happened in that order — a total ordering.

But that’s not reality. Some commits happen before others: each commit I make is based on its parent, and every parent of that parent commit, transitively. So the parent happened before mine. Meanwhile, you might commit to a different branch. Whether my commit happened before yours is irrelevant. The merge commit brings them together; both my commit and yours happen before the merge commit, and after the parent commit. There’s no need for a total ordering here. The graph expresses that.

This is a causal ordering. It doesn’t care so much about clock time. It cares what commits I worked from when I made mine. I knew about the parent commit, I started from there, so it’s causal. Whatever you were doing on your branch, I didn’t know about it, it wasn’t causal, so there is no “before” or “after” relationship to yours and mine.

We can see the causal ordering clearly, because git tracks it: each commit knows its parents. The cause of each commit is part of the data in the commit.

Back to our retail example. If we record each event along with the events that caused it, then we can make a graph with enough of a causal ordering.

There are two reasons we want an ordering here: external consistency and internal legibility.

External Consistency

External consistency means that Justyna’s experience remains true. Some events are messages from our software system to Justyna (the price is $), and others are messages coming in (Confirm Purchase, Cancel Purchase). The sequence of these external interactions constrains any event ordering we choose. Messages crossing the system boundary must remain valid.

Here’s a more constricting example of external consistency: when someone runs a report and sees a list of transactions for the day, that’s an external message. That message is caused by all the transactions reported in it. If another transaction comes in late, it must be reported later as an amendment to that original report — whereas, if no one had run the report for that day yet, it could be lumped in with the other ones. No one needs to know that it was slow, if no one had looked.

Have you ever run a report, sent the results up the chain, and then had the central office accuse you of fudging the numbers because they run the same report (weeks later) and see different totals? This happens in some organizations, and it’s a violation of external consistency.

Internal Legibility

Other causal events are internal messages: we displayed this price because the pricing system sent us a particular message. The value of retaining causal information here is troubleshooting, and figuring out how our system works.

I’m using the word “legibility”[1] in the sense of “understandability:” as a person we have visibility into the system’s workings, we can follow along with what it’s doing. Distinguish its features, locate problems and change it.

 If Justyna’s purchase event is caused by a ship date decision, and the ship date decision (“today”) tracked its causes (“the inventory system says we have one, with more arriving today”), then we can construct a causal ordering of events. If Edith’s purchase event tracked a ship date decision (“today”) which tracked its causes (“the inventory system says we have zero, with more arriving today”), then we can track a problem to its source. If in reality we only send one today, then it looks like the inventory system’s shipment forecasts were inaccurate.

How would we even track all this?

The global solution to causal ordering is: for every message sent by a component in the system, record every message received before that. Causality at a point-in-time-at-a-point-in-space is limited to information received before that point in time, at that point in space. We can pass this causal chain along with the message.

“Every message received” is a lot of messages. Before Justyna confirmed that purchase, the client component received oodles of messages, from search results, from the catalog, from the ad optimizer, from the review system, from the similar-purchases system, from the categorizer, many more. The client received and displayed information about all kinds of items Justyna did not purchase. Generically saying “this happened before, therefore it can be causal, so we must record it ALL” is prohibitive.

This is where business logic comes in. We know which of these are definitely causal. Let’s pass only those along with the message.

There are others that might be causal. The ad optimizer team probably does want to know which ads Justyna saw before her purchase. We can choose whether to include that with the purchase message, or to reconstruct an approximate timeline afterward based on clocks in the client or in the components that persist these events. For something as aggregated as ad optimization, approximate is probably good enough. This is a business tradeoff between accuracy and decoupling.

Transitive causality

How deep is the causal chain passed along with a message?

We would like to track backward along this chain. When we don’t like the result of Justyna and Edith’s purchase fulfillment, we trace it back. Why did the inventory system said the ship date would be today in both cases. This decision is an event, with causes of “The current inventory is 1” and “Normal turnover for this item is less than 1 per day”; or “The current inventory is 0” and “a shipment is expected today” and “these shipments usually arrive in time to be picked the same day.” From there we can ask whether the decision was valid, and trace further to learn whether each of these inputs was correct.

If every message comes with its causal events, then all of this data is part of the “Estimated ship date today” sent from the inventory system to the client. Then the client packs all of that into its “Justyna confirmed this purchase” event. Even with slimmed-down, business-logic-aware causal listings, messages get big fast.

Alternately, the inventory system could record its decision, and pass a key with the message to the client, and then the client only needs to retain that key. Recording every decision means a bunch of persistent storage, but it doesn’t need to be fast-access. It’d be there for troubleshooting, and for aggregate analysis of system performance. Recording decisions along with the information available at the time lets us evaluate those decisions later, when outcomes are known.


A system component that chooses to retain causality in its events has two options: repeat causal inputs in the messages it sends outward; or record the causal inputs and pass a key in the messages it sends outward.

Not every system component has to participate. This is an idea that can be rolled out gradually. The client can include in the purchase event as much as its knows: the messages it received, decisions it made, and relevant messages sent outward before this incoming “Confirm Purchase” message was received from Justyna. That’s useful by itself, even when the inventory system isn’t yet retaining its causalities.

Or the inventory system could record its decisions, the code version that made them, and the inputs that contributed to them, even though the client doesn’t retain the key it sends in the message. It isn’t as easy to find the decision of interest without the key, but it could still be possible. And some aggregate decision evaluation can still happen. Then as other system components move toward the same architecture, more benefits are realized.

Conscious Causal Ordering

The benefits of a single, linear ordering of events are consistency, legibility, and visibility into what might be causal. A nonlinear causal ordering gives us more flexibility, consistency, a more accurate but less simplified legibility, and clearer visibility into what might be causal. Constructing causal ordering at the generic level of “all messages received cause all future messages sent” is expensive and also less meaningful than a business-logic-aware, conscious causal ordering. This conscious causal ordering gives us external consistency, accurate legibility, and visibility into what we know to be causal.

At the same time, we can have provenance for data displayed to the users or recorded in our databases. We can know why each piece of information is there, and we can figure out what went wrong, and we can trace all the data impacted by an incorrect past event.

I think this is something we could do, it’s within our ability today. I haven’t seen a system that does it, yet. Is it because we don’t care enough — that we’re willing to say “yeah, I don’t know why it did that, can’t reproduce, won’t fix”? Is it because we’ve never had it before — if we once worked in a system with this kind of traceability, would we refuse to ever go back?

[1] This concept of “legibility” comes from the book Seeing Like a State.

An Opening Example of Elm: building HTML by parsing parameters

I never enjoyed front-end development, until I found Elm. JavaScript with its `undefined`, its untyped functions, its widely scoped mutable variables. It’s like Play-Doh, it’s so malleable. And when I try to make a sculpture, the arms fall off. It takes a lot of skill to make Play-Doh look good.

Then Richard talked me into trying Elm. Elm is more like Lego Technics. Fifteen years ago, I bought and built a Lego Technics space shuttle, and twelve years ago I gave up on getting that thing apart. It’s still in my attic. Getting those pieces to fit together takes some work, but once you get there, they’re solid. You’ll never get “method not found on `undefined`” from your Elm code.

Elm is a front-end, typed functional language; it to JavaScript for use in the browser. It’s a young language (as of 2015), full of opportunity and surprises. My biggest surprise so far: I do like front-end programming!

To guarantee that you never get `undefined` and never call a method that doesn’t exist, all Elm functions are Data in, Data out. All data is immutable. All calls to the outside world are isolated. Want to hit the server? Want to call a JavaScript library? That happens through a port. Ports are declared in the program’s main module, so they can never hide deep in the bowels of components. Logic is in one place (Elm), interactions in another.

one section (Elm) has business logic and is data-in, data-out. It has little ports to another section( JavaScript) that can read input, write files, draw UI. That section blurs into the whole world, including the user.

This post describes a static Elm program with one tiny port to the outside world. It illustrates the structure of a static page in Elm. Code is here, and you can see the page in action here. The program parses the parameters in the URL’s query string and displays them in an HTML table.[1]

All web pages start with the HTML source:

  URL Parameters in Elm
  <script src="elm.js” type=”text/javascript”>

  var app = Elm.fullscreen(Elm.UrlParams,
                           { windowLocationSearch:

This brings in my compiled Elm program and some CSS. Then it calls Elm’s function to start the app, giving it the name of my module which contains main, and extra parameters, using JavaScript’s access to the URL search string.

Elm looks for the main function in my module. The output of this function can be a few different types, and this program uses the simplest one: Html. This type is Elm’s representation of HTML output, its virtual DOM.

module UrlParams where

import ParameterTable exposing (view, init)
import Html exposing (Html)

main : Html
main = view (init windowLocationSearch)

port windowLocationSearch : String

The extra parameters passed from JavaScript arrive in the windowLocationSearch port. This is the simplest kind of port: input received once at startup. Its type is simply String. This program uses one custom Elm component, ParameterTable. The main function uses the component’s view function to render, and passes it a model constructed by the component’s init method.

Somewhere inside the JavaScript call to Elm.fullscreen, Elm calls the main function in UrlParams, converts the Html output into real DOM elements, and renders that in the browser. Since this is a static application, this happens once. More interesting Elm apps have different return types from main, but that’s another post.

From here, the data flow of this Elm program looks like this:

The three layers are: a main module, a component, and a library of functions.
The main module has one input port for the params.  That String is transformed by init into a Model, which is transformed by View into Html. The Html is returned by main and rendered in the browser. This is the smallest useful form of the Elm Architecture that I came up with.

Here’s a piece of the ParameterTable module:

module ParameterTable(view, init) where

import Html exposing (Html)
import UrlParameterParser exposing (ParseResult(..), parseSearchString)

type alias Model = { tableData: ParseResult }

init: String -> Model
init windowLocationSearch =
  { tableData = parseSearchString windowLocationSearch }

viewModel -> Html
view model =
  Html.div …

The rest of the code has supporting functions and details of the view. These pieces (Model, init, and view) occur over and over in Elm. Often the Model of one component is composed from the Models of subcomponents, and the same with init and view functions.[2]

All the Elm files are transformed by elm-make into elm.js. Then index.html imports elm.js and calls its Elm.fullscreen function, passing UrlParams as the main module and in the extra parameter. And so, a static (but not always the same) web page is created from data-in, data-out Elm functions. And I am a happy programmer.

[1] Apparently there’s not a built-in thing in JavaScript for parsing these. Which is shocking. I refused to write such a thing in JavaScript (where by “write” I mean “copy from StackOverflow”), so I wrote it in Elm.

[2] Ditto with update and Action, but that’s out of scope. This post is about a static page.

Data-in, Data-out

In functional programming, we try to keep our functions data-in, data-out: they take some data as parameters, return some data as output, and that’s it. Nothing else. No dialog boxes pop, no environment variables are read, no database rows are written, no files are accessed. No global state is read or written. The output of the function is entirely determined by the values of its input. The function is isolated from the world around it.

A data-in, data-out function is highly testable, without complicated mocking. The test provides input, looks at the output, and that’s all that it needs for a complete test.[1]

A data-in, data-out function is pretty well documented by its declaration; its input types specify everything necessary for the function to work, its output type specifies the entire result of calling it. Give the function a good name that describes its purpose, and you’re probably good for docs.

It’s faster to comprehend a data-in, data-out function because you know a lot of things it won’t do. It won’t go rooting around in a database. It won’t interrupt the user’s flow. It won’t need any other program to be running on your computer. It won’t write to a file[2]. All these are things I don’t have to think about when calling a data-in, data-out function. That leaves more of my brain for what I care about.

If all of our code was data-in, data-out, then our programs would be useless. They wouldn’t do anything observable. However, if 85% of our code is data-in, data-out, with some input-gathering and some output-writing and a bit of UI-updating — then our program can be super useful, and most of it still maximally comprehensible. Restricting our code in this way when we’re writing it provides more clarity when we’re reading it and freedom when we’re refactoring it.

Think about data-in, data-out while you’re coding; make any dependencies on the environment and effects on the outside world explicit; and write most of your functions as transformations of data. This gets you many of the benefits of functional programming, no matter what language you write your code in.

[1] Because the output is fixed for a given input, it would be legit to substitute the return value for the function-call-with-that-input at any point. Like, one could cache the return values if that helped with performance, because it’s impossible for them to be different next time, and it’s impossible to notice that the function wasn’t called because calling it has no externally-observable effect. Historically, this property is called referential transparency.

[2] We often make an exception for logging, especially logging that gets turned off in production.

Ultratestable Coding Style

Darn side-effecting programs. Programs that change things in the outside world are so darn useful, and such a pain to test.
what's better than green? Ultra!For every piece of code, there is another piece of code that answers the question, “How do I know that code works?” Sometimes that’s more work than the code itself — but there is hope.

The other day, I made a program to copy some code from one project to another – two file copies, with one small change to the namespace declaration at the top of each file. Sounds trivial, right?

I know better: there are going to be a lot of subtleties. And this isn’t throwaway code. I need good, repeatable tests.

Where do I start? Hmm, I’ll need a destination directory with the expected structure, an empty source directory, files with the namespace at the top… oh, and cleanup code. All of these are harder than I expected, and the one test I did manage to write is specific to my filesystem. Writing code to verify code is so much harder than just writing the code!

Testing side-effecting code is hard. This is well established. It’s also convoluted, complex, generally brittle.
The test process looks like this:

input to code under test to output, but also prep the files in the right place and clear old files out, then the code under test does read & write on the filesystem, then check that the files are correct

Before the test, create the input AND go to the filesystem, prepare the input and the spot where output is expected.
After the test, check the output AND go to the filesystem, read the files from there and check their contents.
Everything is intertwined: the prep, the implementation of the code under test, and the checks at the end. It’s specific to my filesystem. And it’s slow. No way can I run more than a few of these each build.

The usual solution to this is to mock the filesystem. Use a ports-and-adapters approach. In OO you might use dependency injection; in FP you’d pass functions in for “how to read” and “how to write.” This isolates our code from the real filesystem. Test are faster and less tightly coupled to the environment. The test process looks like this:

Before the test, create the input AND prepare the mock read results and initialize the mock for write captures.
After the test, check the output AND interrogate the mock for write captures.

It’s an improvement, but we can do better. The test is still convoluted. Elaborate mocking frameworks might make it cleaner, but conceptually, all those ties are still there, with the stateful how-to-write that we pass in and then ask later, “What were your experiences during this test?”

If I move the side effects out of the code under test — gather all input beforehand, perform all writes afterward — then the decisionmaking part of my program becomes easier and more clear to test. It can look like this (code):

The input includes everything my decisions need to know from the filesystem: the destination directory and list of all files in it; the source directory and list plus contents of all files in it.
The output includes a list of instructions, for the side effects the code would like to perform. This is super easy to check at the end of a test.

The real main method looks different in this design. It has to gather all the input up front[1], then call the key program logic, then carry out the instructions. In order to keep all the decisionmaking, parsing, etc in the “code under test” block, I keep the interface to that function as close as possible to that of the built-in filesystem-interaction commands. It isn’t the cleanest interface, but I want all the parts outside “code-under-test” to be trivial.

simplest possible code to gather input, to well-tested code that makes all the decisions, to simplest-possible code to carry out instructions.

With this, I answer “How do I know this code works?” in two components. For the real-filesystem interactions, the documentation plus some playing around in the REPL tell me how they work. For the decisioning part of the program, my tests tell me it works. Manual tests for the hard-to-test bits, lots of tests for the hard-to-get-right bits. Reasoning glues them together.

Of course, I’m keeping my one umbrella test that interacts with the real filesystem. The decisioning part of the program is covered by poncho tests. With an interface like this, I can write property-based tests for my program, asserting things like “I never try to write a file in a directory that doesn’t exist” and “the output filename always matches the input filename.”[2]

As a major bonus, error handling becomes more modular. If, on trying to copy the second file, it isn’t found or isn’t valid, the second write instruction is replaced with an “error” instruction. Before any instructions are carried out, the program checks for “error” anywhere in the list (code). If found, stop before carrying out any real action. This way, validations aren’t separated in code from the operations they apply to, and yet all validations happen before operations are carried out. Real stuff happens only when all instructions are possible (as far as the program can tell). It’s close to atomic.

There are limitations to this straightforward approach to isolating decisions from side-effects. It works for this program because it can gather all the input, produce all the output, and hold all of it in memory at the same time. For a more general approach to this same goal, see Functional Programming in Scala.

Moving all the “what does the world around me look like?” side effects to the beginning of the program, and all the “change the world around me!” side effects to the end of the program, we achieve maximum testability of program logic. And minimum convolution. And separation of concerns: one module makes the decisions, another one carries them out. Consider this possibility the next time you find yourself in testing pain.

The code that inspired this approach is in my microlib repository.
Interesting bits:
Umbrella test (integration)
Poncho tests (around the decisioning module) (I only wrote a few. It’s still a play project right now.)
Code under test (decisioning module)
Main program
Instruction carrying-out part

Diagrams made with Monodraw. Wanted to paste them in as ASCII instead of screenshots, but that’d be crap on mobile.

[1] This is Clojure, so I put the “contents of each file” in a delay. Files whose contents are not needed are never opened.
[2] I haven’t written property tests, because time.

Readable, or reason-aboutable?

My coworker Tom finds Ruby unreadable.

What?? I’m thinking. Ruby can be quite expressive, even beautiful.

But Tom can’t be sure what Ruby is going to do. Some imported code could be modifying methods on built-in classes. You can never be sure exactly what will happen when this Ruby code executes.

He’s right about that. “Readable” isn’t the word I’d use though: Ruby isn’t “reason-aboutable.” You can’t be completely sure what it’s going to do without running it. (No wonder Rubyists are such good testers.)

Tom agreed that Ruby could be good at expressing the intent of the programmer. This is a different goal from knowing exactly how it will execute.

Stricter languages are easier to reason about. In Java I can read the specification and make inferences about what will happen when I use the built-in libraries. In Java, I hate the idea of bytecode modification because it interferes with that reasoning.

With imperative code in Java or Python, where what you see it what you get, you can try to reason about these by playing compiler. Step through what the computer is supposed to do at each instruction. This is easier when data is immutable, because then you can trace back to the one place it could possibly be set.

Beyond immutability, the best languages and libraries offer more shortcuts to reasoning. Shortcuts let you be sure about some things without playing compiler through every possible scenario. Strong typing helps with this: I can be sure about the structure of what the function returns, because the compiler enforces it for me.

Shortcuts are like, I can tell 810 is divisible by 3 because its digits add to a number divisible by 3. I don’t have to do the division. This is not cheating, because this is not coincidence; someone has proven this property mathematically.

Haskell is the most reason-aboutable language, because you can be sure that the environment won’t affect execution of the code, and vice-versa, outside of the IO monad. Mathematical types like monoids and monads help too, because they come with properties that have been proven mathematically. Better than promises in the documentation. More scalable than playing compiler all day.[1]

“Readability” means a lot of things to different people. For Tom, it’s predictability: can he be sure what this code will do? For many, it’s familiarity: can they tell at a blink what this code will do? For me, it’s mostly intent: can I tell what we want this code to do?

Haskellytes find Haskell the most expressive language, because it speaks to them. Most people find it cryptic, with its terse symbols. Ruby is well-regarded for expressiveness, especially in rich DSLs like RSpec.

Is expressiveness (of the intent of the programmer) in conflict with reasoning (about program execution)?

[1] “How do you really feel, Jess?”

We want to keep our programs simple, and avoid unnecessary complexity. The definition of a complex system is: the fastest way to find out what will happen is to run it. This means Ruby is inviting complexity, compared to Haskell. Functional programmers aim for reason-aboutable code, using all the shortcuts (proven properties) to scale up our thinking, to fit more in our head. Ruby programmers trust inferences made from example tests. This is easier on the brain, both to write and read, for most people. It is not objectively simpler.

Quick reference: monads and test.check generators

Combine monads with test.check generators to build them up out of smaller generators with dependencies:

(require ‘[clojure.test.check.generators :as gen])
(require ‘[clojure.algo.monads :as m])
(m/defmonad gen-m 
  [m-bind gen/bind 
   m-result gen/return])

(def vector-and-elem
  (m/domonad gen-m
    [n (gen/choose 1 10)
     v (gen/vector gen/int n)
     e (gen/element v)]
    [v, e]))

(gen/sample vector-and-elem)
;; ([[0 0] 0] 
    [[0 -1 1 0 -1 0 -1 1] 0] 
    [[1 1 3 3 3 -1 0 -2 2] 3]
    [[8 4] 8]…

The generator here chooses a vector length, uses that to generate a vector, uses that to pick an element inside the vector, and then returns a tuple of the vector and the element. The syntax is cleaner than a lot of gen/bind and gen/fmap calls. It looks a lot like ScalaCheck.

I suspect we could define m-zero and m-plus in the monad to get :when conditions as well.

I’m working on a longer post that explains what’s going on here and why we would do it.

Declarative style in Java

The other day at work, we had this problem:

Sort a sequence of rows, based on an input list of columns, where each column might be sorted ascending or descending, and each column might require a transformation first to get to something comparable.

The function interface looks like this:

Here is a concrete example where we sort food items first by their recommended meal (with a transformation that says “breakfast is first, then lunch, then dinner”),  and then by food name in reverse order:

What is the absolutely stupidest way to do this?

The stupidest way to do this is to write your own sort, that happens to order rows by looping through the instruction set. But we don’t do it that way. We know very well that sort is hard, and there are library algorithms to do it. Put the ordering logic in a custom Comparator and pass that into sort.

In Java, that looks like

Collections.sort(input, comparator);

This is a declarative style: sort this, to this specification. It separates the algorithm of sorting from the business logic of which row belongs before which other row. A professional Java developer would not consider rewriting sort; they wrap the logic in a Comparator and pass it in.

Heck, even in the C standard libraries, qsort accepts a function pointer. Sort is hard enough that we don’t complain about passing in comparison logic, however awkward.

As opposed to filter, which we rewrite every time we need part of a list. If I wanted just the breakfast foods, I might write

List breakfastFoods = new ArrayList();
for (item : allFoods) {
  if(item.getMeal().equals(“Breakfast”)) {
return breakfastFoods;

which would be completely imperative, describing the filter algorithm and even allocation and mutation of a new object. Yet this is easy enough that we used to do it all the time. Fortunately Java 8 fixes this:
  .filter(item -> item.getMeal().equals(“Breakfast”));

Or in Java 6:

Iterables.filter(allFoods, BREAKFAST_PREDICATE);[1]

Idiomatic Java is moving further and further toward a declarative style, and that’s a beautiful thing. While filter may be tremendously simpler than sort, but it’s still repetition, and re-implementing it every time is still limiting. The filter method on Stream can perform the operation in parallel, it can weave it in with other stream operations and skip instantiating intermediate lists, it can be lazy and do less filtering — there are many possibilities when we let go of “how” something works and specify only what it must do.

If your co-workers complain about the new ways of thinking, ask them whether they’d rewrite sort. It’s all a matter of degree.

[1] This imports the Guava library and defines a constant:

final Predicate BREAKFAST_PREDICATE = new Predicate() {
   public boolean apply(FoodRow item) {
     return item -> item.getMeal().equals(“Breakfast”);

Left to right, top to bottom

TL;DR – Clojure’s threading macro keeps code in a legible order, and it’s more extensible than methods.

When we create methods in classes, we like that we’re grouping operations with related data. It’s a useful organizational scheme. There’s another reason to like methods: they put the code in an order that’s easy to read. In the old days it might read top-to-bottom, with subject and then verb and then the objects of the verb:

With a fluent interface that supports immutability, methods still give us a pleasing left-to-right ordering:
Methods look great, but it’s hard to add new ones. Maybe I sometimes want to add functionality for returns, or print a gift receipt. With functions, there is no limit to this. The secret is: methods are the same thing as functions, except with an extra secret parameter called this
For example, consider JavaScript. (full gist) A method there can be any old function, and it can use properties of this.

completeSale = function(num) {
console.log("Sale " + num + ": selling " 

+ this.items + " to " + this.customer);

Give that value to an object property, and poof, the property is a method:

var sale = {

customer: "Fred",

items: ["carrot","eggs"],

complete: completeSale

// Sale 99: selling carrot,eggs to Fred

Or, call the function directly, and the first argument plays the role of “this”:, 100)
// Sale 100: selling carrot,eggs to Fred
In Scala we can create methods or functions for any operation, and still organize them right along with the data. I can choose between a method in the class:
class Sale(…) {
   def complete(num: Int) {…}
or a function in the companion object:
object Sale {
   def complete(sale: Sale, num: Int) {…}
Here, the function in the companion object can even access private members of the class[1]. The latter style is more functional. I like writing functions instead of methods because (1) all input is explicit and (2) I can add more functions as needed, and only as needed, and without jumbling up the two styles. When I write functions about data, instead of attaching functions to data, I can import the functions I need and no more. Methods are always on a class, whether I like it or not.
There’s a serious disadvantage to the function-with-explicit-parameter choice, though. Instead of a nice left-to-right reading style, we get:

It’s all inside-out-looking! What happens first is in the middle, and the objects are separated from the verbs they serve. Blech! It sucks that function application reads inside-out, right-to-left. The code is hard to follow.

We want the output of addCustomer to go to addItems, and the output of addItems to go to complete. Can I do this in a readable order? I don’t want to stuff all my functions into the class as methods.
In Scala, I wind up with this:

Here it reads top-down, and the arguments aren’t spread out all over the place. But I still have to draw lines, mentally, between what goes where. And sometimes I screw it up.

Clojure has the ideal solution. It’s called the threading macro. It has a terrible name, because there’s no relation to threads, nothing asynchronous. Instead, it’s about cramming the output of one function into the first argument of the next. If addCustomer, addItems, and complete are all functions which take a sale as the first parameter, the threading macro says, “Start with this. Cram it into first argument of the function call, and take that result and cram it into the first argument of the next function call, and so on.” The result of the last operation comes out. (full gist

\\ Sale 99 : selling [carrot eggs] to Fred

This has a clear top-down ordering to it. It’s subject, verb, object. It’s a great substitute for methods. It’s kinda like stitching the data in where it belongs, weaving the operations together. Maybe that’s why it’s called the threading macro. (I would have called it cramming instead.)

Clojure’s prefix notation has a reputation for being harder to read, but this changes that. The threading macro pulls the subject out of the first function argument and puts it at the top, at the beginning of the sentence. I wish Scala had this!
In case you’re still interested, here’s a second example: list processing.

Methods in Scala look nice:

but they’re not extensible. If these were functions I’d have:

which is hideous. So I wind up with:
That is easy to mess up; I have to get the intermediate variables right.
In Haskell it’s function composition:
That reads backwards, right-to-left, but it does keep the objects with the verbs.

Notice that in Haskell the map, filter, reduce functions take the data as their last parameter.[2] This is also the case in Clojure, so we can use the last-parameter threading macro. It has the cramming effect of shoving the previous result into the last parameter:

Once again, Clojure gives us a top-down, subject-verb-object form. See? the Lisp is perfectly readable, once you know which paths to twist your brain down.

Update: As @ppog_penguin reminded me, F# has the best syntax of all. Its pipe operator acts a lot like the Unix pipe, and sends data into the last parameter.
F# is my favorite!
[1] technical detail: the companion object can’t see members that are private[this]
[2] technical detail: all functions in Haskell take one parameter; applying map to a predicate returns a function of one parameter that expects the list.