Ultratestable Coding Style

Darn side-effecting programs. Programs that change things in the outside world are so darn useful, and such a pain to test.
what's better than green? Ultra!For every piece of code, there is another piece of code that answers the question, “How do I know that code works?” Sometimes that’s more work than the code itself — but there is hope.

The other day, I made a program to copy some code from one project to another – two file copies, with one small change to the namespace declaration at the top of each file. Sounds trivial, right?

I know better: there are going to be a lot of subtleties. And this isn’t throwaway code. I need good, repeatable tests.

Where do I start? Hmm, I’ll need a destination directory with the expected structure, an empty source directory, files with the namespace at the top… oh, and cleanup code. All of these are harder than I expected, and the one test I did manage to write is specific to my filesystem. Writing code to verify code is so much harder than just writing the code!

Testing side-effecting code is hard. This is well established. It’s also convoluted, complex, generally brittle.
The test process looks like this:

input to code under test to output, but also prep the files in the right place and clear old files out, then the code under test does read & write on the filesystem, then check that the files are correct

Before the test, create the input AND go to the filesystem, prepare the input and the spot where output is expected.
After the test, check the output AND go to the filesystem, read the files from there and check their contents.
Everything is intertwined: the prep, the implementation of the code under test, and the checks at the end. It’s specific to my filesystem. And it’s slow. No way can I run more than a few of these each build.

The usual solution to this is to mock the filesystem. Use a ports-and-adapters approach. In OO you might use dependency injection; in FP you’d pass functions in for “how to read” and “how to write.” This isolates our code from the real filesystem. Test are faster and less tightly coupled to the environment. The test process looks like this:

Before the test, create the input AND prepare the mock read results and initialize the mock for write captures.
After the test, check the output AND interrogate the mock for write captures.

It’s an improvement, but we can do better. The test is still convoluted. Elaborate mocking frameworks might make it cleaner, but conceptually, all those ties are still there, with the stateful how-to-write that we pass in and then ask later, “What were your experiences during this test?”

If I move the side effects out of the code under test — gather all input beforehand, perform all writes afterward — then the decisionmaking part of my program becomes easier and more clear to test. It can look like this (code):

The input includes everything my decisions need to know from the filesystem: the destination directory and list of all files in it; the source directory and list plus contents of all files in it.
The output includes a list of instructions, for the side effects the code would like to perform. This is super easy to check at the end of a test.

The real main method looks different in this design. It has to gather all the input up front[1], then call the key program logic, then carry out the instructions. In order to keep all the decisionmaking, parsing, etc in the “code under test” block, I keep the interface to that function as close as possible to that of the built-in filesystem-interaction commands. It isn’t the cleanest interface, but I want all the parts outside “code-under-test” to be trivial.

simplest possible code to gather input, to well-tested code that makes all the decisions, to simplest-possible code to carry out instructions.

With this, I answer “How do I know this code works?” in two components. For the real-filesystem interactions, the documentation plus some playing around in the REPL tell me how they work. For the decisioning part of the program, my tests tell me it works. Manual tests for the hard-to-test bits, lots of tests for the hard-to-get-right bits. Reasoning glues them together.

Of course, I’m keeping my one umbrella test that interacts with the real filesystem. The decisioning part of the program is covered by poncho tests. With an interface like this, I can write property-based tests for my program, asserting things like “I never try to write a file in a directory that doesn’t exist” and “the output filename always matches the input filename.”[2]

As a major bonus, error handling becomes more modular. If, on trying to copy the second file, it isn’t found or isn’t valid, the second write instruction is replaced with an “error” instruction. Before any instructions are carried out, the program checks for “error” anywhere in the list (code). If found, stop before carrying out any real action. This way, validations aren’t separated in code from the operations they apply to, and yet all validations happen before operations are carried out. Real stuff happens only when all instructions are possible (as far as the program can tell). It’s close to atomic.

There are limitations to this straightforward approach to isolating decisions from side-effects. It works for this program because it can gather all the input, produce all the output, and hold all of it in memory at the same time. For a more general approach to this same goal, see Functional Programming in Scala.

Moving all the “what does the world around me look like?” side effects to the beginning of the program, and all the “change the world around me!” side effects to the end of the program, we achieve maximum testability of program logic. And minimum convolution. And separation of concerns: one module makes the decisions, another one carries them out. Consider this possibility the next time you find yourself in testing pain.


The code that inspired this approach is in my microlib repository.
Interesting bits:
Umbrella test (integration)
Poncho tests (around the decisioning module) (I only wrote a few. It’s still a play project right now.)
Code under test (decisioning module)
Main program
Instruction carrying-out part

Diagrams made with Monodraw. Wanted to paste them in as ASCII instead of screenshots, but that’d be crap on mobile.


[1] This is Clojure, so I put the “contents of each file” in a delay. Files whose contents are not needed are never opened.
[2] I haven’t written property tests, because time.

Declarative style in Java

The other day at work, we had this problem:

Sort a sequence of rows, based on an input list of columns, where each column might be sorted ascending or descending, and each column might require a transformation first to get to something comparable.

The function interface looks like this:

Here is a concrete example where we sort food items first by their recommended meal (with a transformation that says “breakfast is first, then lunch, then dinner”),  and then by food name in reverse order:

What is the absolutely stupidest way to do this?

The stupidest way to do this is to write your own sort, that happens to order rows by looping through the instruction set. But we don’t do it that way. We know very well that sort is hard, and there are library algorithms to do it. Put the ordering logic in a custom Comparator and pass that into sort.

In Java, that looks like

Collections.sort(input, comparator);

This is a declarative style: sort this, to this specification. It separates the algorithm of sorting from the business logic of which row belongs before which other row. A professional Java developer would not consider rewriting sort; they wrap the logic in a Comparator and pass it in.

Heck, even in the C standard libraries, qsort accepts a function pointer. Sort is hard enough that we don’t complain about passing in comparison logic, however awkward.

As opposed to filter, which we rewrite every time we need part of a list. If I wanted just the breakfast foods, I might write

List breakfastFoods = new ArrayList();
for (item : allFoods) {
  if(item.getMeal().equals(“Breakfast”)) {
     breakfastFoods.add(item);
  }
}
return breakfastFoods;

which would be completely imperative, describing the filter algorithm and even allocation and mutation of a new object. Yet this is easy enough that we used to do it all the time. Fortunately Java 8 fixes this:

allFoods.stream()
  .filter(item -> item.getMeal().equals(“Breakfast”));

Or in Java 6:

Iterables.filter(allFoods, BREAKFAST_PREDICATE);[1]

Idiomatic Java is moving further and further toward a declarative style, and that’s a beautiful thing. While filter may be tremendously simpler than sort, but it’s still repetition, and re-implementing it every time is still limiting. The filter method on Stream can perform the operation in parallel, it can weave it in with other stream operations and skip instantiating intermediate lists, it can be lazy and do less filtering — there are many possibilities when we let go of “how” something works and specify only what it must do.

If your co-workers complain about the new ways of thinking, ask them whether they’d rewrite sort. It’s all a matter of degree.

——–
[1] This imports the Guava library and defines a constant:

final Predicate BREAKFAST_PREDICATE = new Predicate() {
   public boolean apply(FoodRow item) {
     return item -> item.getMeal().equals(“Breakfast”);
   }
}

Trains within trains

In Java, there are primitive types and there are objects. Often we want to work with everything as an object, so those primitives get boxed up into object wrappers.
Ruby and Scala say, “That’s silly. Let everything be an object to begin with.” That keeps parameter-passing semantics and comparison and printing operations consistent.

Yesterday while writing code, I thought I was working with a sequence, so I wrote

input map { transformation } map { transformation2 }

Then the compiler complained, because my input was a simple value. Value? pshaw. I’ve clearly gotten used to working with Functors. As that last post says, functors are like trains that hold a value. You can attach transformations like extra train cars, and the result is still one train.

 

So I wrapped my value in an Option and went about my business.

Some(input) map { transformation } map { transformation2 }

Option is a functor, a context that can hold a value and incorporate transformations. However, Option is more than that. As a context, it has a special power: it might hold no value at all.

The map method on Option doesn’t let me use that special power. Map works within the context in that transforming an empty Option has no effect, but map won’t let me switch from a full Option to an empty Option. If I have a transformation that fails, I’d want the Option to turn out empty.

This is the flatmap method: it might map it, or it might flat it…. OK, that doesn’t make any sense. That’s the name of the method, anyway.[0]

As a train, flatmap says, “Instead of telling you what the next car is, I’ll wait for you to give me the value and then I’ll give you a whole new train.” 

The net result is still exactly one train. Unlike mapflatmap lets us use the special power of the context.

List (or Seq) is another functor with a special power: it holds 0 to many values. map transforms values, but flatmap takes advantage of the special power, so the result can contain a different number of values. Maybe twice as many, maybe none.

One more example: Future holds a value may not exist yet. Future.flatmap lets you produce another value that may not exist yet, and the result is still one Future

Each context has a special power, and flatmap lets you exercise that power a second time, always returning one context. The context doesn’t get deeper. You don’t get an Option of an Option back from flatmap; you still have one layer of Option around a value.

The Option train probably evaluates your trains-formation immediately, and gives you back a Some or None right away. A lazy List or a Future might store the instruction for later, or execute it right away — the important thing is you can’t tell the difference.

You know where this is going, right?

A monad is a context with a special power, and a method flatmap [1] that lets you use its special power again and again.

There’s one more consistent feature of all monads: they all provide a constructor that lets you put a value in the context. For Option, that’s Some(x). There’s List(x), and Future.now(x).[2] It says, “if you have a value, you can put it on a train.”

What they DON’T have in common is a way to pull values OUT of the train. This is different for every monad, if it exists at all. We have Option.get, and List can be accessed by index. Future doesn’t really want you to ever get the thing out; if you insist, Await can dig it out for you.

Since you have to do something different to get the value out every time, what’s the point of naming them “monad” at all? It turns out there are things we can do to “anything inside a monad.” One is to build up a computation without (necessarily) running it right away. Right now I’m looking at Process in scalaz-stream, which does this. Yet, many things you now do to a List could be done to a monad in general. You can always map or flatmap, without knowing what special powers you’re working with.

Ever write a function to operate on a sequence of its arguments rather than just one, so that you can call it either way? Take a function that saves items to a database. It can save one or many; maybe you even made it accept a variable argument list. What if, instead, you wrote that function to operate on any monad containing items? Then different callers could pass multiple items in a List, a possibly-nonexistent item in an Option, or an item that hasn’t been determined yet in a Future. The Option would execute the database-inserting action immediately, while the Future would save it for later. Your function returns the result of the insertion in that same context. Now that’s accommodating![3]
If the caller really does want to pass in exactly one item, she could wrap it in the Boring monad that has no special powers at all.[4] Why, with that, I could work in monadic style all the time! map and flatmap all the things!

It sounds crazy now, but hey, we used to manually box up our primitives in Java, too.

——————————–
[0] I can think of a few ways to interpret “flatmap” as a reasonable name for the method, but it’ll be more fun to let people comment.

[1] Haskellers call this flatmap method “bind.” You can combine one within another without getting any deeper, and if you can combine two you can combine as many as you like: therefore, monads are composable.

[2] Haskellers call the single-value constructor “unit.”

[3] technically you could accept any functor, I think.

[4] I totally made that up.

Program relativity: F = d*c^2

Your home and your bank account: one of these is an asset, and one is money. Enter the real estate market, and they become interchangeable, with substantial effort.
Matter and energy: they’re the same at some level. If it were easy to go back and forth, we’d all move around at the speed of light.
Code and data: one of these we expect to vary with every run of our program. Perhaps if treat them as the same, if we pass code around and massage it the way we do data, then our programs can change faster.
If you’re like “Code is data, OK yeah, it’s bytes in a text file,” then you’re selling your house brick by brick. We pass code around at a conceptual level in functional programming. We work with operations; values feel like the special case.
Here’s a tiny example I hit at work today.
Say we have a database, a list of groups, and a countActiveUsers method on the GroupOperations object. We could write that method to operate on the database, using the list of groups, to return the count all the currently-online users in those groups.
def countUsers(groups: Set[Group], db: DB) : Int
Now say I’m starting up a StatusUpdateActor, and I want to give it a way to count the users. The easiest, crappiest way is to pass in all three objects:
new StatusUpdateActor(db, groups, groupOperations)
A cleaner way is to create a closure and pass that in:
val countUsersOperation: () => Int = { groupOperations.countUsers(groups, db) }
new StatusUpdateActor(countUsersOperation)
It can be cleaner still. Consider – what is the GroupOperations class’s job? It knows how to do stuff with groups. Knowing how to do stuff doesn’t mean you’re the one to do it all the time. What if countUsers returned an operation instead of performing an operation?
def countUsers(groups: Set[Group], db: DB) : () => Int
Then we wouldn’t have to wrap it in a closure to pass it in to StatusUpdateActor. It’s right there, an operation. If we want to run it right now, we can. If we want to pass it around, we can.
Better yet, what if I haven’t decided what db to use? Why should I have to provide a DB before getting this operation?
def countUsers(groups: Set[Group]): DB => Int
Now I can get an operation on a DB that returns an Int. I can give it the db right away, or maybe StatusUpdateActor gets a db from a different source, and we want to give it the operation directly.
new StatusUpdateActor( dbSource, groupOperations.countUsers(groups) )
Moving from OO toward FP, I see application no longer as the whole point of a function, and more as the final step in a series of manipulations. Why force a caller to wrap your method in a closure sometimes? Return the operation to begin with, and let the caller apply it when they darn well please.
If you like to use efficiency as an excuse, then perhaps countUsers performs a set operation on the groups and constructs a statement before touching the db. Do all this once and return an operation that uses the statement value once a DB passed is in.
When you perform an operation, you give a man a fish. When you return an operation, you teach him how to fish.
This is one tiny example of focusing on operations. It treats code as data at a high level: here’s a function, you can do a few things to it, like supply some or all of its arguments or give it to someone who might call it later. It’s moving the mass of our program around at the speed of data. It’s a liquid market for our program assets.
People have predicted for the last several years that functional programming is a fad — sorry dudes, this is no speculative bubble. Buy in, because prices are only going up.
———————————-
Would it help if I showed code for the methods declared above? 
And declarations for the called ones? 
I thought it might be more distracting than useful.

What FP taught me about OO: Liskov Substitution Principle explained

TL;DR – functional programming taught me that LSP is a special case of PLS: Principle of Least Surprise.

One thing that bugs me while reading Java: I’m reading along, come to a method call, ctrl-click on the method name, and get a list of implementations of the interface. IntelliJ can’t tell me exactly what will run at this point in the code. Java has dynamic dispatch: it decides at runtime which implementation to run.

Functional style has a different concept to offer instead of Java’s interface. It’s called “typeclass,” which is incredibly confusing. Please accept that this idea has little to do with “type” or “class.”

While an interface says, “This class can perform this method,” a typeclass says, “There exists something that can perform this method on this class.” (using ‘class’ loosely) An interface is baked into a class, while a typeclass instance is provided separately.

With typeclasses, the method implementation isn’t attached to the runtime class. Instead, the compiler digs up the typeclass instance, based on a declared type instead of a concrete runtime type. This isn’t flexible in the same way as OO’s dynamic dispatch, but it has the advantage that we can tell by looking at the code exactly which implementation of the method will execute.

This gives us better ability to read the code and figure out what will happen in all circumstances, without using the debugger. Functional programmers call this “reasoning about code.”

In Java, how can we make statements about what the code will do when we don’t know exactly what code will run? One answer is: Liskov Substitution Principle. This OO design principle says, If you inherit from some other class or implement an interface, then be sure your class meets all expectations people have of the superclass or interface. The compiler enforces that our methods have the right signatures, while LSP demands that our implementation have the same invariants, postconditions, and other properties implied the documentation of the superclass or interface.

For instance, if you create an implementation of List with a .length() method that sometimes returns -1, you violate the expectations people have for Lists. This surprises people. Surprises are bad. Surprises stop us from accurately predicting how code will execute.

Therefore, the Liskov Substition Principle says, “Don’t surprise people.” If the documentation of List implies that length() will return a nonnegative integer, always do that. Also make sure that your List is 0-indexed like everyone else’s. To implement an interface is to take on all the implicit and explicit promises made by that interface. That’s LSP.

In Haskell or Scala, we can use typeclasses and be certain what the code will do at runtime. In Java, we can have the flexibility of dynamic dispatch and still have some knowledge of what will happen at runtime as long as we respect the Liskov Substitution Principle.

Twisting the rules of logic in our code

In philosophy, there are very few things that can’t be doubted. The basic laws of logic are among them. There’s one that seems completely obvious and indisputable to normal people:

Law of Identity: Everything is identical to itself.

In my audiobook, the professor is going on about how not only is this statement true in our world, it’s true in every conceivable world. And I’m thinking, “Oh, I can conceive of such a world where this is false! I’ve created one!” The programmer can create worlds the philosopher cannot conceive of.

Take Hibernate for example, which tries to represent a database in Java objects. It’s optimal to define object identity based on a set of fields, rather than creating a gratuitous sequential ID. But woe betide the programmer who includes in the set of identity fields one that can be updated! I change your last name, and bam, your entity is not identical to itself. Been there, done that, learned not to do it again. It takes rigor, knowledge, and discipline to use Hibernate without creating a world that violates this most basic law of logic.

Whaaa? Why is this even possible? This is exactly what functional programmers are talking about with “reasoning about code.” FP is about languages or practices that make sure our programs conform to the laws of logic, so they don’t up and surprise us with objects that are not identical to themselves.

In Java, we can violate the Law of Identity when we define .hashcode() poorly and then stick objects in a HashMap. One key goes in, a property referenced in the key’s .hashcode() changes, and then we try to get that entry out – no luck. It is not identical to itself. Things also get nutty when .equals() and .hashcode() are not dependent on the same fields. In Java, we have to be careful to create objects that follow the Law of Identity.

A Haskell programmer says, ridiculous! Why would we do this to ourselves?

After the Law of Identity, there’s two laws attributed to Liebniz. The first one says identical objects have the same properties. This is a little harder to screw up in Java, unless:

class TrickyBugger {
  public double getHappiness() { return Math.random; }
}

The second form of Liebniz’s law says if two things have the same properties, then they are identical. In reality and philosophy this law is controversial. Yet, when it’s true, it’s useful. In programming we can choose to keep this one true.

Enter the Algebraic Data Type. This is every type in Haskell, and the most useful sort of data type in Scala. A class is an algebraic data type if, when two instances have identical properties, they are considered identical. (All the properties are immutable.) In OO, the Flyweight pattern is an example of this. The concept of ADTs means that even if two instances live at different places in memory, when they have all the same properties, they might as well be the same instance for all we care.

This is useful because it keeps things simple in my head. I don’t have to ask myself “should I use == or .equals in this comparison?” It’s always .equals(), because properties determine identity. Back in Hibernate, this is the concept I needed. Keys should be ADTs, so naturally their fields will never update.

Programming is supposed to be the epitome of logical work. I went into programming because physics wasn’t deterministic enough. Yet, here we are, in super-popular languages and frameworks, violating the most basic laws of logical inference! Programming is deterministic, and we can choose to create modules that conform to the logical principles our brain expects. Let’s do that.

Dataflow in Ruby

Our job is not to write software. Our job is to turn data into information. That we do that through software is an implementation detail. — Dan North, ScanDev 2013.

Two things about data these days: there’s a lot of it, and a lot of it is crap. How can we get information out of that?

I like to think of my code in terms of a pipeline for data, transforming and working it and summarizing it into information. In order to handle growing amounts of data, we can’t process it all at once; we have to handle it piece by piece as it comes in. Each piece gets summarized in all the different ways and then discarded, so we never run out of memory.

With lazy Enumerables (known in other languages as Iterables), it’s easy enough to accommodate the massive data, as it deals with one item at a time. However, every time you run a reduce over the output, the entire input is read again. How can we perform multiple summaries over the same flow of data as it comes through?

My goal: process data lazily, with multiple reductions of various subsets, without mutating state. In Ruby, because Ruby is like Play-Doh for programmers.

My code is in a branch of my fp4rd repo (that’s Functional Principles for Ruby Development). This post proceeds with an explanation of the example problem and then how the solution works.

If you can make it to Ruby Midwest this weekend or Windy City Rails in September, come hear about functional principles for Ruby developers. This little project is the extreme endpiece of that talk.

Problem statement

In Chapter 3 of the Pickaxe, there’s an example of a program that parses book-inventory CSV data like this:

Title,ISBN,Price
To Whom It May Concern,48-2342-182u32,98.56
What Is Your Problem,2938-123883-13,13.99

Of course, being data, it doesn’t all look like that. Some of it is completely unparsable:

Title,That Number Under The Barcode,Amount
George of the Jungle,234-34-,99.44

and some of it is parsable but missing information:

Title,ISBN,Price
Your Mother Was a Lizard,,32.99
Your Father Stank of Elderberries,234-2145-ldk-234,

My program will parse these files. It totals the price of all books that are parsable and have a price; it also counts these, and it counts the lines that were read but not totaled.

Crazy-looking solution

Code is here, and here’s the meat of it.

pipe = Pipeline::Pipe.new.
  expand(printing.(“— Reading file…”,&read_all_lines)).
  through(printing.(“1. Converting book”,&convert_row_to_book)).
  through(printing.(“2. Checking price”,&reject_no_price)).
  split(
    invalid: Pipeline::Pipe.new.keeping(->(a){a.invalid?}).count,
    valid: Pipeline::Pipe.new.keeping(printing.(“3a. Checking book”, ->(a){a.book?})).
      split( count: Pipeline::Pipe.new.count,
             total: Pipeline::Pipe.new.
        through(printing.(“3b. Extracting book”, ->(a){a.book})).
        through(printing.(“4. Pricing”,->(a){a.price})).
        answer(Pipeline::Monoid.plus)
      )
  )

result = pipe.flow(ARGV)

totalPrice = result.value(:valid, :total)
validCount = result.value(:valid, :count)
errorCount = result.value(:invalid)

What is it doing?
It’s setting up a pipeline, a pipeline with three outlets.
It pushes some data through.
It follows the three routes to get the answer at each end.

It’s going like this:

Files from ARGV go through one at a time. They get expanded into lines, which go through a transformation into books. Data is sent through all the routes of each split. Some gets filtered out by “keeping“.[1]

Crazy internals

As the pipe is constructed, it builds up a list of functions. Each function is an iteratee.

Let’s talk about iterators

Say you have an Enumerable. It represents stuff in a particular order. There are two ways to move through that stuff and do something with it:

External iteration means you control the flow of when to get the next one. Get an external iterator from a Ruby Enumerable by calling .each with no arguments. Then call .next on that whenever you please to get an element. Or a StopIteration exception, when there’s no more data. [2]

> e = [1,2,3].next
> e.next
 => 1

Internal iteration means you say what to do with each element, but the Enumerator controls the process of flipping through them. This is the usual pattern in Ruby.

 > [1,2,3].each { |i| puts i }

Two advantages of internal iteration: no mutating state in my code; the Enumerable can perform cleanup after completion, like closing the input file.
One advantage of external iteration: if I want to stop early, I can. If my objective was to find one book that costs more than $5, controlling when to stop could save a lot of time.

To get the best of both, Oleg Kiselyov came up with this idea of iteratees. You hand the Enumerable a function, but that function doesn’t just return a transformed value (like map).  Instead, your function returns a message, one of:

  • “That’s great man, keep going, and next time do this”
  • “I’m done, and here’s the final result”

If you decide “keep going,” then included in the message is another function: the next piece of data will get passed in to that one. So the function you give the Enumerable gets executed exactly once, and supplies either a result or another function. Higher-order functions in action!

For added message-passing goodness, your function doesn’t always receive data. After the last item in the Enumerable is processed, it sends in End Of File. When you get that, you’d better return a final result.

My iteratees

That’s what happens in the pipeline: each piece of the pipe is an iteratee, which can receive a message with data or an EOF, and returns another piece (for next time) or the final result. For instance, look at the end piece count:

  class CountingEndPiece
    include PieceCommon
    def initialize(soFar = 0)
      @soFar = soFar
    end 
    def eof 
      SimpleResult.new(@soFar)
    end 
    def receive msg 
      CountingEndPiece.new(@soFar + 1)
    end 
  end 

At EOF, it always gives a result. For a piece of data, it always gives back another one of itself with an incremented count.[3]

That was an end piece. What about the middle pieces?

They’re iteratees too, except they delegate to the next piece. A “through” piece does a transformation and then delegates. A “keeping” function delegates only if the predicate is satisfied, otherwise returning itself again, waiting for the next piece of data.

End Construction

The Pipe uses the builder pattern, and the build is triggered by an end piece: answer, count, or split. Once the end is reached, all the pieces can be hooked to each other, and the inlet returned.

Sorry About the Monoids

I couldn’t resist using monoids for answer. Don’t worry about them: they’re a class that defines both a combining-method and a starting point for the reduce. Adding integers and concatenating strings are different monoids because they have different starting points. The starting points are necessary for the empty data case.

Flow

All the data goes in, and is passed to all the iteratees down the line. When everything has returned a Result (which happens at EOF in the implemented cases), then the pipe is resolved.

result = pipe.flow(ARGV)

Catching the output

The pipeline has multiple paths, so follow them using the symbol-labels. That’ll get you to the answer at the end of each pipe.

totalPrice = result.value(:valid, :total)
validCount = result.value(:valid, :count)
errorCount = result.value(:invalid)

Conclusion

There you have it: Iteratees in Ruby, and some sort of dataflow pipeline. This is one way to think about data as a river running through your program without getting washed away. Each piece of incoming data was retained as information: either it was counted, or it was counted and its price totaled.

This is one way to handle a crapton of craptastic data. In Ruby!

——-
[1] I know, I know, “keeping” is filter and “through” is map and “expand” is flatmap. But you know what? These names don’t make that much sense in English. They only make sense to people already familiar with functional programming vocabulary and idioms, and that’s not my primary audience.

[2] Why does Ruby’s external iterator not have some sort of hasNext method? Or is there one and I can’t recognize it?

[3] If I were writing this for production I’d totally keep a stateful count and return itself after each message. But that’s a mutable-state optimization, and this is an exercise in functional style.

From imperative to data flow to functional style

Functional style makes code more maintainable. How? One way makes processing into a flow of data with discrete steps. Another way separates flow from context, making the context swappable. This post illustrates both.

We’ll go from an imperative style to a more and more functional style. Scala is a good language for this illustration, since it supports both. If you want to run this, grab the code from github and :load it in the Scala REPL.

Example: Given a filename, we extract the message from its first line:

The secret message is ‘…’

Watch out for files that don’t exist, are empty, or don’t follow the format. In any of these cases, return None. If everything looks good, we get a SecretMessage containing the quoted text from the file.

case class SecretMessage(meaning: String) 

The type of the function is:

String => Option[SecretMessage]

Note: In good old-fashioned style, this function could return null if it can’t provide a result. I may be starting out imperatively, but not that dirty. Option is better: None represents no result, or Some contains a value. Either way, we get a valid object reference and no NullPointerExceptions.

Note: Since the SecretMessage is input and output as a string, we could encode it as a String. This isn’t specific enough for my taste. I like my types to tell me what the thing is, not just how it is read in or output. 

Imperative

The imperative implementation is straightforward IF you like to play compiler in your head. Open the file, check whether it exists. If it does, read it; check whether it’s empty. If it isn’t, read the first line and then use a regex to check its format and extract the secret message. Nothing weird there, but you have to step through every line to know what the method is doing.

def imperativeStyle(filename : String)
  : Option[SecretMessage]
= {
  val file = new java.io.File(filename)
  if(!file.exists)
    None
  else {
    val source = io.Source.fromFile(file)
    if (!source.hasNext) {
      source.close
      None
    }
    else {
      val firstLine = source.getLines.next
      source.close
      val ExpectedFormat = “The secret message is ‘(.*)'”.r
      firstLine match {
        case ExpectedFormat(secretMessage) =>
          Some(SecretMessage(secretMessage))
        case _ => None
} } } }

Data Flow

Think about what this function is trying to do. Take a filename, get a file (that exists), get its first line (if it isn’t empty), extract a message (if it’s in a certain format). These are three transformations, and each of them might not work. These become three methods:

def openFile(n: String) : Option[java.io.File]
def readFirstLine (f : java.io.File) : Option[String]
def parseLine(line : String) : Option[SecretMessage]

If any of these return None, then our function will return None. A Some return means processing continues.

This makes two pipelines: one where processing continues, and one where the None value is just passed along; only the type parameter changes to meet the required output of the function. This is like the gutter in a bowling alley.

Now, to implement this. The three transformation functions I’ve stuck in a module called transformers; it’s the form of the surrounding function that’s interesting. We’ll walk through four refactors that make it more and more readable.

Minimally, we can change the if statements to call the transforming functions.

def useTransforms(filename: String)
  : Option[SecretMessage]
= {
  import transformers._
  val fileOption = openFile(filename)
  if (fileOption.isEmpty)
    None
  else {
    val lineOption = readFirstLine(fileOption.get)
    if (lineOption.isEmpty)
      None
    else {
      parseLine(lineOption.get)
} } }

This is a little cleaner than the original. The transforming function names call out what each is accomplishing. The separation of these pieces of functionality leads to more lines of code, but now each piece can be tested individually. That’s always good!

Moar functional

Checking isDefined on options is ugly. The more idiomatic way to branch on Option is to use pattern matching.

def patternMatching(filename:String)
  : Option[SecretMessage]
= {
  import transformers._
  openFile(filename) match {
    case None => None
    case Some(file) =>
      readFirstLine(file) match {
        case None => None
        case Some(line) =>
          parseLine(line)
} } }

This is shorter, but it still has a repetitive pattern. With each transformation step, we’re manually coding that gutter flow.

The next trick is to recognize that Option is a collection. It’s a special collection, containing either one or zero items. Then notice that each transformation operates on the single item in the collection returned by the previous step. What method applies a function to items in a collection? map! Except that is not quite right, because map transforms a collection element into another element. Each of our transformers return another collection of zero or one items. What method applies a function to items in a collection and turns each into another collection? flatMap!

def chainOfMaps(filename:String)
  : Option[SecretMessage]
= {
  import transformers._
  openFile(filename).
    flatMap(readFirstLine).
    flatMap(parseLine)
}

Now this is about as short as it gets. The gutter pipeline is encoded in flatMap itself, because a flatMap on None will return another None. How can it get cleaner than this?

There’s another way to process collections in Scala: the for comprehension. it lets us line things up in a more readable fashion, and then it calls flatMap and map. Here, we give names to intermediate values in the pipeline.

def forComprehension(filename:String) : Option[SecretMessage] = {
  import transformers._
  for( file <- openFile(filename);
       line <- readFirstLine(file);
       secretMessage <- parseLine(line))
  yield { secretMessage }
}

In IntelliJ, it lets me hover over these variables names to see their types. In my experience, this style is drastically easier to get right, to debug, and to follow when reading. These benefits are worth a few extra lines and curly braces.

What have we accomplished?

We turned a step-by-step imperative function into a very concise concatenation of three transformations. We removed all manual handling of the gutter pipeline. We broke out the transformations into testable bits.

Now for the cool part.

Here, we’re using the Option class as a context for our data. It is a context that says, “I might have a result, and I might not. If I don’t, roll the ball down the gutter and ignore future transformations.”

Requirements change! The caller of our function wants to know why no secret message was found. Instead of None, return a failure type with a message. Implement this by changing the context the work is done in. The work for this is in another file, if you want to run it yourself.

That context is either a Failure or a Success. It is a generic context, but at the end of our function a Success will contain a SecretMessage, while a Failure contains only an error message.[1]

Change each transformer to output Result instead of Option. They each include a descriptive message upon Failure.

One more step is needed to make Result operate as a context the same way Option does. Scala’s for comprehension will work with any type that implements map and flatMap. Add the declarations to the common Result trait, and then the implementations to each concrete class.

sealed trait Result[A] {
  def flatMap[B](f : A => Result[B]) : Result[B]
  def map[B](f : A => B): Result[B]
}
case class Success[A](value : A) extends Result[A] {
  def flatMap[B](f : A => Result[B]) = f(value)
  def map[B](f : A => B) = Success(f(value))
}

case class Failure[A](message: String) extends Result[A] {
  def flatMap[B](f : A => Result[B] ) = Failure(message)
  def map[B](f : A => B) = Failure(message)
}

Success applies the passed-in functions to the value inside it. Failure always returns another Failure, with the same message but a different type parameter. This is the gutter where errors flow.

Here’s the interesting bit: when it comes to changing our function, we almost don’t. The return type changes. The pipeline is identical!

def forComprehension(filename:String) : Result[SecretMessage] = {
  import transformers._
  for( file <- openFile(filename);
       line <- readFirstLine(file);
       result <- parseLine(line))
  yield { result }
}

This is a key concept of functional programming: separating out context. Sometimes that context contains state, other times isolated side effects (I/O), other times order of execution (synchronous, asynchronous, parallel). We can abstract things that used to be inherent. An imperative style cements mutable state and step-by-step order of execution; changing this is a huge amount of work and easy to get wrong. Functional style, once you get used to it, gives us more degrees of freedom.

Next time you write a method longer than a few lines, look for a path of data transformation. It might lead you someplace elegant.


[1] Sure, this sounds like an Either. If only Either implemented flatMap, it would completely work for this.

Functors: What the funk for?

For all the programmers who don’t deeply grok the lambda calculus terminology —

Say you are about to call a method on a container, and that container can give you something back of type Tweet. What you really want isn’t a Tweet, but some part of it, say Tweet.getId(). What if, instead of getting the Tweet back from the container and then calling getId() on it, you prefer to tell the container to get the id for you, and return only what you wanted? Then you need a functor.

So you make a quick function literal (tweet => tweet.getId()) that pulls the id out of a Tweet, and pass that in to the functor. The output is the same kind of container, only it holds IDs instead of Tweets. A functor isn’t a special type; it’s a function with a particular purpose.

Why would you want to do this? I used to think that too. Then…

I have an Iterable tweets, and I want to get the ID of the first element. In this example, I’m using Java with the Guava library. The function to call is Iterables.getFirst, which requires a default value in case the iterable is empty. As it happens, I have a default value for the ID, but no Tweet that contains this default ID. Short of constructing a stupid dummy Tweet that holds my default ID, I’m stuck with:

Tweet firstTweet = Iterables.getFirst(tweets, null) 
return firstTweet == null ? defaultId : firstTweet.getId() 

I have to create an identifier and then check stuff on it. This is not declarative enough or simple enough for my taste. I’d rather do this (using Java 8 lambda syntax for brevity):

return Iterables.getFirst(tweets, t => t.getId(), defaultId)

In this hypothetical example, I’m telling that getFirst function to run my function on the tweet before returning. If tweets is empty, then the function can return defaultId. So I’m supplying a default that is meaningful to me, the function is going to return the same type whether there’s a tweet or not, and everyone is happy.

But, that method doesn’t exist. And it would be a pain to an optional function argument to the interface of all the functions in Iterables. Thinking in terms of functors, I can solve this problem with functions that do exist in Iterables:

Iterables.getFirst(Iterables.transform(tweets, t => t.getId()), defaultId)

Here, transform is a functor: it applies the supplied function to the elements inside the iterable, returning an iterable of the function-output type. This means a tweet gets turned into a tweetId before getFirst sees it.

To a long-time Java dev like me, this looks inefficient. Do work on every tweet in tweets just to get the first one out differently? Ahhh, but Guava iterables are lazy! That function is not applied until somebody calls next() on the iterable returned by transform(tweetst => t.getId()). Iterables.getFirst calls next() at most once, so only the first tweet is transformed. Therefore, I have exactly what I wanted: turn that first tweet (if there is one) into an id before giving it back to me. The type of defaultId matches the element type of transform(tweetst => t.getId()).

The lesson of functors is: you don’t have to take something out of a container in order to operate on it.

In OO design, there’s a principle of “Tell, don’t ask.” Classes are supposed to tell each other what to do, rather than asking for internal data. This is an example of that — the Iterable has tweets, and I want the ID of the first one. Using the functor, I tell it “give me the result of this function applied to your tweet.” This is better encapsulation than pulling out the whole tweet and operating on it.

In this example, a functor is a function that does a transformation on data inside a context. In this example the context is an Iterable and its contents are Tweets. The transformation is a functor from Tweet -> ID. This way, an Iterable can give me back an ID, exactly what I wanted in the first place, without me ever having to see its Tweet.

Look! A functional trick can make my Java more OO than ever.


Caveat: there are many definitions for Functor out there, and different types of functors. This is one.

Brains, computers, and problem solutions

Programmers tell the computer what to do.

The computer has a processor, it has some memory, and it follows a set of instructions. The instructions say what to load into memory, what to do with the values in memory, how to change that memory, and when to store it off somewhere else. There is an instruction pointer that shows the next instruction to execute. There are tests that can change where the instruction pointer goes next.

(approximation of a Harvard architecture)

Assembly language is the closest to pure computer programming, to telling the computer exactly what to do. Imperative languages like C are an abstraction above that. We still tell the computer what to do, in what order, and how to change the values in memory. Code is in one place, data in another. Object oriented languages like Java offer a few more abstractions, organizing relevant code with data, hiding memory management. These languages are still imperative: we tell the processor what to do and in what order.

Now that we don’t program business software in assembler, do programmers still tell the computer what to do? We tell the compiler what to do, and that tells the runtime what to do, and that tells the computer what to do. What do we really do?

Programmers solve problems.

When we learn to code we learn to think like the computer. But what if that is not the optimal way to think? what if we think instead of how the problem looks, what the solution looks like?

(it’s abstract, OK? It’s supposed to represent a data flow.)

Programming in a declarative style means stating solutions, not steps. State what is true, or state what we’re doing – don’t describe how the computer goes about it. That’s an implementation detail for the compiler to optimize. (I’m talking business software here, not embedded systems.) Code in a language suited to the problem. This gives us more ways to solve problems, and cleaner ways to solve problems.

We’re not limited to what comes naturally to a computer, or what comes naturally to our brains.

Abstraction is key. The right abstractions make our code easy to read and faster to write and maintain. Don’t limit abstractions to the ones that fit the computer. Don’t limit them to objects that model business entities. This is why learning math is useful, and it’s why functional programming is useful: the more abstractions we know how to hold in our head — abstractions with no analog in either the real world or the computer world — the more tools we have for solving problems.