Readable, or reason-aboutable?

My coworker Tom finds Ruby unreadable.

What?? I’m thinking. Ruby can be quite expressive, even beautiful.

But Tom can’t be sure what Ruby is going to do. Some imported code could be modifying methods on built-in classes. You can never be sure exactly what will happen when this Ruby code executes.

He’s right about that. “Readable” isn’t the word I’d use though: Ruby isn’t “reason-aboutable.” You can’t be completely sure what it’s going to do without running it. (No wonder Rubyists are such good testers.)

Tom agreed that Ruby could be good at expressing the intent of the programmer. This is a different goal from knowing exactly how it will execute.

Stricter languages are easier to reason about. In Java I can read the specification and make inferences about what will happen when I use the built-in libraries. In Java, I hate the idea of bytecode modification because it interferes with that reasoning.

With imperative code in Java or Python, where what you see is what you get, you can try to reason about it by playing compiler: step through what the computer is supposed to do at each instruction. This is easier when data is immutable, because then you can trace back to the one place it could possibly be set.

Beyond immutability, the best languages and libraries offer more shortcuts to reasoning. Shortcuts let you be sure about some things without playing compiler through every possible scenario. Strong typing helps with this: I can be sure about the structure of what the function returns, because the compiler enforces it for me.

Shortcuts are like, I can tell 810 is divisible by 3 because its digits add to a number divisible by 3. I don’t have to do the division. This is not cheating, because this is not coincidence; someone has proven this property mathematically.
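
If you want to watch the shortcut and the long way agree, here’s the check in Ruby:

810.to_s.chars.map(&:to_i).reduce(:+)  # => 9, and 9 is divisible by 3
810 % 3                                # => 0, the direct check agrees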

Haskell is the most reason-aboutable language, because you can be sure that the environment won’t affect execution of the code, and vice-versa, outside of the IO monad. Mathematical types like monoids and monads help too, because they come with properties that have been proven mathematically. Better than promises in the documentation. More scalable than playing compiler all day.[1]

“Readability” means a lot of things to different people. For Tom, it’s predictability: can he be sure what this code will do? For many, it’s familiarity: can they tell at a blink what this code will do? For me, it’s mostly intent: can I tell what we want this code to do?

Haskellytes find Haskell the most expressive language, because it speaks to them. Most people find it cryptic, with its terse symbols. Ruby is well-regarded for expressiveness, especially in rich DSLs like RSpec.

Is expressiveness (of the intent of the programmer) in conflict with reasoning (about program execution)?


[1] “How do you really feel, Jess?”

We want to keep our programs simple, and avoid unnecessary complexity. The definition of a complex system is: the fastest way to find out what will happen is to run it. This means Ruby is inviting complexity, compared to Haskell. Functional programmers aim for reason-aboutable code, using all the shortcuts (proven properties) to scale up our thinking, to fit more in our head. Ruby programmers trust inferences made from example tests. This is easier on the brain, both to write and read, for most people. It is not objectively simpler.

TDD with generative testing: an example in Ruby

Say I’m in retail, and the marketing team has an app that helps them evaluate the sales of various items. I’m working on a web service that, given an item, tells them how many purchases were influenced by various advertising channels: the mobile app, web ads, and spam email.

The service will look up item purchase records from one system, then access the various marketing systems that know about ad clicks, mobile app usage, and email sends. It returns how many purchases were influenced by each, and uses some magic formula to calculate the relevance of each channel to this item’s sales.
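
To make that concrete before writing any code: the response I’m picturing looks something like this. The shape and numbers are invented for illustration; in the real service the channels might be little value objects rather than hashes. Only the names (channels, influenced_purchases, relevance) matter for the tests below.

{
  item: "fuzzy socks",
  channels: {
    mobile_app: { influenced_purchases: 13, relevance: 42 },
    web_ads:    { influenced_purchases: 2,  relevance: 7  },
    spam_email: { influenced_purchases: 0,  relevance: 0  }
  }
}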

My goal is to test this thoroughly at the API level. I can totally write an example-based test for this, with a nice happy-path input and hard-coded expected output. And then, I need to test edge cases and error cases: when no channels have impact; when they all have the same impact; when one fails; when they time out; et cetera et cetera.

Instead, I want to write a few generative tests. What might they look like?

When I’m using test-driven development with generative testing, I start at the outside. What can I say about the output of this service? For each channel, the number of influenced purchases can’t be bigger than the total purchases. And the relevance number should be between 0 and 100, inclusive. I can assert that in rspec.

expect(influenced_purchases).to be <= total_purchases
expect(relevance).to be >= 0
expect(relevance).to be <= 100

These kinds of assertions are called “properties”. Here, “property” has NOTHING TO DO with a field on a class. In this context, a property is something that is always true for specified circumstances. The generated input will specify the circumstances.

To test this, I’ll need to run some input through my service and then make these checks on each channel in the output. It needs some way to query the purchase and marketing services, and I’m not going to make real calls a hundred times. Therefore my service will use adapters to access the outside world, and test adapters will serve up data.
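
A test adapter can be as dumb as an object that hands back whatever it was constructed with. Something along these lines (a sketch, not the real code; the method name purchases_for is my guess at whatever the service actually calls):

class TestPurchaseAdapter
  def initialize(purchases)
    @purchases = purchases
  end

  # The real adapter would call the purchase system; this one just
  # serves up the canned purchases, whatever item is asked about.
  def purchases_for(item)
    @purchases
  end
end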

result = InfluenceService.new(TestPurchaseAdapter.new(purchases),
                make_adapters(channel_events)).investigate(item)

result.channels.each do |(channel, influence)|
  expect(influence.influenced_purchases).to be <= total_purchases
  expect(influence.relevance).to be >= 0
  expect(influence.relevance).to be <= 100
end

(relatively complete code sample here.)
To do this, I need purchases, events on each channel, and an item. My test needs to generate these 100 times, and then do the assertions 100 times. I can use rantly for this. The test looks like this:

it "returns a reasonable amount of influence" do
 property_of {
  … return an array [purchases, channel_events, item] …
 }.check do |(purchases, channel_events, item)|
  total_purchases = purchases.size
  result = InfluenceService.new(TestPurchaseAdapter.new(purchases),
                make_adapters(channel_events)).investigate(item)

  result.channels.each do |(channel, influence)|
   expect(influence.influenced_purchases).to be <= total_purchases
   expect(influence.relevance).to be >= 0
   expect(influence.relevance).to be <= 100
  end
 end
end

(Writing generators needs a post of its own.)
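
Very roughly, though, the property_of block builds the input out of rantly’s combinators. Something shaped like this, where Purchase and the channel keys are stand-ins for whatever the real data looks like:

property_of {
  purchases = Array.new(range(0, 20)) { Purchase.new(sized(8) { string(:alnum) }) }
  channel_events = {
    mobile_app: Array.new(range(0, 50)) { sized(8) { string(:alnum) } },
    web_ads:    Array.new(range(0, 50)) { sized(8) { string(:alnum) } },
    spam_email: Array.new(range(0, 50)) { sized(8) { string(:alnum) } }
  }
  item = choose("socks", "shoes", "hats")
  [purchases, channel_events, item]
}
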
Rantly will call the property_of block, and pass its result into the check block, 100 times or until it finds a failure. Its objective is to disprove the property (which we assert is true for all input) by finding an input value that makes the assertions fail. If it does, it prints out that input value, so you can figure out what’s failing.

It does more than that, actually: it attempts to find the simplest input that makes the property fail. This makes finding the problem easier. It also helps me with TDD, because it boils this general test down to the simplest case, the same place I might have started with traditional TDD.

In my TDD cycle, I make this test compile. Then it fails, and rantly reports the simplest case: no purchases, no events, any old item. After I make that test pass, rantly reports another simple input case that fails. Make that work. Repeat.

Once this test passes, all I have is a stub implementation. Now what? It’s time to add properties gradually. Usually at this point I sit back and think for a while. Compared to example-based TDD, generative testing is a lot more thinking and less typing. How can I shrink the boundaries?

It’s time for another post on relative properties.

Aqueductron – toying with dataflow in Ruby

I love playing with Ruby because it lets me express concepts clearly[1]. In my aqueductron gem, two concepts are expressed. It’s about processing data, and about code modifying code, all without modifying anything[2].

The metaphor of Aqueductron is an aqueduct. Data is the water, taking the form of droplets flowing through the ducts. Each piece of duct might be a filter, or a map. At the end, there’s a collector of some kind.

The interesting parts of dataflow are where the metaphor breaks down. For instance, take a duct with a split in it. With real water, each drop will take exactly one path. With data, each drop can go down all paths, or none.

In the real world, an aqueduct is constructed before the water ever flows, and the aqueduct stays the same forever. In aqueductron, a duct piece can change with every drop. The split can add paths as it encounters new information.


When the delta between drops is more interesting than the drops themselves, pieces can alter themselves based on what comes through. For added challenge and sanity, aqueductron does this without mutating state.

Since the ducts can change as data flows, it is useful to see what the aqueduct looks like in between drops. The Ruby REPL is handy, and aqueductron is equipped with ASCII art.

Create new paths

For the Lambda Lounge code shootout recently, we implemented some problems from Rosalind.info. Here’s the simplest one, generalized from “count nucleotides” to “count all the characters that you see.” My code is explained in detail here.

Construct a pipe that expands a string into its characters, then creates a path for each unique letter. It is empty at first.

2.0.0> a = Duct.new.expand(->(s) {s.each_char}).
                    partition(->(a) {a.upcase},
                              ->(a) {Duct.new.count})
 => Duct:
— / 
 ~ <  +?
— \  

Send it one string, and the duct creates new paths.

2.0.0> b = a.drip("ACtT")
 => Duct:
      —\
       A  # (1)
      —/
— / —\
 ~ <   C  # (1)
— \ —/
      —\
       T  # (2)
      —/
      +? 

Modify existing paths

Another simple Rosalind problem describes rabbit populations as a modified Fibonacci sequence: a multiplier (k) is applied to the penultimate number as it’s added to the last one to generate the next number. In aqueductron, the duct can learn from the data coming through, changing as the data flows in; each drop is one generation’s rabbit population. When it’s time to make predictions, the pipe uses what it learned. In this case, the pieces are decorated with a description of the function inside them. Code details are here.
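
The recurrence, as I read the problem, is next = last + k * penultimate. A quick check against the numbers that show up in the REPL session below:

next_generation = ->(penultimate, last, k) { last + k * penultimate }
next_generation.call(2, 5, 2)  # => 9,  which is why the duct infers k ~ 2.0 from 2, 5, 9
next_generation.call(5, 9, 2)  # => 19, the prediction it makes for the unknown generation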

Build an empty pipe:

2.0.0> rabbits = Duct.new.custom(empty_fib_function).last
 => Duct:
—\
 ~  last
—/ 

Then drip information about the generations through, and the duct learns.

2.0.0> rabbits.drip(2)
 => Duct:
—–\
 2..  last (2)
—–/ 
2.0.0> rabbits.drip(2).drip(5)
 => Duct:
——-\
 2,5..  last (5)
——-/ 
2.0.0> rabbits.drip(2).drip(5).drip(9)
 => Duct:
————————\
 ..5,9.. starting k~2.0  last (9)
————————/ 

When asked to predict future generations, the duct uses what it has learned.

2.0.0> rabbits.drip(2).drip(5).drip(9).drip(:unknown)
 => Duct:
—————-\
 ..9,19.. k=2.0  last (19)
—————-/ 

Learning dataflow

The part where the flow changes with the data fascinates me. That it changes itself without mutating state fascinates me even more. These are the concepts explored in aqueductron. Look for more aqueductron on this blog, past and future.

————–
[1] where my target audience is devs (like me) who are more comfortable with objects than Lisps.
[2] it’s Ruby, so forcing immutability is a lot of work. Since I’m going for clarity, aqueductron is immutable by choice, not by compiler restriction.

Dataflow in Ruby

Our job is not to write software. Our job is to turn data into information. That we do that through software is an implementation detail. — Dan North, ScanDev 2013.

Two things about data these days: there’s a lot of it, and a lot of it is crap. How can we get information out of that?

I like to think of my code in terms of a pipeline for data, transforming and working it and summarizing it into information. In order to handle growing amounts of data, we can’t process it all at once; we have to handle it piece by piece as it comes in. Each piece gets summarized in all the different ways and then discarded, so we never run out of memory.

With lazy Enumerables (known in other languages as Iterables), it’s easy enough to accommodate the massive data, as it deals with one item at a time. However, every time you run a reduce over the output, the entire input is read again. How can we perform multiple summaries over the same flow of data as it comes through?
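
For example (the file name is made up, but the re-read is the point): with a lazy Enumerable, every reduction walks the source again from the top.

lines = File.foreach("books.csv").lazy.map { |line| line.split(",") }
total = lines.reduce(0) { |sum, row| sum + row.last.to_f }  # reads the whole file
count = lines.count                                         # reads the whole file again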

My goal: process data lazily, with multiple reductions of various subsets, without mutating state. In Ruby, because Ruby is like Play-Doh for programmers.

My code is in a branch of my fp4rd repo (that’s Functional Principles for Ruby Development). This post proceeds with an explanation of the example problem and then how the solution works.

If you can make it to Ruby Midwest this weekend or Windy City Rails in September, come hear about functional principles for Ruby developers. This little project is the extreme endpiece of that talk.

Problem statement

In Chapter 3 of the Pickaxe, there’s an example of a program that parses book-inventory CSV data like this:

Title,ISBN,Price
To Whom It May Concern,48-2342-182u32,98.56
What Is Your Problem,2938-123883-13,13.99

Of course, being data, it doesn’t all look like that. Some of it is completely unparsable:

Title,That Number Under The Barcode,Amount
George of the Jungle,234-34-,99.44

and some of it is parsable but missing information:

Title,ISBN,Price
Your Mother Was a Lizard,,32.99
Your Father Stank of Elderberries,234-2145-ldk-234,

My program will parse these files. It totals the price of all books that are parsable and have a price; it also counts these, and it counts the lines that were read but not totaled.

Crazy-looking solution

Code is here, and here’s the meat of it.

pipe = Pipeline::Pipe.new.
  expand(printing.("— Reading file…",&read_all_lines)).
  through(printing.("1. Converting book",&convert_row_to_book)).
  through(printing.("2. Checking price",&reject_no_price)).
  split(
    invalid: Pipeline::Pipe.new.keeping(->(a){a.invalid?}).count,
    valid: Pipeline::Pipe.new.keeping(printing.("3a. Checking book", ->(a){a.book?})).
      split( count: Pipeline::Pipe.new.count,
             total: Pipeline::Pipe.new.
        through(printing.("3b. Extracting book", ->(a){a.book})).
        through(printing.("4. Pricing",->(a){a.price})).
        answer(Pipeline::Monoid.plus)
      )
  )

result = pipe.flow(ARGV)

totalPrice = result.value(:valid, :total)
validCount = result.value(:valid, :count)
errorCount = result.value(:invalid)

What is it doing?
  • It’s setting up a pipeline, a pipeline with three outlets.
  • It pushes some data through.
  • It follows the three routes to get the answer at each end.

It’s going like this:

Files from ARGV go through one at a time. They get expanded into lines, which go through a transformation into books. Data is sent through all the routes of each split. Some gets filtered out by "keeping".[1]

Crazy internals

As the pipe is constructed, it builds up a list of functions. Each function is an iteratee.

Let’s talk about iterators

Say you have an Enumerable. It represents stuff in a particular order. There are two ways to move through that stuff and do something with it:

External iteration means you control the flow of when to get the next one. Get an external iterator from a Ruby Enumerable by calling .each with no arguments. Then call .next on that whenever you please to get an element. Or a StopIteration exception, when there’s no more data. [2]

> e = [1,2,3].each
> e.next
 => 1

Internal iteration means you say what to do with each element, but the Enumerable controls the process of flipping through them. This is the usual pattern in Ruby.

 > [1,2,3].each { |i| puts i }

Two advantages of internal iteration: no mutating state in my code; the Enumerable can perform cleanup after completion, like closing the input file.
One advantage of external iteration: if I want to stop early, I can. If my objective was to find one book that costs more than $5, controlling when to stop could save a lot of time.

To get the best of both, Oleg Kiselyov came up with this idea of iteratees. You hand the Enumerable a function, but that function doesn’t just return a transformed value (like map).  Instead, your function returns a message, one of:

  • “That’s great man, keep going, and next time do this”
  • “I’m done, and here’s the final result”

If you decide “keep going,” then included in the message is another function: the next piece of data will get passed in to that one. So the function you give the Enumerable gets executed exactly once, and supplies either a result or another function. Higher-order functions in action!

For added message-passing goodness, your function doesn’t always receive data. After the last item in the Enumerable is processed, it sends in End Of File. When you get that, you’d better return a final result.

My iteratees

That’s what happens in the pipeline: each piece of the pipe is an iteratee, which can receive a message with data or an EOF, and returns another piece (for next time) or the final result. For instance, look at the end piece count:

  class CountingEndPiece
    include PieceCommon
    def initialize(soFar = 0)
      @soFar = soFar
    end 
    def eof 
      SimpleResult.new(@soFar)
    end 
    def receive msg 
      CountingEndPiece.new(@soFar + 1)
    end 
  end 

At EOF, it always gives a result. For a piece of data, it always gives back another one of itself with an incremented count.[3]

That was an end piece. What about the middle pieces?

They’re iteratees too, except they delegate to the next piece. A “through” piece does a transformation and then delegates. A “keeping” function delegates only if the predicate is satisfied, otherwise returning itself again, waiting for the next piece of data.
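
For a flavor of that, here’s a sketch of a “keeping” piece in the same shape as CountingEndPiece above. This is not the actual code in the repo, and the names are mine:

  class KeepingPiece
    include PieceCommon
    def initialize(predicate, next_piece)
      @predicate = predicate
      @next_piece = next_piece
    end
    def eof
      @next_piece.eof  # nothing to decide at the end; pass the EOF downstream
    end
    def receive msg
      if @predicate.call(msg)
        # delegate, and keep the new downstream piece for next time
        # (ignoring early results; in this codebase results only happen at EOF)
        KeepingPiece.new(@predicate, @next_piece.receive(msg))
      else
        self  # filtered out: the same piece waits for the next drop
      end
    end
  end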

End Construction

The Pipe uses the builder pattern, and the build is triggered by an end piece: answer, count, or split. Once the end is reached, all the pieces can be hooked to each other, and the inlet returned.

Sorry About the Monoids

I couldn’t resist using monoids for answer. Don’t worry about them: they’re a class that defines both a combining-method and a starting point for the reduce. Adding integers and concatenating strings are different monoids because they have different starting points. The starting points are necessary for the empty data case.
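
In code, a monoid here is about this much. A sketch of the idea, not the actual Pipeline::Monoid implementation:

class PlusMonoid
  def starting_point
    0            # reducing zero prices still gives an answer
  end
  def combine(a, b)
    a + b
  end
end

class ConcatMonoid
  def starting_point
    ""           # the empty string plays the same role for concatenation
  end
  def combine(a, b)
    a + b
  end
end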

Flow

All the data goes in, and is passed to all the iteratees down the line. When everything has returned a Result (which happens at EOF in the implemented cases), then the pipe is resolved.

result = pipe.flow(ARGV)

Catching the output

The pipeline has multiple paths, so follow them using the symbol-labels. That’ll get you to the answer at the end of each pipe.

totalPrice = result.value(:valid, :total)
validCount = result.value(:valid, :count)
errorCount = result.value(:invalid)

Conclusion

There you have it: Iteratees in Ruby, and some sort of dataflow pipeline. This is one way to think about data as a river running through your program without getting washed away. Each piece of incoming data was retained as information: either it was counted, or it was counted and its price totaled.

This is one way to handle a crapton of craptastic data. In Ruby!

——-
[1] I know, I know, “keeping” is filter and “through” is map and “expand” is flatmap. But you know what? These names don’t make that much sense in English. They only make sense to people already familiar with functional programming vocabulary and idioms, and that’s not my primary audience.

[2] Why does Ruby’s external iterator not have some sort of hasNext method? Or is there one and I can’t recognize it?

[3] If I were writing this for production I’d totally keep a stateful count and return itself after each message. But that’s a mutable-state optimization, and this is an exercise in functional style.

Passing functions in Ruby: harder than it looks

People say, “Oh, Ruby has functional programming! We pass blocks around all the time!”

I’m sorry to inform you: Ruby blocks are not first-class functions.

Functional programming is so called because we find it useful to pass functions as values. When we do that, we expect that the function can be called and:

  • it’ll get its own level on the stack
  • when it returns, that level of the stack will go away
  • this stack is the same one used by the caller.

When a function returns, whoever calls it gets the return value and goes on its merry way. Sounds easy, right?

Yeah, not in Ruby. See, the common case in Ruby is to pass code in as a block, something like this:

format_all_the_things { |a| "format #{a} into a string" }

Then inside the higher-order function, the block of code is called using yield:

def format_all_the_things
   one = yield "thing 1"
   two = yield "thing 2"
   [one, two].join " & "
end

Here, the block of code is an invisible parameter. It’s a little like a first-class function, but not really. For one, blocks don’t get their own level on the stack. For another, they don’t execute in the same stack as the caller. Best I can tell, this “yield” keyword fires up a coroutine, which gets its own stack. That’s two violations of my expectations of a first-class function.

Danger: return

Use the return keyword in a block, and you invite all kinds of trouble. See, return means it’s time to knock a level off the stack. Since the block didn’t get its own level in the stack, it has to return from the caller. But in this case, since it was started in a coroutine, there’s nothing there to go back to — LocalJumpError.

 > format_all_the_things { |a| return "yay #{a}" }
LocalJumpError: unexpected return

So, return is your enemy in blocks. Also in Procs.

This is even stranger

Wrap the whole thing in a lambda, and the block doesn’t LocalJumpError; it happily returns from the lambda when it hits the first return keyword inside the block. There’s some magic going on here. This works for blocks, but not Procs. Can anyone explain this to me?

> lam = -> { format_all_the_things { |a| return "yay #{a}" } }
> lam.call
 => "yay thing 1"

Ruby has a third option: lambdas. These behave much more like the functions that I know and love.

lam = lambda { |a| return "yay #{a}" }

Lambdas do (seem to) get their own level on the stack, and when they return they return from themselves, not their caller.
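
Call one directly and return behaves itself, ending only the lambda:

> lam.call("thing 1")
 => "yay thing 1"
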
Unfortunately this still doesn’t work with yield.

> format_all_the_things &lam
LocalJumpError: unexpected return

Dangit! I thought I could trust lambdas, but I was wrong.
Turns out even lambdas behave sanely with return only if accessed using .call() instead of yield. This is the friendly way:

def format_all_the_things(&formatter)
   one = formatter.call("thing 1")
   two = formatter.call("thing 2")
   [one, two].join " & "
end

format_all_the_things &lam
 => "yay thing 1 & yay thing 2"

This is the right way to pass and use functions as values in Ruby: lambdas and .call().

Next

“Just don’t use return!” you may say. Ruby devs are wonderfully disciplined at avoiding the language’s less respectable features.

Sometimes the code is clearer if you can exit the function as soon as you know what the return value is. Especially with one of those handy post-statement ifs:

return "REDACTED" if (a.contains_sensitive_information)

Oddly, there is a control statement that appears to do exactly what I want return to do, and that’s next. No longer just a loop-control statement, in Ruby 2.0 it appears to be everything we could wish return to be.

> format_all_the_things { |a| next "REDACTED" if (a.include?("1")); "yay #{a}" }
 => "REDACTED & yay thing 2"

More danger: Break

There are some circumstances where you can use break within a block to end Ruby’s internal iteration early.

[4,5,-1,7].each { |n| break :invalid if n < 0 }

This works only with blocks, not with Procs (LocalJumpError) or lambdas (ignored). I think it’s evil. Code that exercises flow control on its caller is not a function value.

Conclusion

If you’re going to use a functional programming style in Ruby, use lambdas and invoke them with .call().

——
This is all in Ruby 2.0.

Here are my experiments in blocks, Procs, and lambdas with return, break, and next.