Monday, May 13, 2013

Two Models of Computation: or, Why I'm Switching Trains


"In cognitive science, we only use the lambda-calculus model of computation," says Dr Philipp Koralus (Phd, Philosophy & Neuroscience, on his way to be a lecturer at Oxford). "We want to talk about what the system is doing, and abstract the how."

Two models of computation: lambda calculus and the Turing machine. Teacher and student, Church and Turing.
Years later, engineering catches up and we start building computers.
John McCarthy favored an encoding of the lambda calculus. John Von Neumann built a Turing machine, with a separation of processing and data. "Von Neumann won because he was smarter," says Barbara Liskov. Von Neumann's particular type of intelligence was a huge working memory: he could hold a crapton of state in his head and manipulate it. Practical computation took the Turing machine route, stateful and imperative.

Von Neumann's intelligence is not reproducible. We looked for ways that normal people could write and maintain programs in this architecture. We abstracted to assembly, then C, then object-orientation. All abstractions that let us partition the program into small enough bits that each bit can fit in our heads. We created some design principles to add enough formal structure that, with great effort, we could port some pieces of solutions to other problems.

Meanwhile, lambda calculus and functional languages fell to academia. Now they're coming back. Now memory and processes are plentiful enough that we can afford to force a different computational model onto Von Neumann's architecture. But why would we?

What's the difference that makes the functional model of computation more useful to theoreticial cognitive scientists, who are modeling the reasoning method of the human mind? Philip says it's the semantics and the systematicity.  "You can't get arbitrary novelty and recombination without a formal system."

But what about reuse in imperative languages? We do manage this. "You can get similarity in templates," says Philipp. We can have design patterns, frameworks, and libraries. We can attend conferences and learn the tools people have crafted so we can recognize when our problem might mostly fit. Bur for arbitrary novelty, we have to get down to the language and code it ourselves. With our fingers.

I used to think that FP concepts like monads and monoids were like OO design patterns, only more abstract. Now I see that they are fundamentally different. If code can be represented in a formal symbolic notation, then any piece can be broken away and used differently in a different place. Like you can pull any subexpression out of a mathematical equation and think about it on its own.

It's the difference between a neural network, which can learn to recognize faces, and a symbolic representation of faces that says, "This is a nose. This is an eye. This is the distance between the nose and eye." The symbolic system gives us meaningful dimensions we can use independently and learn from. The neural network tells us "You" or "Not you."

Then, composition. In OO our idea of composition is a has-a relationship. Yet, the owner object must be written to accommodate and pass messages through to its components. Contrast this with functional composition, which works like this:

find . -name '*.txt' | xargs cat | grep 'ERROR' | cut -d ':' -f 2 | sort | uniq -c

Each symbol here, "find", "cat", "grep", "sort" has an independent meaning and is useful alone. Functional composition is a fits-together relationship. Neither part knows anything about the others. We piece them together to serve arbitrary purposes. Best of all, we don't have to be Von Neumann to understand any given piece or conjunction.

Now that the field of software development recognizes the value of a more "tell us what it does" declarative and recombinable-without-template-matching computational model, some of us are struggling. We grew up learning to think like Von Neumann.
I used to say at interviews that my best skill was holding the whole system in my head. Now I recognize that this was a crappy way to reason. It doesn't scale, and I can't pass that understanding on to others as a whole. The wiser goal is to eliminate a need to load everything into one head. It's going to be a tough transition from the relatively-informal structure I'm used to, but now I'm sure it's worth it. I only hope I can take some of you with me.

Sunday, April 28, 2013

Evaluating the Value of Code

In retail marketing, there is a metric of customer importance called RFM: Recency, Frequency, Monetary value. How long ago was the last order? How many times have they ordered? How much money have they spent with us? This determines how valuable the customer is, and how much effort the company should put in to keeping them happy.

R + F + M => $$$

At ScanDev, Dan North spoke about patterns of development that he observed in a phenomenally productive, short-turnaround, high-quality team. This team changed the application daily, reacting to new requirements instantly. Yet, the codebase didn't degrade over time. Why? How were they able to accomplish what we keep telling our managers is impossible?

For one thing, they recognized: not all code is equally valuable.

Dan's team didn't refactor each feature to near-perfection before committing it and trying out in production. Many features would change or get deleted within a few weeks of implementation. Time spent refactoring features that soon become obsolete is worse than wasted: reducing duplication means introducing indirection, which makes other code less simple.

Don't get me wrong - Dan's team refactored constantly. But they spent their refactoring energy on code that had proved its value, rather than on today's hot new thing. They put module-level tests around the new stuff, and deeper and more detailed tests around code that turned out to be critical. They let time pass before determining which code was important. Much like developers suck at predicting performance problems as we code[1], and we're lousy at predicting change[2], sometimes we also stink at predicting which piece of code is important.

What if we could measure the value of code the way retailers evaluate customers?
There are metrics we could use to suggest an expected value of code - the importance of the code to the business and to developers.

R - Recency: the more recently the code was written, the lower its expected value. Our VCS can tell us.
F - Frequency: how often is the code executed in production? We could sample this with a profiler.
M - Modification: how many times has the code changed? VCS has the answer.

What if these three metrics were compared against estimates of localized technical debt? Tools like SONAR quantify technical debt, but they say nothing about code value. If we could stack up code coverage, style violations, and complexity against the importance of a piece of code, this would help us focus our refactoring efforts.

What of the low-value, high-technical-debt code? The ugly stuff that's still accumulating in our codebase because it is not important enough to fix? Delete it. A premise of the delayed-refactoring   strategy is that some code will go away. Pruning little-used features out of the codebase is necessary to keep it sane. Deleting code is the best refactoring: it's like declaring technical-debt bankruptcy, but without the bad credit rating.[3]

Google does this, nixing the least-used features and products. We think, "Why did they take it away? It was already working, it wasn't costing them anything!" but it was. Because every new feature comes with that silent requirement "... and everything else works too." And the rat's nest grows.[4]

Another potential problem: what if we go back to refactor and we can't figure out what the code does? Ay, there's the rub. Delaying refactoring doesn't mean writing cryptic code. It is even more essential that your code convey its intent with good naming and concrete expressions of what you're doing. Avoid abstraction. Let repetitive code live a while longer. Keep it contained, though: favor adding code over changing existing code in many places, even if that means you're copying and pasting.

The Last Responsible Moment for design is later than you think, and not all technical debt is created equal. There is a cost to careful design, and we can weigh this cost against a value only measured in production.

------------
[1] Premature Optimization Is The Root Of All Evil -- Knuth
[2] You Ain't Gonna Need It -- XP
[3] It's like promising your kid chocolate, and then by snacktime she decides she doesn't like chocolate anymore and you get to eat the chocolate. Good-bye, bad code. Hello, faster compile times!
[4] There's this great song by Bobaloo that goes,
"My hair had a party last night.
When I laid down everything was all right.
It started out friendly but there must have been a fight!
My hair had a party last night." 
Feature-heavy applications feel like that. Gotta brush 'em out and split 'em up. Or cut 'em off.

Thursday, April 25, 2013

Property-based testing: what is it?

What is this property-based thing?

Property-based tests make statements about the output of your code based on the input, and these statements are verified for many different possible inputs.

A property-based testing framework runs the same test over and over with generated input. The canonical framework is QuickCheck in Haskell. My experience is with ScalaCheck.

This contrasts with example-based testing, which is most of the tests we write these days. Each unit test sets up one input scenario, runs the code under test, and then checks the output (and any other effects the code is supposed to have).

Instead, one property-based test runs hundreds of times with different inputs. The testing framework will try to get the test to fail by passing empty lists, negative values, all the possible edges cases. It'll pass in long lists and high numbers and strings with special characters.

The property-based tester has to think very carefully about the specification. What kind of input is supported? Encode that in the preconditions of the test. How can the input be generated? For custom types, this takes at least a bit of code. What statements can we make about the output? Encode all these in one test, or one test per statement. This is hard!

For example, say I'm writing a function that takes in a bunch of sets
[B,F] [A,B,C] [A,D,E,F] [C] [B,F] [R]
and chooses a number of these sets to return, with the goal of returning the minumum number of sets that still include all the elements in all of the input sets. (Really, I did this at work a while back.) So the optimal output here is:
[A,B,C] [A,D,E,F] [R]
For my property-based test, the input is any set of sets. I can use sets of integers, since anything with an equals method will do.
I must make the following statement about the output:
  1. Every element in the input is also in the output
I could state the following, to sanity-check my function:
  1. Every output set was in the input
  2. The quantity of output sets is less than or equal to the input
  3. The same set never appears more than once in the output
  4. No output set is a subset of any other output set
And then the statement I'd really like to make:
  1. For every other possible combination of input elements such that (1) is true, the number of sets included is never fewer than the number of sets output by my function.
I could implement something like this in ScalaCheck. It isn't trivial. The first problem is: letting it generate any old set of sets of integers runs the JVM out of memory super fast. I have to code in length-limits for the sets. The second problem is: statement (6) takes forever to execute for more than a few sets, because they're O(n!). Fourteen input sets, ten billion combinations. Oops. Maybe it's worth making this check if I limit the input size to five.

See how much thought goes into a property-based test? and this is a simple specification! I don't recommend writing these for all your code - only the really important stuff.

The value of this type of testing is that it forces you to think about the code. If any possible input is not supported, that has to be considered and then codified into the test. empty sets? disjoint sets? identical sets?

All kinds of stuff will be tested in one test case. The framework will make a point of testing edge cases, and then it'll randomly generate a hundred possibilities. This saves you from writing all those edge-case tests, which saves repetition in your test code. Extreme thoroughness without repetition.

Property-based tests are best combined with example-based tests. Examples help you start organizing your thoughts, and they're easier for future-you to read and understand when you come back to this code later. Humans think in examples. Programs don't extrapolate. Property-based thinking and property-based testing can bridge between us and the computer. Math, it's a tool.

Also, it's fun to write one test and see "100 assertions passed" in the output.

Saturday, April 13, 2013

Diagramming trick: pointing at text in Omnigraffle

Irrelevant background rant: That very last post I spent half an hour too long making the diagram, because I wanted curved arrows pointing at text. Powerpoint wouldn't let me adjust curvature. Keynote wouldn't put an arrow on the end of a custom shape. So I freaking bought Omnigraffle (I love it at work) which is fabulous for connectors of custom shapes. Plus Omnigraffle has the fabulous features of "hold down T and click to make text" and "hold down C and draw a line to make a connector." But...

The otherwise lovely connectors in Omnigraffle insist on connecting to the edges of my text box.

Goal: get this red arrow to point directly at the black text "override."

Problem: The arrow insists on hooking itself to the side of the text box.
Step 1: Illuminate the problem. In the View menu, select Magnets to show that darn spot on the side of the text box that keeps sucking in my connector end.
Step 2: Fine, Mr. Connector, if you insist on hooking to a magnet, I'll just make my own! Hold down "m" and click on the spot where I want the arrow to point, and I get a new magnet:
Step 3: Move the connector end to the new magnet. Now everyone is happy.
Step 4: Turn off View -> Magnets and admire the result
Hurray! As a bonus, if I move my text box around, the connector will adjust too, and it will point right where I wanted it. Adjusting the font or changing the contents of the box will move the text. When you need to move the magnet, hold down "m" and then click and drag it to the new spot.

This may seem trivial people but this trick is already changing my life! I'm so much happier now that I can point at a piece of text without pain. I never knew how much this misery leaked into every corner of my existence until it was gone.

Scala: when is a val not a val?

TL;DR: before it's initialized.

Do vals in Scala always contain the same value?
Not quite.

I hit this the other day when I used a val inside a def inside a val.

In the superclass:


abstract class GeneralTestCase {
  def createArray = Array(1,2,3,4)
  val defaultInput createArray
// defaultInput = [1,2,3,4]
}


and then in the subclass, customized behavior to make that default input be empty... but it comes out a little too empty.

class EmptyInputTestCase extends GeneralTestCase {
  val emptyArray = Array.empty[Int] 
  override def createArrayemptyArray
// defaultInput = null
}

Why is defaultInput null instead of an empty array? Because of initialization order. The bodies of classes are the constructor, and the order of class construction looks like this:


The superclass is initialized first, just as in Java the superclass constructor is called first. It recognizes the method override for createArray. Unfortunately the subclass's implementation of createArray references a subclass field, and the subclass fields are not initialized yet. That comes after superclass construction.

This is a known rule in Java: don't call non-final methods from the constructor, because subclasses can override them and reference uninitialize fields. The same applies in Scala. It's trickier to spot in Scala because the whole class body is the constructor, and references to vals look the same as calls to defs without argument lists.

Wednesday, April 3, 2013

Dataflow in Ruby


Our job is not to write software. Our job is to turn data into information. That we do that through software is an implementation detail. -- Dan North, ScanDev 2013.

Two things about data these days: there's a lot of it, and a lot of it is crap. How can we get information out of that?

I like to think of my code in terms of a pipeline for data, transforming and working it and summarizing it into information. In order to handle growing amounts of data, we can't process it all at once; we have to handle it piece by piece as it comes in. Each piece gets summarized in all the different ways and then discarded, so we never run out of memory.

With lazy Enumerables (known in other languages as Iterables), it's easy enough to accommodate the massive data, as it deals with one item at a time. However, every time you run a reduce over the output, the entire input is read again. How can we perform multiple summaries over the same flow of data as it comes through?

My goal: process data lazily, with multiple reductions of various subsets, without mutating state. In Ruby, because Ruby is like Play-Doh for programmers.

My code is in a branch of my fp4rd repo (that's Functional Principles for Ruby Development). This post proceeds with an explanation of the example problem and then how the solution works.


If you can make it to Ruby Midwest this weekend or Windy City Rails in September, come hear about functional principles for Ruby developers. This little project is the extreme endpiece of that talk.


Problem statement


In Chapter 3 of the Pickaxe, there's an example of a program that parses book-inventory CSV data like this:

Title,ISBN,Price
To Whom It May Concern,48-2342-182u32,98.56
What Is Your Problem,2938-123883-13,13.99

Of course, being data, it doesn't all look like that. Some of it is completely unparsable:

Title,That Number Under The Barcode,Amount
George of the Jungle,234-34-,99.44

and some of it is parsable but missing information:

Title,ISBN,Price
Your Mother Was a Lizard,,32.99
Your Father Stank of Elderberries,234-2145-ldk-234,

My program will parse these files. It totals the price of all books that are parsable and have a price; it also counts these, and it counts the lines that were read but not totaled.

Crazy-looking solution


Code is here, and here's the meat of it.

pipe = Pipeline::Pipe.new.
  expand(printing.("--- Reading file...",&read_all_lines)).
  through(printing.("1. Converting book",&convert_row_to_book)).
  through(printing.("2. Checking price",&reject_no_price)).
  split(
    invalid: Pipeline::Pipe.new.keeping(->(a){a.invalid?}).count,
    valid: Pipeline::Pipe.new.keeping(printing.("3a. Checking book", ->(a){a.book?})).
      split( count: Pipeline::Pipe.new.count,
             total: Pipeline::Pipe.new.
        through(printing.("3b. Extracting book", ->(a){a.book})).
        through(printing.("4. Pricing",->(a){a.price})).
        answer(Pipeline::Monoid.plus)
      )
  )

result = pipe.flow(ARGV)

totalPrice = result.value(:valid, :total)
validCount = result.value(:valid, :count)
errorCount = result.value(:invalid)

What is it doing?
It's setting up a pipeline, a pipeline with three outlets.
It pushes some data through.
It follows the three routes to get the answer at each end.

It's going like this:

Files from ARGV go through one at a time. They get expanded into lines, which go through a transformation into books. Data is sent through all the routes of each split. Some gets filtered out by "keeping".[1]

Crazy internals


As the pipe is constructed, it builds up a list of functions. Each function is an iteratee.

Let's talk about iterators


Say you have an Enumerable. It represents stuff in a particular order. There are two ways to move through that stuff and do something with it:

External iteration means you control the flow of when to get the next one. Get an external iterator from a Ruby Enumerable by calling .each with no arguments. Then call .next on that whenever you please to get an element. Or a StopIteration exception, when there's no more data. [2]

> e = [1,2,3].next
> e.next
 => 1

Internal iteration means you say what to do with each element, but the Enumerator controls the process of flipping through them. This is the usual pattern in Ruby.

 > [1,2,3].each { |i| puts i }

Two advantages of internal iteration: no mutating state in my code; the Enumerable can perform cleanup after completion, like closing the input file.
One advantage of external iteration: if I want to stop early, I can. If my objective was to find one book that costs more than $5, controlling when to stop could save a lot of time.

To get the best of both, Oleg Kiselyov came up with this idea of iteratees. You hand the Enumerable a function, but that function doesn't just return a transformed value (like map).  Instead, your function returns a message, one of:

  • "That's great man, keep going, and next time do this"
  • "I'm done, and here's the final result"
If you decide "keep going," then included in the message is another function: the next piece of data will get passed in to that one. So the function you give the Enumerable gets executed exactly once, and supplies either a result or another function. Higher-order functions in action!

For added message-passing goodness, your function doesn't always receive data. After the last item in the Enumerable is processed, it sends in End Of File. When you get that, you'd better return a final result.

My iteratees


That's what happens in the pipeline: each piece of the pipe is an iteratee, which can receive a message with data or an EOF, and returns another piece (for next time) or the final result. For instance, look at the end piece count:


  class CountingEndPiece
    include PieceCommon
    def initialize(soFar = 0)
      @soFar = soFar
    end 
    def eof 
      SimpleResult.new(@soFar)
    end 
    def receive msg 
      CountingEndPiece.new(@soFar + 1)
    end 
  end 


At EOF, it always gives a result. For a piece of data, it always gives back another one of itself with an incremented count.[3]

That was an end piece. What about the middle pieces?

They're iteratees too, except they delegate to the next piece. A "through" piece does a transformation and then delegates. A "keeping" function delegates only if the predicate is satisfied, otherwise returning itself again, waiting for the next piece of data.

End Construction


The Pipe uses the builder pattern, and the build is triggered by an end piece: answer, count, or split. Once the end is reached, all the pieces can be hooked to each other, and the inlet returned.

Sorry About the Monoids


I couldn't resist using monoids for answer. Don't worry about them: they're a class that defines both a combining-method and a starting point for the reduce. Adding integers and concatenating strings are different monoids because they have different starting points. The starting points are necessary for the empty data case.

Flow


All the data goes in, and is passed to all the iteratees down the line. When everything has returned a Result (which happens at EOF in the implemented cases), then the pipe is resolved.


result = pipe.flow(ARGV)


Catching the output


The pipeline has multiple paths, so follow them using the symbol-labels. That'll get you to the answer at the end of each pipe.


totalPrice = result.value(:valid, :total)
validCount = result.value(:valid, :count)
errorCount = result.value(:invalid)


Conclusion


There you have it: Iteratees in Ruby, and some sort of dataflow pipeline. This is one way to think about data as a river running through your program without getting washed away. Each piece of incoming data was retained as information: either it was counted, or it was counted and its price totaled.

This is one way to handle a crapton of craptastic data. In Ruby!



-------
[1] I know, I know, "keeping" is filter and "through" is map and "expand" is flatmap. But you know what? These names don't make that much sense in English. They only make sense to people already familiar with functional programming vocabulary and idioms, and that's not my primary audience.

[2] Why does Ruby's external iterator not have some sort of hasNext method? Or is there one and I can't recognize it?

[3] If I were writing this for production I'd totally keep a stateful count and return itself after each message. But that's a mutable-state optimization, and this is an exercise in functional style.

Sunday, March 31, 2013

Passing functions in Ruby: harder than it looks

People say, "Oh, Ruby has functional programming! We pass blocks around all the time!"

I'm sorry to inform you: Ruby blocks are not first-class functions.

Functional programming is so called because we find it useful to pass functions as values. When we do that, we expect that the function can be called and:
  • it'll get its own level on the stack
  • when it returns, that level of the stack will go away
  • this stack is the same one used by the caller.
When a function returns, whoever calls it gets the return value and goes on its merry way. Sounds easy, right?

Yeah, not in Ruby. See, the common case in Ruby is to pass code in as a block, something like this:
format_all_the_things { |a| "format #{a} into a string" }
Then inside the higher-order function, the block of code is called using yield:
def format_all_the_things
   one = yield "thing 1"
   two = yield "thing 2"
   [one, two].join " & "
end

Here, the block of code is an invisible parameter. It's a little like a first-class function, but not really. For one, blocks don't get their own level on the stack. For another, they don't execute in the same stack as the caller. Best I can tell, this "yield" keyword fires up a coroutine, which gets its own stack. Two violations of my expectations of a first-class function

Danger: return

Use the return keyword in a block, and you invite all kinds of trouble. See, return means it's time to knock a level off the stack. Since the block didn't get its own level in the stack, it has to return from the caller. But in this case, since it was started in a coroutine, there's nothing there to go back to -- LocalJumpError.
 > format_all_the_things { |a| return "yay #{a}" }
LocalJumpError: unexpected return
So, return is your enemy in blocks. Also in Procs.
this is even stranger
wrap the whole thing in a lambda and the block doesn't local jump error; it happily returns from the lambda when it hits the first return keyword inside a block. There's some magic going on here. This works for blocks, but not Procs. Can anyone explain this to me?

> lam = -> { format_all_the_things { |a| return "yay #{a}" } }
> lam.call
 => "yay thing 1"
Ruby has a third option: lambdas. These behave much more like the functions that I know and love.
lam = lambda { |a| return "yay #{a}" }
Lambdas do (seem to) get their own level on the stack, and when they return they return from themselves, not their caller.
Unfortunately this still doesn't work with yield.
> format_all_the_things &lam
LocalJumpError: unexpected return
Dangit! I thought I could trust lambdas, but I was wrong.
Turns out even lambdas behave sanely with return only if accessed using .call() instead of yield. This is the friendly way:
def format_all_the_things(&formatter)
   one = formatter.call("thing 1")
   two = formatter.call("thing 2")
   [one, two].join " & "
 end

format_all_the_things &lam
 => "yay thing 1 & yay thing 2" 
This is the right way to pass and use functions as values in Ruby: lambdas and .call().

Next

"Just don't use return!" you may say. Ruby devs are wonderfully disciplined at avoiding the language's less respectable features.

Sometimes the code is clearer if you can exit the function as soon as you know what the return value is. Especially with one of those handy post-statement ifs:
return "REDACTED" if (a.contains_sensitive_information)
Oddly, there is a control statement that appears to do exactly what I want return to do, and that's next. No longer a loop-control statement, it appears in Ruby 2.0 to be everything we could wish return to be.
> format_all_the_things { |a| next "REDACTED" if (a.include?("1")); "yay #{a}" }
 => "REDACTED & yay thing 2" 

More danger: Break

There are some circumstances where you can use break within a block to end Ruby's internal iteration early.
[4,5,-1,7].each { |n| break :invalid if n < 0 }
This works only with blocks, not with Procs (LocalJumpError) or lambdas (ignored). I think it's evil. Code that exercises flow control on its caller is not a function value.

Conclusion

If you're going to use a functional programming style in Ruby, use lambdas and invoke them with .call().

------
This is all in Ruby 2.0.

Here are my experiments in blocks, Procs, and lambdas with return, break, and next.