Saturday, August 31, 2013

Idempotence in math and computing

"Idempotent" is a big word. It's one of the four-dollar words thrown around by functional programmers, and it confuses people -- including functional programmers. This is because we don't use it consistently.

In math, idempotence describes only unary functions that you can call on their own output. Math-idempotence is, "If you take the absolute value of a number, and then you take the absolute value of that, the result doesn't change on the second (or subsequent) operations." Math.abs is math-idempotent. Math-idempotence only applies to functions of one parameter where the parameter type and return type are the same. Not so useful in programming.

Here's the real definition.
In computing, an idempotent operation is one that has no additional effect if it is called more than once with the same input parameters.[1]
Programming-idempotence is about side effects. It's about stuff that happens to the outside world when you call a function. Idempotence says "If you've called me once, it doesn't matter whether you called me again."

Examples


Not idempotent: Pushing "Place order" creates a new order in the database.
Idempotent: Pushing "Place order" moves order 4567 from 'cart' to 'finalized' status.

Not idempotent: Fetch a row from the database and create an object to represent it
Idempotent: Fetch a row from Hibernate, which returns the previously allocated object for the same row on a second call.

Idempotent: Give me that pizza.
Not idempotent: Make me a pizza.

Idempotent: check a password against the hash in the database
Not idempotent: check a password and increment the bad-login-attempts if it's wrong

Not idempotent: Install MongoDB
Idempotent: Install MongoDB if it isn't already installed

Idempotent: perform a calculation and return the result
Not strictly idempotent: perform a calculation, write to the log file, and return the result

Keep it idempotent


Programming-idempotence: every function either has it or it doesn't. Data-in, data-out functions[2] do not affect the outside world at all, so they are naturally idempotent. If your function does something to the outside world, consider what will happen when it is called twice. Will it blow up? Cause twice as much to happen? Or have the same result as calling it once?

It's a good idea to make side-effecting functions idempotent. Idempotent operations make your forms immune to repeated-submission problems. They make your Chef scripts re-runnable when a download fails on the first attempt. They avoid surprising people.
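
As a sketch in Scala (the OrderDb trait and its methods are made up for illustration): the first version below creates a new order on every call, so a double-submitted form makes two orders; the second moves one specific order to 'finalized', so a repeat call changes nothing.

object Orders {
  trait OrderDb {
    def insertOrder(cartContents: List[String]): Long       // returns the new order's id
    def setStatus(orderId: Long, status: String): Unit
  }

  // Not idempotent: every call creates another order
  def placeOrder(db: OrderDb, cartContents: List[String]): Long =
    db.insertOrder(cartContents)

  // Idempotent: the second (and hundredth) call leaves the world exactly as the first call did
  def finalizeOrder(db: OrderDb, orderId: Long): Unit =
    db.setStatus(orderId, "finalized")
}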


The word idempotence came from math, but its meaning in a programming context is broader and more useful.[3]

------------
[1] http://stackoverflow.com/questions/1077412/what-is-an-idempotent-operation

[2] "data-in, data-out" is a more accessible term for "Referentially transparent and nullipotent." Nullipotent means that the outside world doesn't care whether you called it 0 or 1 or 100 times.

[3] If you want a mapping from the math-idempotency to programming-idempotency, think of it this way:

All functions (in a language other than Haskell) have an invisible parameter: the world around them. When we write function application in our program, we supply the declared parameters; the world around is supplied at runtime. Similarly for output, our program receives an explicit return value, and there is an invisible return value as well: the world after our function exits.

Therefore, consider every function application in code as a partial application, with the world as the single unsupplied parameter. Looked at this way, every function + inputs is a unary operator on the single parameter World, and it returns another World. Calling our function again with the same inputs, does it change that World again?

Math-idempotency:
f(x) = f(f(x))

programming-idempotency:
f(a,b)(world).newWorld = f(a,b)(f(a,b)(world)).newWorld

where the method newWorld extracts the world-after-execution from the function's multiple return values (the explicit one and the implied world).


[4] World is one of the most awkward words in the English language.

Friday, August 30, 2013

Def and Val in Scala constructors

TL;DR - use def all the time, and things just work.

Remember that old adage from Java, "Never call a nonfinal method in your constructor"?
Yeah, that still applies in Scala, and it keeps biting me. It's easier to get wrong in Scala.

Here's a refresher:
When a class is instantiated, the superclass is initialized first. Its fields are initialized and its constructor executed. If that constructor calls a method, and that method is overridden in a subclass, and the subclass's version accesses a field that exists in the subclass, then FAILURE: that subclass field has not been initialized. It is null.

Here's what that looks like in Scala:
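
A sketch of that shape (secondsOfWaiting, longEnough, and OneSpecificTest are the names used in this post; WaitingSpec and the Duration types are my guesses):

import scala.concurrent.duration._

class WaitingSpec {
  val longEnough: Duration = 10.seconds                     // overridden in the subclass
  val secondsOfWaiting: Long = longEnough.toSeconds         // evaluated while WaitingSpec initializes
}

class OneSpecificTest extends WaitingSpec {
  override val longEnough: Duration = 90.seconds            // this field is still null while the superclass runs
}

// new OneSpecificTest()   // NullPointerException: longEnough is null in the superclass initializer
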
The initialization of one val (secondsOfWaiting) uses a val that is overridden in the subclass. Try to instantiate a OneSpecificTest, and you get an NPE.[1]
This doesn't happen in Java because you can't override fields.

One way of fixing this is to use def instead of val on the overridden property. A def in the subclass won't be initialized to null. To use def in the subclass we have to use def in the parent class too. It looks like this:
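
Roughly like this, continuing the same sketch (baseWait is another guessed name):

import scala.concurrent.duration._

class WaitingSpec {
  val baseWait: Duration = 10.seconds                    // initialized before secondsOfWaiting
  def longEnough: Duration = baseWait                    // a def now, in the parent too
  val secondsOfWaiting: Long = longEnough.toSeconds      // still evaluated during initialization
}

class OneSpecificTest extends WaitingSpec {
  override def longEnough: Duration = baseWait * 9       // reads only a superclass field that is already set
}
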
This works because the def in the subclass references only a field in the superclass that has already been initialized. That's coincidence. Reorder the lines in the superclass, and the NPE is back:
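
(Same sketch, superclass lines reordered; OneSpecificTest is unchanged:)

class WaitingSpec {
  val secondsOfWaiting: Long = longEnough.toSeconds      // now evaluated first
  val baseWait: Duration = 10.seconds                    // too late: it was still null above
  def longEnough: Duration = baseWait
}

// new OneSpecificTest()   // NPE again: the subclass's longEnough reads baseWait, which is still null
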
If the subclass referenced its own val, that would also NPE. The only safe vals are defined in the superclass before any initialization references the subclass method. In other words, not safe at all!
This doesn't happen much in Java because we declare the fields at the top, and the constructors are contained. In Java, methods and fields are very different; we notice calling a method, and notice its modifiers.

Scala deliberately blurs def and val. Override a def with a val, or switch between a val and a def, and you don't affect the interface. Scala says, don't care whether a property is a def or a val under the hood. Yet the difference is significant when you implement a class.
In Java, I don't write a block to initialize a field. In Scala, I do; it's another way def and val are easy to interchange. Easy, but not always wise.

The right solution to this initialization problem: make secondsOfWaiting a def instead of a val. Access to longEnough takes place after construction.

lazy val secondsOfWaiting works too; that also delays evaluation.
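
In the sketch, either of those looks like:

import scala.concurrent.duration._

class WaitingSpec {
  val longEnough: Duration = 10.seconds                      // still a val, still overridden in the subclass
  def secondsOfWaiting: Long = longEnough.toSeconds          // a def: nothing evaluated at construction time
  // lazy val secondsOfWaiting: Long = longEnough.toSeconds  // also fine; first access comes after construction
}

class OneSpecificTest extends WaitingSpec {
  override val longEnough: Duration = 90.seconds
}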

Initialization order is complicated. It looks easy when everything else is stripped out, but when the class definition is longer, it is not obvious which blocks of code are running at construction and which later. I don't want to think about this!

Can we make it simpler?

Universal solution: use def for everything.[2] When everything is an operation, initialization is not an issue. Order is not an issue. We are free from worry about what happens first.

From now on I'm writing all my properties with def, or function-type vals.
I'll let you know how that goes.

----------------
[1] In Scalatest, the NPE on initialization gets wrapped in an error about being unable to load a test that was found - it's a little cryptic. Next time I see it at work I'll paste the exact one in here and then Google will like me.

[2] But what about performance?? I'm ignoring performance. Premature optimization is the root of all evil, and all optimization is premature until you prove that it isn't.
If you want memoization, lazy val is as good as def. That adds some overhead compared to a regular val, but less than a def if it is called many times. But still, this idea of making everything an operation on the JVM is probably a disaster for performance. That doesn't make it wrong, conceptually.

Tuesday, August 27, 2013

Program relativity: F = d*c^2

Your home and your bank account: one of these is an asset, and one is money. Enter the real estate market, and they become interchangeable, with substantial effort.

Matter and energy: they're the same at some level. If it were easy to go back and forth, we'd all move around at the speed of light.

Code and data: one of these we expect to vary with every run of our program. Perhaps if we treat them as the same, if we pass code around and massage it the way we do data, then our programs can change faster.

If you're like "Code is data, OK yeah, it's bytes in a text file," then you're selling your house brick by brick. We pass code around at a conceptual level in functional programming. We work with operations; values feel like the special case.

Here's a tiny example I hit at work today.

Say we have a database, a list of groups, and a countUsers method on the GroupOperations object. We could write that method to operate on the database, using the list of groups, to return the count of all the currently-online users in those groups.

def countUsers(groups: Set[Group], db: DB) : Int

Now say I'm starting up a StatusUpdateActor, and I want to give it a way to count the users. The easiest, crappiest way is to pass in all three objects:

new StatusUpdateActor(db, groups, groupOperations)

A cleaner way is to create a closure and pass that in:

val countUsersOperation: () => Int = () => groupOperations.countUsers(groups, db)
new StatusUpdateActor(countUsersOperation)

It can be cleaner still. Consider - what is the GroupOperations class's job? It knows how to do stuff with groups. Knowing how to do stuff doesn't mean you're the one to do it all the time. What if countUsers returned an operation instead of performing an operation?

def countUsers(groups: Set[Group], db: DB) : () => Int

Then we wouldn't have to wrap it in a closure to pass it in to StatusUpdateActor. It's right there, an operation. If we want to run it right now, we can. If we want to pass it around, we can.
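
For instance (a sketch, assuming StatusUpdateActor also accepts a () => Int, as in the closure version above):

val countNow: Int = groupOperations.countUsers(groups, db)()       // apply it right away
new StatusUpdateActor(groupOperations.countUsers(groups, db))      // or hand the unapplied operation over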

Better yet, what if I haven't decided what db to use? Why should I have to provide a DB before getting this operation?

def countUsers(groups: Set[Group]): DB => Int

Now I can get an operation on a DB that returns an Int. I can give it the db right away, or maybe StatusUpdateActor gets a db from a different source, and we want to give it the operation directly.

new StatusUpdateActor( dbSource, groupOperations.countUsers(groups) )

Moving from OO toward FP, I see application no longer as the whole point of a function, and more as the final step in a series of manipulations. Why force a caller to wrap your method in a closure sometimes? Return the operation to begin with, and let the caller apply it when they darn well please.

If you like to use efficiency as an excuse, then perhaps countUsers performs a set operation on the groups and constructs a statement before touching the db. Do all this once and return an operation that uses the statement value once a DB is passed in.
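
Here's a sketch of that version; Group, DB, and queryCount are made-up stand-ins:

case class Group(id: Long)
trait DB { def queryCount(statement: String): Int }

class GroupOperations {
  def countUsers(groups: Set[Group]): DB => Int = {
    // do the set manipulation and build the statement once, up front
    val statement =
      s"select count(*) from users where active = true and group_id in (${groups.map(_.id).mkString(",")})"
    // return an operation that needs only a DB to run
    db => db.queryCount(statement)
  }
}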

When you perform an operation, you give a man a fish. When you return an operation, you teach him how to fish.

This is one tiny example of focusing on operations. It treats code as data at a high level: here's a function, you can do a few things to it, like supply some or all of its arguments or give it to someone who might call it later. It's moving the mass of our program around at the speed of data. It's a liquid market for our program assets.

People have predicted for the last several years that functional programming is a fad -- sorry dudes, this is no speculative bubble. Buy in, because prices are only going up.

----------------------------------
Would it help if I showed code for the methods declared above? 
And declarations for the called ones? 
I thought it might be more distracting than useful.

Monday, August 26, 2013

Special Relativity finally makes sense

Preface: I studied physics in college, yet there were parts of it that didn't click with me. Special relativity with all its "a flashlight on a train" examples, that's counterintuitive and hard. Yesterday in Surfaces & Essences, the authors explain Einstein's thought process when he came up with it, and it makes sense this way. It's also an interesting example about how genius can happen when we let go of what we know.

You know how when you're on a plane, travelling smoothly at 600 mph, you can pour water the same way as when you're standing still? If you throw a ball straight up, it falls back down into your hand.

This is called "Galilean Relativity." It says that while you're traveling at a constant velocity, mechanical experiments behave the same as at any other velocity (including standing still). It's only changes in velocity -- speeding up, slowing down, bumps and curves -- that affect the motion of objects you have with you. This means there is no mechanical experiment you can perform to find out whether you're flying at 600 mph or sitting on the tarmac. The water and the ball behave the same.

Einstein said, why should electromagnetic experiments be any different? Why should a lens, a magnet, or an electron care about your velocity? Your computer works the same in the air as on the ground, and there are a lot of electrons moving around in there.

The behavior of light passing through a lens or electricity moving through your processor is closely related to the constant speed of light. For your glasses and your computer to behave exactly the same in a moving plane, the speed of light inside the plane must be the same as the speed of light on the ground. The electrons in your computer move the same way, relative to you, at 600 mph or at a stop.

Speed is distance divided by time. The distance that water traveled between the pitcher and the cup looks different to you and to the watchtower, since the watchtower sees the plane moving horizontally. Same with the light from your monitor: it travels a different distance from the tower's perspective than yours. If the distances are different, how can the speed be the same? Speed can be the same if time is different too.

Therefore time is stretched or squashed depending on velocity, so that the speed of light is the same from every perspective. No experiment can tell the difference, and electrons in your computer don't care how fast you're going.

Next time you sip a drink on your way across the Atlantic, appreciate time dilation. You may not notice the fraction of a second this trip adds to your life, but reliable electronics are important at every speed.

In this story, Einstein takes one belief about the world (Galilean Relativity) and broadens it. In doing so, he narrows another belief about the world (consistent time). The result is a new, self-consistent worldview. This worldview has proven useful.
It pays to be specific about how we know something is true, and the range of experience we can be sure about.

Monday, August 19, 2013

Why Aren't There More Women Programmers

When we feel like we're good at something, that's when we can really learn it, sink our teeth in and love it and do it until we're experts.

It is a great motivator, this feeling of confidence. This belief that we can accomplish what we want to do is called self-efficacy. There are four sources of self-efficacy at a particular task (in order of strength):[1]
  1. doing it
  2. seeing people like me do it
  3. social persuasion
  4. your body
Why aren't there more women programmers? Because many women don't feel like they can do it. Therefore they don't try it, or don't persevere. These sources of self-efficacy explain why women (in aggregate) are less likely to program than men.

1. doing it: If you try something and have success, this is the best source of a feeling of self-efficacy. In my generation, more boys than girls started programming at an early age.

2. seeing people like me do it: If my mom can do it, so can I.
People choose careers by picturing themselves in that job, and the picture is based on people we know. If we can't picture ourselves in a career, we don't consider it. The gender dichotomy is strong in our culture, a big part of whom we consider "like me." A boy can picture himself as a programmer. When a girl looks around, she doesn't see people like her coding, attending conferences, speaking, blogging, contributing to open source. Even women developers: looking up in the hierarchy, they see themselves in management, analysis, QA, BI, maybe DBA. Not system administration. Not architecture.

3. social persuasion: Do my friends approve of it?
Here, we look not at the culture of programming, but the culture of women. When I go to Kindergarten Moms Night and someone asks what I do, "computer programming" usually ends the conversation. I don't fit in. Growing up and being grown up, if you enjoy the company of other women, the social persuasion is against development.

4. your body: Do you get a good feeling in your stomach? I don't know about this one. Is it different for men and women?

Women are just as capable of programming as men, but (in aggregate) they don't feel as capable. If we can change that, then we can change the ratio.

#1 is the most direct route, and many in the community are working on introducing programming to young people, especially young women. Yay for them!

#2 is something else we can change. Increase the visibility of women who are already developers, especially those at the highest level. I want to look up, up on stage or up the decisionmaking structure, and see people like me. Conference organizers who go out of your way to elicit proposals from women, thank you. You're helping.

#3 comes from the larger social culture, not the culture of programmers. I can't be a typical mom and a community-involved developer. This is not something the programming community can change. Personally, I see #3 as the most intractable obstacle for women.

As a community, #1 and #2 are the ones we can do something about. And hey, they're at the top of the list! If we keep working at this, we'll reach a critical mass. Once programming is (say) 33% women, then #3 will fix itself. Without intervention, the social pressure and lack of role models combine to attract and retain fewer and fewer women. With work, we can turn that spiral around.

--------------------------

[1] At PyCon Canada, Mel Chua gave a great talk on learning. This list is hers.

Sunday, August 18, 2013

Finding and removing large files in git history

Sometimes it behooves one to erase commits of large binaries from history. Binaries don't play nicely in git. They slow it down. This post will help you find them and remove them like they were never there.

In svn, people only download the latest version of the repository, so if you commit a large binary, you can delete it in a future commit and the rest of your team is unharmed. In git, everyone who clones a repo gets the entire history, so they're stuck downloading every version of every binary ever committed. Yuck!

Therefore, the svn-to-git conversion is a good time to delete all the large binaries from history. Do this before anyone has cloned the repository, before you push all the commits to a shared place like bitbucket or github.

Caution: Never alter commits that are in your repo and someone else's, if you ever plan to talk to their repo again.

Step 1: Identify the large files. We need to search through all of the history to find the files that are good candidates for deletion. As far as I can tell, this is nontrivial, so here is a complicated command that lists the sum of the sizes of all revisions of files that are over a million bytes. Run it on a mac.

git rev-list master | while read rev; do git ls-tree -lr $rev | cut -c54- | grep -v '^ '; done | sort -u | perl -e '
  while (<>) {
    chomp;
    @stuff=split("\t");
    $sums{$stuff[1]} += $stuff[0];
  }
  print "$sums{$_} $_\n" for (keys %sums);
' | sort -rn >> large_files.txt

Please replace master with a list of all branches you care about.
This command says: List all commits in the history of these branches. For each one, list all the files; descend into directories recursively; include the size of the file. Cut out everything before the size of the file (which starts at character 54). Anything that starts with space is under a million bytes, so skip it. Now, choose only the unique lines; that's approximately the unique large revisions. Sum the sizes for each filename, and output these biggest-first. Store the output in a file.

If this works, large_files.txt will look something like mine:

186028032 AccessibilityNative/WindowsAccessibleHandler/WindowsAccessibleHandler.sdf
94973848 quorum/installers/windows/jdk-7u21-windows-x64.exe
93300120 quorum/installers/windows/jdk-7u21-windows-i586.exe
84144520 quorum/installers/windows/jdk-7-windows-x64.exe
83345288 quorum/installers/windows/jdk-7-windows-i586.exe
57712115 quorum/Run/Default.jar

Yeah, let's not retain multiple versions of the jdk in our repository.

Step 2: Decide which large files to keep. For any file you want to keep in the history, delete its line from large_files.txt.

Step 3: Remove them like they were never there. This is the fun part. If large_files.txt is still in the same format as before, do this:

git filter-branch --tree-filter 'rm -rf `cat /full/path/to/large_files.txt | cut -d " " -f 2` ' --prune-empty <BRANCHES>

This says: Create an alternate universe with a history that looks like <BRANCHES>, except for each commit, take its files and remove everything in large_files.txt (which contains the filename in the second space-delimited field). Drop any commits which only affected files that don't exist anymore. Point <BRANCHES> at this new version of history.

Whew. If this worked, then when you push to a brand-new repository for sharing, those binaries won't go. Not in the current revision, not in any history. It is like they were never there.

-------------------------

OH GOD WHAT DID I DO: If you change your mind or mess up, you can undo this operation.
First, look at the history of where your branch has pointed recently:
git reflog <BRANCH>

Here's my output:
→ git reflog bbm2
e9429a7 bbm2@{0}: filter-branch: rewrite
08d7da5 bbm2@{1}: branch: Created from HEAD

The top line is the filter-branch I just did. The line before that lists the tip of the branch before that crazy filter operation.
I can do git log 08d7da5 to check on it, and git ls-tree 08d7da5 to see what's in it. (If you want all the files to be listed, then git ls-tree -r 08d7da5.)

When I'm sure I want to undo the filter-branch, then:
git checkout <BRANCH>
git reset <BRANCH>@{1}

will put the branch riiiight back where it was. If you don't like the weird @{1} notation, you can use the specific commit name instead, and tell the branch exactly where you want it to be.

It's important to feel safe to experiment. In git, as long as it was ever committed in the last 30 days, you won't lose it.

Saturday, August 17, 2013

Converting from svn to git

Moving from svn to git, git-svn is your best friend. However, it is a recent best friend and doesn't understand you some days.

I'm working on moving a repository from sourceforge (in svn) to bitbucket (in git). Theoretically, I should be able to clone the repo into git locally, then push all the history up to bitbucket. That would be too easy. It's more like this:

1) Build user translations. Create a text file (call it users.txt) that looks like
bobberdude = Bobber Dude <bobberdude@gmail.com>
coleslaw = Cole Slaw <coleslaw@gmail.com>
... for each of the committers to your svn repo. [1]

2) Download from svn to git locally. Theoretically this could be "git svn clone <repo url>". But with this friend, ya gotta be more specific. Tell it to do things right.

git svn clone --stdlayout --prefix 'svn/' -A users.txt https://svn.code.sf.net/p/my_project/code/ my_project

git svn clone says "Copy this svn repo with all its history into git"
--stdlayout says "this repo has a trunk, a directory of branches, and a directory of tags. Get all those."
--prefix 'svn/' says "create the branches and tags with names like svn/trunk and svn/my_branch and svn/tags/my_tag" This is how they'd normally look if we got them from a git remote. By default, git svn doesn't prefix them, which is confusing.
-A users.txt says "Use these username translations to set author/committer name and email in commits"
The URL is where svn lives.
my_project is a directory where I want the repo to be. Otherwise git-svn will create one with the same name as the last part of the repo's URL.

This is going to bring down all the branches and tags, but it doesn't do everything we want. It creates the tags as git branches, and it doesn't make local branches to link up to these remote ones.

If this giant download gets interrupted:
Try going into the repository directory and git svn fetch. This should pick up where it left off.

Caution: 
If you missed any users, it'll stop and tell you about the missing user. Add that person to the file, and then continue with git svn fetch, or by re-running the clone command. It should pick up where it left off.
Caution: 
Branches with spaces in their names can throw it off. In my experience (git 1.8.3.4), this happened after resuming a download after the "you forgot this user" error. I deleted that repo and started over from the beginning with a fuller user file, and it worked then.

3) Create git branches to match all the remote branches. If there's only a few, you can type
git branch my_branch svn/my_branch
for each of them.

If there's a bunch, here is one giant command:

git for-each-ref refs/remotes/svn --format="%(refname:short)" | sed 's#svn/##' | grep -v '^tags' | while read aBranch; do git branch $aBranch svn/$aBranch; done

This says: Take the name of each remote reference that starts with svn, and give me the reference name. Strip svn/ from it. Skip any that starts with tags. For each line, call the contents of the line $aBranch, and run this command at the shell: create a local branch called $aBranch that tracks svn/$aBranch. 

This happened to create a trunk branch that's a duplicate of master, so delete it:
git branch -d trunk 

Whew, that was a lot of magic to do something that sounds simple in English.

4) Create git tags for all of the svn tags. Git-svn brings the svn tags down as if they were branches with a 'tags/' prefix. I don't know why, but I do know how to fix it.
If there's only a few tags, type this for each:
git tag my_tag svn/tags/my_tag

If there are several, here's a hairy shortcut:
git for-each-ref refs/remotes/svn/tags --format="%(refname:short)" | sed 's#svn/tags/##' | while read aTag; do git tag $aTag svn/tags/$aTag; done

This says: Take the name of each remote reference that starts with svn/tags, and give me the reference name. Strip svn/tags/ from it. For each line, call the contents of the line $aTag, and run this command at the shell: create a local tag called $aTag that points at svn/tags/$aTag.

5) Hurray! We have all the info we need locally. Now we can do warm-fuzzy git operations.
If your goal is to push this to bitbucket, go there and create a brand-new repo.
Come back to your shiny new local git repo, and tell it about bitbucket:

git remote add origin https://new-bitbucket-repo-url 

This says "Yo, learn about this place called 'origin.' You can talk to it here."

Then shove everything up:

git push -u --all origin
git push --tags origin

That says, "Push all branches and tags up to origin, and then make my branches track that." You can leave off the -u option if you're planning to re-clone fresh for your own work, which is probably a good idea.

Done. Now I can put git-svn on the back burner and hang out with more fun friends, like rebase and reflog.

-----------------------
[1] How to get the list of all committers?
In svn, I don't know. (If you do, please comment)
If you've already cloned the repo into git without supplying a user translation, then do this (mac or linux):
git log --all --format="%aE" | sort -u

git log says "list all the commits"
--all says "log all the branches and tags you know about"
--format="%aE" says "print only the author email"
| (pipe) is linuxy for "send the output to"
sort -u is linuxy for "sort all the lines, removing duplicates."

Sunday, August 4, 2013

A trick for deterministic testing of random behavior

When code doesn't behave the same way every time, it's tough to unit test it. Sometimes we want a random element in the code. For instance, maybe we want an error report now and then when a certain problem occurs, but every time would be too much. How can we test that something occurs 10% of the time?

Code that uses random number generation to determine behavior will never be perfectly predictable. We can run a method many times and expect to get close to the desired percentage, but not always. When tests fail occasionally for legitimate reasons, then you get flickers in the continuous integration build, and then people start ignoring build failures. "Oh, that one just fails sometimes. No big deal." When you hear "The build is red sometimes, no big deal," that's a red flag right there. Let's avoid that.

Code that is random in production should not be random in tests. One way to avoid this is to pass in a seed for random number generation. This gets you the same random number generation each run, but it's cryptic ("Why should I get exactly 36% response from this test?") and brittle to refactoring. If the order of the checks in the tested code changes, the test results for that seed will change.

Here's one way to make the tests nice.

Step 1. Pass in a source of random numbers, defaulting to really-random. (In Scala this is easiest with by-name parameters; in another language, pass a function.)

A simple class whose method returns true 40% of the time:

import scala.util.Random

class Sometimes(rand: => Double = Random.nextDouble) {
  def shouldI: Boolean = rand < 0.4
}

Step 2. In the test, create a sequence that contains numbers evenly distributed from 0 to 1, in a random order:

 def evenDistribution(n: Int) = Random.shuffle(Range(0,n).map(_.toDouble).map(_/n))

Step 3. Get an iterator over that sequence, and use its next() function to produce the random number. In Scala, notSoRandom.next() is evaluated every time the rand parameter is accessed.

 val notSoRandom = evenDistribution(100).iterator
 val sometimes = new Sometimes(notSoRandom.next())

Step 4. Be sure the test evaluates the random generator until the sequence is exhausted. Check that the random behavior occurred the expected number of times.

 def results = Stream.continually(sometimes.shouldI).take(100)
 results.count(t => t) should equal(40)

This achieves randomness in the order of the triggers, while guaranteeing the aggregate frequency that we want to test. I like that it expresses what I want to test ("it returns true 40% of the time") without specifying which forty of the hundred calls returned true.
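
Putting the pieces together, the whole thing might look something like this (a sketch, assuming ScalaTest's FlatSpec with Matchers; the real code is in the repo linked below):

import org.scalatest.{FlatSpec, Matchers}
import scala.util.Random

class Sometimes(rand: => Double = Random.nextDouble) {
  def shouldI: Boolean = rand < 0.4
}

class SometimesSpec extends FlatSpec with Matchers {

  def evenDistribution(n: Int) =
    Random.shuffle(Range(0, n).map(_.toDouble).map(_ / n))

  "Sometimes" should "say yes 40% of the time" in {
    val notSoRandom = evenDistribution(100).iterator
    val sometimes = new Sometimes(notSoRandom.next())

    val results = Stream.continually(sometimes.shouldI).take(100)
    results.count(t => t) should equal(40)
  }
}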

UPDATE: This code is available, along with a function that'll give you an evenly distributed sequence without knowing the length ahead of time, here: https://github.com/jessitron/NotTooRandom
-----
Full code: test and class