Saturday, December 27, 2014

Accidental vs Deliberate Context

In all decisions, we bring our context with us. Layers of context, from what we read about that morning to who our heroes were growing up. We don't realize how much context we assume in our communications, and in our code.

One time I taught someone how to make the Baby Vampire face. It involves poking out both corners of my lower lip, so they stick up like poky gums. Very silly. To my surprise, the person couldn't do it. They could only poke one side of the lower lip out at a time.

Turns out, few outside my family can make this face. My mom can do it, my sister can do it, my daughters can do it - so it came as a complete surprise to me when someone couldn't. There is a lip-flexibility that's part of my context, always has been, and I didn't even realize it.

Another time, I worked with a bunch of biologists. Molecular biology is harder than any business domain I've encountered. The biologists talked fluently amongst themselves about phylogenies and BLAST and PTAM and heterology and I'm making this up now. They shared all this context, and it startled them when developers were dumbfounded by the quantity of it.

Shared context is fantastic for communication. The biologists spoke amongst themselves at a higher level than with others. Unshared context, when I don't realize I'm drawing on a piece others don't share, is a disaster for communication. On the other hand, if I can draw on context that others don't have, and I can explain it, then I add a source of information and naming to the team.

In teams, it's tempting to form shared context around coincidental similarities. The shows we watched growing up, the movies we like, the beer we drink. The culture we all grew up in, the culture we are now immersed in. It gives us a feeling of belonging and connection, shared metaphors to communicate in. It's much easier than communicating with someone from a different culture. There, we have no idea how many assumptions we're making, how much unshared context there is.

Building a team around incidental shared context is cheating. It keeps all the worst of context: the assumptions we don't know we're making. It deprives us of the best of unshared context: the stock of models and ideas and values that one person alone can't hold.

Instead, build a deliberate shared context. Like the biologists have: a context around the business domain, the programming language we use, the coding styles and conventions that make the work flow, that make the code comprehensible. Team culture is important; we should understand each other's code through a shared context that's created deliberately.

Eschew incidental shared context by aiming for a diverse team. Consciously create a context that's conducive to the work.

Thursday, December 18, 2014

My First Leiningen Template

Every time I sit down to write a quick piece of code for a blog post, it starts with "lein new." This is amazing and wonderful: it's super fast to set up a clean project. Good practice, good play.[1]

But not fast enough! I usually start with a property-based test, so the first thing I do every time is add test.check to the classpath, and import generators and properties and defspec in the test file. And now that I've got the hang of declaring input and output types with prismatic.schema, I want that everywhere too.

I can't bring myself to do this again - it's time to shave the yak and make my own leiningen template.

The instructions are good, but there are some quirks. Here's how to make your own personal template, bringing your own favorite libraries in every time.

It's less confusing if the template project directory is not exactly the template name, so start with:

  lein new template your-name --to-dir your-name-template
  cd your-name-template

Next, all the files in that directory are boring. Pretty them up if you want, but the meat is down in src/leiningen/new.

In src/leiningen/new/your-name.clj is the code that will create the project when your template is activated. This is where you'll calculate anything you need to include in your template, and render files into the right location. The template template gives you one that's pretty useless, so I dug into leiningen's code to steal and modify the default template's definition. Here's mine:

(defn jessitron
 [name]
 (let [data {:name name
             :sanitized (sanitize name)
             :year (year)}]
  (main/info "Generating fresh project with test.check and schema.")
  (->files data
     ["src/{{sanitized}}/core.clj" (render "core.clj" data)]
     ["project.clj" (render "project.clj" data)]
     ["" (render "" data)]
     ["LICENSE" (render "LICENSE" data)]
     [".gitignore" (render "gitignore" data)]
     ["test/{{sanitized}}/core_test.clj" (render "test.clj" data)])))

As input, we get the name of the project that someone is creating with our template.
The data map contains information available to the templates: that's both the destination file names and the initial file contents. Put whatever you like in here.
Then, set the message that will appear when you use the template.
Finally, there's a vector of destinations, paired with renderings from source templates.

Next, find the template files in src/leiningen/new/your-name/. By default, there's only one. I stole the ones leiningen uses for its default template. They didn't work for me immediately, though: they referenced some data, such as {{namespace}}, that wasn't in the data map. Dunno how that works in real life; I changed them to use {{name}} and other items provided in the data.

When it's time to test, go to the root of your template directory and use the template from there:

lein new your-name shiny-new-project

This feels weird, calling lein new within a project, but it works. Now
cd shiny-new-project
lein test

and check for problems. Delete, change the template, try again.

Once it works, you'll want to use the template outside the template project. To get this to work, first edit project.clj, and remove -SNAPSHOT from the project version.[3] Then

lein install

Done! From now on I can lein new your-name shiny-new-project all day long.

And now that I have it, maybe I'll get back to the post I was trying to write when I refused to add test.check manually one last time.

[1] Please please will somebody make this for sbt? Starting a Scala project is a pain in the arse[2] compared to "lein new," which leans me toward Clojure over Scala for toy projects, and therefore real projects.

[2] and don't say use IntelliJ, it's even more painful there to start a new Scala project.

[3] At least for me, this was necessary. lein install didn't get it into my classpath until I declared it a real (non-snapshot) version.

Friday, December 12, 2014

Learning to program? It's time for Pairing with Bunny!

My sister Rachel wants to learn to program. "Will you teach me?" Sure, of course!

Teaching someone to program turns out to be harder than I thought. It's astounding how many little tricks and big principles I've absorbed over the years. It's way more than can be passed on in a few hours.

Gotta start somewhere, right?

Here it is, then: Pairing with Bunny. In which we record our pairing sessions. In which we start from the beginning, and look dumb, and repeat ourselves, and generally show how a human might learn to program.

So far we have two pieces up on YouTube[1], along with some outtakes and rants that happened. This is unprofessional, it's real, and my sister has a lot of personality. I'm not sanitizing this. It's how it happens.

We're taking programming from the beginning and taking it slow, not getting anything perfect (or even right) from the start. One thing at a time. Can we accomplish anything at this speed? (spoiler: not so far, except some learning.)

If you want a senior programmer's perspective with a complete newbie's questions, these are the videos for you.

More videos are available on the channel.

If this is helpful, comment here, or ping me on twitter, and I'll get my butt in gear at posting more.

[1] I'm thinking about moving them to Vimeo so that they can be available to download. Would that help you?

HDFS Capacity

How much data can our Hadoop instance hold, and how can I make it hold more?

Architectural Background

Hadoop is a lot of things, and one of those is a distributed, abstracted file system. It's called HDFS (for "hadoop distributed file system," maybe), and it has its uses.

HDFS isn't a file system in the interacts-with-OS sense. It's more of a file system on top of file systems: the underlying (normal) file systems each run on one computer, while HDFS spans several computers. Within HDFS, files are divided into blocks; blocks are scattered across multiple machines, usually stored on more than one for redundancy.

There's one NameNode (computer) that knows where everything is, and several core nodes (Amazon's term) that hold and serve data. You can log in to any of these nodes and do ordinary filesystem commands like ls and df, but those reflect the local filesystem, which knows nothing about files in HDFS. The distributed file system is a layer above; to query it, you have to go through hadoop. A whole 'nother file manager, with its own hierarchy of what's where.

Why? The main purpose is: stream one file faster. Several machines can read and process one file at the same time, because parts of the file are scattered across machines. Also, HDFS can back up files to multiple machines. This means there is redundancy in storage, and also in access: if one machine is busy it could read from the other. In the end, we use it at Outpace because it can store files that are too big to put all in one place.

Negatives? HDFS files are write-once or append-only. This sounds great: they're immutable, right? It is great, right up until I need to make a small change, and copy-on-mod means copying hundreds of gigabytes. We don't have the space for that!

How much space do we have?

In our case (using Amazon EMR), all the core nodes are the same, and they all use the local drives (instance stores) to keep HDFS files. In this case, the available space is

number of core nodes * space per node / replication factor.

I can find the number of core nodes and the space on each one, along with the total disk space that HDFS finds available, by logging in to the NameNode (master node, in Amazon terms) and running

hadoop dfsadmin -report 

Here, one uses hadoop as a top-level command, then dfsadmin as a subcommand, and then -report to tell dfsadmin what to do. This seems to be typical of dealing with hadoop.

This prints a summary for the whole cluster, and then details for each node. The summary looks like:

Configured Capacity: 757888122880 (705.84 GB)
Present Capacity: 704301940736 (655.93 GB)
DFS Remaining: 363997749248 (339.00 GB)
DFS Used: 340304191488 (316.93 GB)
DFS Used%: 48.32%

It's evident from 48% Used that I'm going to have problems when I make a copy of my one giant data table. When HDFS is close to full, errors happen.

Here's the trick though: the DFS Remaining number does not reflect how much data I can store. It does not take into account the replication factor. Find that out by running

hadoop fsck /

This prints, among other things, the default replication factor and the typical replication factor. (It can be overridden for a particular file, it seems.) Divide your remaining space by your default replication factor to see how much new information you can store. Then round down generously - because Hadoop stores files in blocks, and any remainder gets a whole block to itself.
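
To make that division concrete, here's a minimal sketch in shell. It uses the DFS Remaining number from the report above. The replication factor of 3 is an assumption for illustration (substitute whatever fsck reports for your cluster), and usable_bytes is a made-up helper name, not a hadoop command.

```shell
#!/bin/sh
# usable_bytes REMAINING REPLICATION
# Rough upper bound on how much new data fits: remaining raw bytes divided
# by the replication factor. Shell integer division rounds down.
usable_bytes() {
    echo $(( $1 / $2 ))
}

remaining=363997749248   # "DFS Remaining" from the dfsadmin report above
replication=3            # assumed value; read the real one from fsck

bytes=$(usable_bytes "$remaining" "$replication")
# 1073741824 = 1024^3, the unit the report's "GB" figures appear to use
echo "$bytes bytes, about $(( bytes / 1073741824 )) GB of new data"
```

Treat the result as a ceiling rather than a promise; the advice above about rounding down generously still applies.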


The hadoop fs subcommand supports many typical unix filesystem commands, except they have a dash in front of them. For instance, if you're wondering where your space is going

hadoop fs -du /

will show you the top-level directories inside HDFS and their accumulated sizes. You can then drill down repeatedly into the large directories (with hadoop fs -du <dir>) to find the big fat files that are eating your disk space.
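
The drill-down goes faster if the biggest entries float to the top. Here's a small sketch of a shell helper for that. The name largest is hypothetical, and it assumes each line of the -du output starts with the size in bytes. The sample lines are made up so the snippet runs without a cluster.

```shell
#!/bin/sh
# largest [N]: read "size path" lines on stdin and print the N largest
# (default 5). Lines whose first field is not a plain number, such as a
# "Found 3 items" header, are skipped.
largest() {
    awk '$1 ~ /^[0-9]+$/' | sort -k1,1nr | head -n "${1:-5}"
}

# Against a real cluster this would be: hadoop fs -du / | largest 3
# Made-up sample lines stand in for the hadoop output here:
printf '%s\n' \
    'Found 3 items' \
    '1024 /tmp' \
    '340000000000 /data' \
    '52428800 /user' | largest 2
```

Feed it the output for each big directory in turn and the fat files show up quickly.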

As with any abstraction, try to make friends with the concepts inside HDFS before doing anything interesting with it. Nodes, blocks, replication factors ... there's more to worry about than with a typical filesystem. Great power, great responsibility, and all that.

Monday, December 8, 2014

Logs are like onions

Or, What underlying implementation is using?

Today I want to change the logging configuration of a Clojure program. Where is that configuration located? Changing the obvious resources/ doesn't seem to change the program's behavior.

The program uses, but that's a wrapper around four different underlying implementations. Each of those implementations has its own ideas about configuration. How can I find out which one it uses?

Add a println to your program[1] to output this:
In my case the output is:

This is clojure logging's first choice of factories. If it can instantiate this, it'll use it. Now I can google slf4j and find that it... is also a facade on top of multiple logging implementations.

Digging into the slf4j source code reveals this trick:
(class (org.slf4j.LoggerFactory/getILoggerFactory)) 
which prints:
so hey! I am using log4j after all! Now why doesn't it pick up resources/
Crawling through the log4j 1.2 (slf4j seems to use this version) source code suggests this[2]:
(org.apache.log4j.helpers.Loader/getResource "")
which gives me
#<URL file:/Users/jessitron/.../resources/>

So hey, I finally have a way to trace where logging configuration comes from! 

In the end, my guess of resources/ was correct. I forgot to rebuild the uberjar that I was running. The uberjar found the properties file in itself:
Bet I'd have realized that a few hours earlier if I were pairing today. And then I wouldn't have made this lovely post.

[1] or run it in the cider REPL in emacs, in your namespace
[2] actually it checks for log4j.xml first; if that's found it'll choose the xml file over the .properties.