Thursday, December 18, 2014

My First Leiningen Template

Every time I sit down to write a quick piece of code for a blog post, it starts with "lein new." This is amazing and wonderful: it's super fast to set up a clean project. Good practice, good play.[1]

But not fast enough! I usually start with a property-based test, so the first thing I do every time is add test.check to the classpath, and import generators and properties and defspec in the test file. And now that I've got the hang of declaring input and output types with prismatic.schema, I want that everywhere too.

I can't bring myself to do this again - it's time to shave the yak and make my own leiningen template.

The instructions are good, but there are some quirks. Here's how to make your own personal template, bringing your own favorite libraries in every time.

It's less confusing if the template project directory is not exactly the template name, so start with:

  lein new template your-name --to-dir your-name-template
  cd your-name-template

Next, all the files in that directory are boring. Pretty them up if you want, but the meat is down in src/leiningen/new.

In src/leiningen/new/your-name.clj is the code that will create the project when your template is activated. This is where you'll calculate anything you need to include in your template, and render files into the right location. The template template gives you one that's pretty useless, so I dug into leiningen's code to steal and modify the default template's definition. Here's mine:

(defn jessitron
  [name]
  (let [data {:name name
              :sanitized (sanitize name)
              :year (year)}]
    (main/info "Generating fresh project with test.check and schema.")
    (->files data
       ["src/{{sanitized}}/core.clj" (render "core.clj" data)]
       ["project.clj" (render "project.clj" data)]
       ["README.md" (render "README.md" data)]
       ["LICENSE" (render "LICENSE" data)]
       [".gitignore" (render "gitignore" data)]
       ["test/{{sanitized}}/core_test.clj" (render "test.clj" data)])))

As input, we get the name of the project that someone is creating with our template.
The data map contains information available to the templates: that's both the destination file names and the initial file contents. Put whatever you like in here.
Then, set the message that will appear when you use the template.
Finally, there's a vector of destinations, paired with renderings from source templates.

Next, find the template files in src/leiningen/new/your-name/. By default, there's only one. I stole the ones leiningen uses for the default template. They didn't work for me immediately, though: they referenced some data, such as {{namespace}}, that wasn't in the data map. Dunno how that works in real life; I changed them to use {{name}} and other items provided in the data.
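For a feel of what these look like, here's a sketch of a test-file template that bakes test.check in. The contents are hypothetical (this isn't my actual src/leiningen/new/your-name/test.clj); the {{sanitized}} placeholder gets filled in from the data map at render time:

```clojure
;; src/leiningen/new/your-name/test.clj -- a sketch, not the real file.
;; {{sanitized}} is replaced with the project's sanitized name when rendered.
(ns {{sanitized}}.core-test
  (:require [clojure.test :refer :all]
            [clojure.test.check.clojure-test :refer [defspec]]
            [clojure.test.check.generators :as gen]
            [clojure.test.check.properties :as prop]))

;; A starter property, so every new project begins with a property-based test.
(defspec reverse-preserves-count 100
  (prop/for-all [v (gen/vector gen/int)]
    (= (count v) (count (reverse v)))))
```

Every new project generated from the template starts with this file, ready to run under lein test.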

When it's time to test, you have two choices. The first: go to the root of your template directory, and use it right there.

lein new your-name shiny-new-project

This feels weird, calling lein new within a project, but it works. Now
cd shiny-new-project
lein test

and check for problems. Delete, change the template, try again.

Once it works, you'll want to use the template outside the template project. To get this to work, first edit project.clj, and remove -SNAPSHOT from the project version.[3] Then

lein install

Done! From now on I can lein new your-name shiny-new-project all day long.

And now that I have it, maybe I'll get back to the post I was trying to write when I refused to add test.check manually one last time.

[1] Please please will somebody make this for sbt? Starting a Scala project is a pain in the arse[2] compared to "lein new," which leans me toward Clojure over Scala for toy projects, and therefore real projects.

[2] and don't say use IntelliJ, it's even more painful there to start a new Scala project.

[3] At least for me, this was necessary. lein install didn't get it into my classpath until I declared it a real (non-snapshot) version.

Friday, December 12, 2014

Learning to program? It's time for Pairing with Bunny!

My sister Rachel wants to learn to program. "Will you teach me?" Sure, of course!

Teaching someone to program turns out to be harder than I thought. It's astounding how many little tricks and big principles I've absorbed over the years. It's way more than can be passed on in a few years.

Gotta start somewhere, right?

Here it is, then: Pairing with Bunny. In which we record our pairing sessions. In which we start from the beginning, and look dumb, and repeat ourselves, and generally show how a human might learn to program.

So far we have two pieces up on YouTube[1], along with some outtakes and rants that happened. This is unprofessional, it's real, and my sister has a lot of personality. I'm not sanitizing this. It's how it happens.

We're taking programming from the beginning and taking it slow, not getting anything perfect (or even right) from the start. One thing at a time. Can we accomplish anything at this speed? (spoiler: not so far, except some learning.)

If you want a senior programmer's perspective with a complete newbie's questions, these are the videos for you.

More videos are available on the channel.

If this is helpful, comment here, or ping me on twitter, and I'll get my butt in gear at posting more.

[1] I'm thinking about moving them to Vimeo so that they can be available to download. Would that help you?

HDFS Capacity

How much data can our Hadoop instance hold, and how can I make it hold more?

Architectural Background

Hadoop is a lot of things, and one of those is a distributed, abstracted file system. It's called HDFS (for "hadoop distributed file system," maybe), and it has its uses.

HDFS isn't a file system in the interacts-with-OS sense. It's more of a file system on top of file systems: the underlying (normal) file systems each run on one computer, while HDFS spans several computers. Within HDFS, files are divided into blocks; blocks are scattered across multiple machines, usually stored on more than one for redundancy.

There's one NameNode (computer) that knows where everything is, and several core nodes (Amazon's term) that hold and serve data. You can log in to any of these nodes and run ordinary filesystem commands like ls and df, but those reflect only the local filesystem; they know nothing about files in HDFS. The distributed file system is a layer above; to query it, you have to go through hadoop. A whole 'nother file manager, with its own hierarchy of what's where.

Why? The main purpose is: stream one file faster. Several machines can read and process one file at the same time, because parts of the file are scattered across machines. Also, HDFS can back up files to multiple machines. This means there is redundancy in storage, and also in access: if one machine is busy, a reader can go to another. In the end, we use it at Outpace because it can store files that are too big to put all in one place.

Negatives? HDFS files are write-once or append-only. This sounds great: they're immutable, right? until I do need to make a small change, and copy-on-mod means copying hundreds of gigabytes. We don't have the space for that!

How much space do we have?

In our case (using Amazon EMR), all the core nodes are the same, and they all use the local drives (instance stores) to keep HDFS files. In this case, the available space is

number of core nodes * space per node / replication factor.

I can find the number of core nodes and the space on each one, along with the total disk space that HDFS finds available, by logging in to the NameNode (master node, in Amazon terms) and running

hadoop dfsadmin -report 

Here, one uses hadoop as a top-level command, then dfsadmin as a subcommand, and then -report to tell dfsadmin what to do. This seems to be typical of dealing with hadoop.

This prints a summary for the whole cluster, and then details for each node. The summary looks like:

Configured Capacity: 757888122880 (705.84 GB)
Present Capacity: 704301940736 (655.93 GB)
DFS Remaining: 363997749248 (339.00 GB)
DFS Used: 340304191488 (316.93 GB)
DFS Used%: 48.32%

It's evident from 48% Used that I'm going to have problems when I make a copy of my one giant data table. When HDFS is close to full, errors happen.

Here's the trick though: the DFS Remaining number does not reflect how much data I can store. It does not take into account the replication factor. Find that out by running

hadoop fsck /

This prints, among other things, the default replication factor and the typical replication factor. (It can be overridden for a particular file, it seems.) Divide your remaining space by your default replication factor to see how much new information you can store. Then round down generously - because Hadoop stores files in blocks, and any remainder gets a whole block to itself.
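To make that concrete, here's the arithmetic in Clojure, using the DFS Remaining figure from the report above and an assumed replication factor of 3 (check your fsck output for the real value):

```clojure
;; Usable new-data capacity ~= remaining DFS space / replication factor.
;; The replication factor of 3 is an assumption -- `hadoop fsck /` reports yours.
(let [dfs-remaining-gb 339.00
      replication-factor 3]
  (/ dfs-remaining-gb replication-factor))
;; => 113.0 -- about 113 GB of new data at most, before rounding down for blocks
```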


The hadoop fs subcommand supports many typical unix filesystem commands, except they have a dash in front of them. For instance, if you're wondering where your space is going,

hadoop fs -du /

will show you the top-level directories inside HDFS and their accumulated sizes. You can then drill down repeatedly into the large directories (with hadoop fs -du <dir>) to find the big fat files that are eating your disk space.

As with any abstraction, try to make friends with the concepts inside HDFS before doing anything interesting with it. Nodes, blocks, replication factors ... there's more to worry about than with a typical filesystem. Great power, great responsibility, and all that.

Monday, December 8, 2014

Logs are like onions

Or, what underlying implementation is my logging library using?

Today I want to change the logging configuration of a Clojure program. Where is that configuration located? Changing the obvious properties file in resources/ doesn't seem to change the program's behavior.

The program uses clojure.tools.logging, but that's a wrapper around four different underlying implementations. Each of those implementations has its own ideas about configuration. How can I find out which one it uses?

Add a println to your program[1] that prints the logger factory in use. In my case, the output names the slf4j factory. That's clojure logging's first choice of factories: if it can instantiate it, it'll use it. Now I can google slf4j and find that it... is also a facade on top of multiple logging implementations.
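One way to ask clojure.tools.logging which factory it chose, sketched here (find-factory lives in the clojure.tools.logging.impl namespace; the exact output depends on what's on your classpath):

```clojure
;; Ask tools.logging which logger factory it found on the classpath.
(require '[clojure.tools.logging.impl :as impl])

(println (class (impl/find-factory)))
;; prints the class of the chosen factory -- for me, the slf4j one
```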

Digging into the slf4j source code reveals this trick:
(class (org.slf4j.LoggerFactory/getILoggerFactory)) 
which prints the bound logger-factory class. In my case it's the log4j adapter, so hey! I am using log4j after all! Now why doesn't it pick up the properties file in resources/?
Crawling through the log4j 1.2 (slf4j seems to use this version) source code suggests this[2]:
(org.apache.log4j.helpers.Loader/getResource "log4j.properties")
which gives me
#<URL file:/Users/jessitron/.../resources/>

So hey, I finally have a way to trace where logging configuration comes from! 

In the end, my guess of resources/ was correct. I forgot to rebuild the uberjar that I was running, and the uberjar found the properties file inside itself.
Bet I'd have realized that a few hours earlier if I were pairing today. And then I wouldn't have made this lovely post.

[1] or run it in the cider REPL in emacs, in your namespace
[2] actually it checks for log4j.xml first; if that's found it'll choose the xml file over the .properties.

Friday, November 28, 2014

A victory for abstraction, re-use, and small libraries

The other day at Outpace, while breaking some coupling, Eli and I decided to retain some information from one run of our program to another. We need to bookmark how far we read in each input data table. How can we persist this small piece of data?

Let's put it in a file. Sure, that'll work.[1] 

Next step, make an abstraction. Each of three configurations needs its own "how to read the bookmark" and "how to write the bookmark."[2] What can we name it?

After some discussion we notice this is basically a Clojure atom - a "place" to store data that can change - except persistent between runs.

Eli googles "clojure persist atom to disk" and bam! He finds a library. Enduro, by @alandipert. Persistent atoms for Clojure, backed by a file or Postgres. Complete with an in-memory implementation for testing. And thread safety, which we would not have bothered with. Hey, come to think of it, Postgres is a better place to store our bookmarks.
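A sketch of what using it looks like (consult Enduro's README for the exact API; the file path and keys here are made up):

```clojure
;; Enduro atoms work like Clojure atoms, but persist on every swap!/reset!.
(require '[alandipert.enduro :as e])

;; A file-backed atom holding our bookmarks map. Path is hypothetical.
(def bookmarks (e/file-atom {} "/tmp/bookmarks.clj"))

;; Note: enduro's own swap!, not clojure.core's -- it writes through to disk.
(e/swap! bookmarks assoc :input-table "row-12345")

@bookmarks  ; deref like a normal atom
```

Swapping the file-atom constructor for the Postgres-backed one changes where the bookmark lives without touching the reading and writing code.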

From a need to an abstraction to an existing implementation! with better ideas! win!

Enduro has no commits in the last year, but who cares? When a library is small enough, it reaches feature-completion. For a solid abstraction, there is such a thing as "done."

Now, it happens that the library isn't as complete as we hoped. There are no tests for the Postgres implementation. The release! method mentioned in the README doesn't exist.

But hey, we can add these to the library faster and with less risk than implementing it all ourselves. Alan's design is better than ours. Building on a solid foundation from an expert is more satisfying than building from scratch. And with pull requests, everybody wins!

This is re-use at its best. We paused to concentrate on abstraction before implementation, and it paid off.

[1] If something happens to the file, our program will require a command-line argument to tell it where to start.

[2] In OO, I'd put that in an object, implementing two single-method interfaces for ISP, since each function is needed in a different part of the program. In Clojure, I'm more inclined to create a pair of functions. Without types, though, it's hard to see the meaning of the two disparate elements of the pair. The best we come up with is JavaScript-object-style: a map containing :read-fn and :write-fn. At least that gives them names.
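Something like this, as a sketch (the names and the file path are hypothetical):

```clojure
;; A "bookmark store" as a map of two named functions,
;; JavaScript-object-style: names without types.
(def file-bookmark-store
  {:read-fn  (fn []
               (let [f (java.io.File. "bookmark.edn")]  ; hypothetical path
                 (when (.exists f)
                   (read-string (slurp f)))))
   :write-fn (fn [bookmark]
               (spit "bookmark.edn" (pr-str bookmark)))})
```

Each part of the program takes only the function it needs, which is the same separation the two single-method interfaces would give in OO.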

REST as a debugging strategy

In REST there's this rule: don't save low-level links. Instead, start from the top and navigate the returned hyperlinks, as they may have changed. Detailed knowledge is transitory.
This same philosophy helps in daily programming work.

Say a bug report comes in: "Data is missing from this report." My pair is more familiar with the reporting system. They say, "That report runs on machine X, so let's log in to X and look at the logs."

I say, "Wait. What determines which machine a report runs on? How could I figure this out myself?" and "Are all log files in the same place? How do we know?"

The business isn't in a panic about this report, so we can take a little extra time to do knowledge transfer during the debugging. Hopefully my pair is patient with my high-level questions.

I want to start from sources of information I can always access. Deployment configuration, the AWS console, etc. Gather the context outside-in. Then I can investigate bugs like this alone in the future. And not only for this report, but any report.

"How can we ascertain which database it connected to? How can I find out how to access that database?"
"How can I find the right source repository? Which script runs it, with which command-line options? What runs that script?"

Perhaps the path is:
- deployment configuration determines which machine, and what repository is deployed
- cron configuration runs a script
- that script opens a configuration file, which contains the exact command run
- database connection parameters come from a service call, which I can make too
- log files are in a company-standard location
- source code reveals the rest.

This is top-down navigation from original sources to specific details. It is tempting to skip ahead. If both of us already knew the whole path and had confidence nothing changed since last week, we might skip straight to the dirty details: go right to the log file and database. If that didn't solve the mystery, we'd step back and trace from the top, verifying assumptions, looking for surprises. Even when we "know" the full context, tracing deployment and execution top-down helps us pin down problems.

Debugging strategy that starts from the top is re-usable: solve many bugs, not just this one. It is stateless: not dependent on environmental assumptions that may have changed when we weren't looking.

REST as more than a service architecture. REST as a work philosophy.

Monday, October 27, 2014

Software is a tree

Maybe software is like a tree.

The applications, websites, games that people use are the leaves. They're the important part of the tree, because they're useful. They solve real problems and improve lives, they make money, they entertain.

The leaves are built upon the wood of the tree, branches and trunk: all the libraries and languages and operating systems and network infrastructure. Technologies upon technologies, upon which we build applications. They're the important part of the tree because dozens to thousands of leaves depend on each piece of wood.

Which part of the software tree do you work on? Is it a leaf? Leaves need to be polished according to their purpose, monitored according to their criticality, grown according to users' real needs. Is it the wood? The wood needs to be strong and very well-tested, because it will be used in ways its authors did not imagine.

Ever find yourself in between, changing an internal library that isn't tested and documented like a strong open-source library, but isn't specific to one application either? I've struggled with code like this; we want to re-use it but when we write it we're trying to serve a purpose - grow the leaf - so we don't stop to write property tests, handle edge cases, make the API clear and flexible. This code falls in the uncanny valley between wood and leaf. I don't like it.

Code that is reusable but not core to the business function is best shared. Use another library, publish your own, or else stagnate in the uncanny valley. Follow the mainstream, form a new mainstream, or get stuck in a backwater.

With a few days set aside to focus on it, we could move that shared code into the wood, and release it as open source. Make it public, then document, test, and design with care. When we graft our twig into the larger tree, it can take nourishment from the contributing community, and maybe help more leaves sprout.

If the code isn't useful enough to release as even a tiny open-source library, then it isn't worth re-using. Build it right into multiple leaves if multiple leaves need it. App code is open for modification, instead of the closed-for-modification-but-needs-modification of the shared internal library.

With care, ours can be a healthy tree: millions of shiny custom leaves attached to strong shared branches.