Friday, July 27, 2012

Don Quixote was an Enterprise Architect

We need to invent a word:
?: (n) The goal that you aim for, not because you expect to hit it, but because aiming for it points you in the right direction.
Julian Browne in The Death of Architecture accepts that while an Enterprise Architecture will never be achieved, it can direct our efforts: "Great ideas and plans are what inspires and guides people to focus on making those tactical changes well."

Having a plan is good. Sticking religiously to the plan is bad. The business is constantly changing, and so the architecture should align with the current status and the latest goals. Architecture is a map, but we operate in Street View. The map is always out of date.

So it is in life. As people we are constantly changing. There are objectives to shoot for, some concrete and achievable, others unrealistic but worth trying. Aim for enlightenment, if you like, but redefine what enlightenment means with each learning. Embrace change, and direct it by choosing where to focus your attention. If our goal is readable code, we're never going to transform our entire legacy codebase to our current coding standards. Our coding standards will have changed by then. Instead, as we modify pieces of the code, we make these parts the best they can be according to the direction we've set.

Aim for the mountaintop, but recalibrate the location of that top. Appreciate each foot of ascent, as the good-software mountain is constantly shifting, and each improvement -- each refactor or new test or simplification -- counts as progress. An architecture achieved is an architecture out of date - touch the sky, find that it is made of paper, tear through it and aim again for the highest thing you can see.

Tuesday, July 24, 2012

What makes a functional programmer?

Michael O. Church has a lot to say about the functional programming community. His post, Functional Programming is a Ghetto (in the "isolated, exclusive neighborhood" sense of ghetto), contains some great descriptions of what goes through a programmer's head after he or she learns to think in a functional style. The following is a paraphrase of his points, showing that the best functional programmers aren't religious about it.

"What real functional programmers do is 'multi-paradigm'– mostly functional, but with imperative techniques used when appropriate." Writing to a database or to the console or calling services is what makes an application useful. Instead of eschewing these, we try to make all dependencies  and influence on the environment localized and explicit. 
The difference between imperative and functional thinking is "what should be the primary, default 'building block' of a program. To a functional programmer, it’s a referentially-transparent (i.e. returning the same output every time per input, like a mathematical function) function. In imperative programming, it’s a stateful action."
Imperative thinks of every line in the code as an action, as doing something. Functional thinks of every line as calculating something, and very specific lines as performing an action with external impact (database access, external API calls, etc). "Immutable data and referentially transparent functions should be the default, except in special cases where something else is clearly more appropriate." 
The result of isolating those external effects is that more code is easily testable. More code can be properly unit-tested, while specific code that interacts with the outside world can only be integration tested: "One needs to be able to know (and usually, to control) the environment in which the action occurs in order to know if it’s being done right" in imperative code. Those are the failure points in our application. Don't bury them under mounds of indirection or scatter them throughout your code.
In sum, "we don’t always write stateless programs, but we aim for referential transparency or for obvious state effects."
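Here's a minimal sketch of that split in Java. Everything in it is made up for illustration (InvoiceDemo, totalCents, and report are not from the post): one referentially transparent calculation, and one clearly marked action whose effect on the outside world is passed in explicitly.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

class InvoiceDemo {
    // Pure calculation: the same inputs always produce the same output,
    // and nothing outside this method is read or changed.
    static long totalCents(List<Long> lineItemCents, int taxPercent) {
        long subtotal = 0;
        for (long cents : lineItemCents) {
            subtotal += cents;
        }
        return subtotal + subtotal * taxPercent / 100;
    }

    // The one stateful action. The destination is an explicit parameter,
    // so the effect is called out rather than buried.
    static void report(long totalCents, Consumer<String> out) {
        out.accept("total=" + totalCents);
    }

    public static void main(String[] args) {
        long total = totalCents(Arrays.asList(1000L, 250L), 8);
        report(total, System.out::println); // the only line that touches the console
    }
}
```

The calculation can be unit-tested with plain values. The action can be handed a fake destination in a test, and the real console (or database, or service) only when it runs for real.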

Without the Ghetto-slang of referential transparency, partial application, and reasoning about code, functional thinking boils down to: don't change shit until you have to, and when you have to, call it out.

Friday, July 20, 2012

Database versioning, Android style

Question: how can we track database schema upgrades? How can we make sure our database structure matches the deployed code?

One answer: in Android's SQLite database, they solve this problem by storing a version number in a database property. When an application opens a connection, the version number in the code is checked against the version number in the database. If they don't match, Android calls a hook to let the application update the schema.

For our web app purposes, we used the idea of storing the version number in the database. We threw it in a table. Our upgrade scripts are separate from the code, but they still do the job of converting data, applying DDL changes, and finally increasing the version number in the database.

An application should fail early if the database is out of date. For this, we created a ConnectionProvider that validates the version number when a database connection is established. Thus, there is a version stored in the database and a version hard-coded into the application. If we forget to upgrade the dev or test database before deploying the corresponding code, we find out on startup.
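A rough sketch of that startup check, not our actual ConnectionProvider: the class name, the expected version of 7, and the IntSupplier are all assumptions for illustration. In a real provider the supplier would be a query against the version table, run as each connection is established.

```java
import java.util.function.IntSupplier;

class VersionCheckDemo {
    // The version this build of the code expects (hypothetical value),
    // bumped in the same change as each upgrade script.
    static final int EXPECTED_SCHEMA_VERSION = 7;

    // Compare the version stored in the database with the hard-coded one
    // and fail fast on a mismatch.
    static void validate(IntSupplier databaseVersion) {
        int actual = databaseVersion.getAsInt();
        if (actual != EXPECTED_SCHEMA_VERSION) {
            throw new IllegalStateException(
                "Database schema is version " + actual
                + " but this build expects " + EXPECTED_SCHEMA_VERSION);
        }
    }

    public static void main(String[] args) {
        validate(() -> 7); // versions match: startup continues
        try {
            validate(() -> 6); // database was not upgraded: refuse to start
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```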

Programming experience in a different environment made our lives easier this day. Design was simple because we were aware of a pattern that worked in one architecture. It adapted well to ours.

using checkstyle in gradle

The checkstyle plugin brings code quality checks into the gradle build. This way, if team members are using disparate IDEs or have different settings, consistent coding standards are enforced in the build.

Here's how to add checkstyle to your Java build in two steps:
1) In build.gradle, add this:

    apply plugin: 'checkstyle'

    checkstyle {
       configFile = new File(rootDir, "checkstyle.xml")
    }

The configFile setting is not necessary if you use the default location, which is config/checkstyle/checkstyle.xml. More configuration options are listed in the gradle userguide.

If yours is a multiproject build, put that configuration in a subprojects block in the parent build.gradle. This uses the parent's rootDir/checkstyle.xml so the checkstyle configuration is consistent between projects.

2) Create checkstyle.xml. For reasonable default settings, google it and steal someone else's. I took mine from a google-api repo and stripped it down. Here's a really basic example, which checks only for tab characters and unused imports:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE module PUBLIC
    "-//Puppy Crawl//DTD Check Configuration 1.3//EN"
    "http://www.puppycrawl.com/dtds/configuration_1_3.dtd">
<module name="Checker">
  <module name="FileTabCharacter"/>
  <module name="TreeWalker">
    <module name="UnusedImports"/>
  </module>
</module>
Visit the checkstyle site to find a million more options for checks.

Thursday, July 19, 2012

When a git branch goes bad

Want to merge some good code into the master branch, but stymied by code you don't care about causing a bunch of conflicts?
The other day, some crap got committed to master and pushed, accidentally. It didn't belong. Since it was pushed to origin, we don't want to change history*. When it was time to do a git-flow release, a bunch of conflicts stymied the merge from release branch into master. I wanted to say, "Merge this branch into master, but take all the code from the branch; don't worry about what's currently in master."
There's a merge strategy for merging in a branch and ignoring the changes in the branch: the "ours" merge strategy. But there's no "theirs" merge strategy for choosing all the code from the branch. Here is one way to accomplish this:
  1. Start the merge
if using git-flow to do a release: git flow release finish versionName
general case of merging something into master (or whatever branch you like; master is an example):
git checkout master
git merge branchWithGoodCode
This leaves the merge open, with conflicts. git status shows the files successfully merged (these are in the index) and the files with conflicts (not yet in the index). While the merge is in progress, git is in a special state. The objective is to get all the right files into the index and then commit, which will complete the merge.
  2. Get all the code from the branch
git checkout MERGE_HEAD -- .
Here, MERGE_HEAD means "the tip of the branch we're trying to merge in." MERGE_HEAD is a ref (pointer to a commit) that exists while a merge is in progress. This command pulls the files from there into the working directory and index.
The "." at the end is important: when a path is provided as the last argument to git checkout, then git updates the files without changing your current branch. git checkout with no path will switch to that branch. (The -- is optional, but it makes that . harder to miss.)
  3. Commit to finish the merge
git commit -m "Merge branchWithGoodCode, taking all files from branchWithGoodCode"
Hurray, the merge is done: master has a new merge commit with both branches as its parents.

* if the commits that I don't care about existed only locally, not on origin, then I could wipe those commits from the history entirely. 
git checkout master
git reset --hard goodBranch
The reset says "take my current branch and move it to point at the same commit as goodBranch." (Keep in mind that a branch is nothing but a label, a pointer to a commit.) The "--hard" says "and while you're at it, replace everything in my working directory with the goodBranch code." The commits that were only on master are gone from the tree, and eventually forgotten.

Git Happens: the movie

If you're using git but you don't really get it, this video is for you. In forty minutes, go from "git is hard" to grasping reset, merge, fast-forward, and (simple) rebase.

Or better yet - come to Chattanooga on 30 August 2012 to see this talk live and expanded at DevLink!

Git Happens on Vimeo

Thanks to Michael Bradley of the St. Louis JavaScript Meetup for creating the video.

Tuesday, July 10, 2012

Real-life git-flow

The concepts of git-flow are elegant and simple, but the examples show only one repository. Using git-flow on a team is a little more complex. This post endeavors to describe the setup process and branching considerations when git-flow is used in a team environment.

If you're the person who wants to bring git-flow into your team, this post is for you. Everyone else on the team needs these concepts of git-flow.

Scenario: the team has been developing on a master branch for a while, and now that we've made our first release, it's time to implement git-flow. Each of our local repositories tracks origin, which is a bare repo living on a server. So far, origin has just one master branch.

Converting an existing repository to git-flow

For the basics, everyone needs a develop branch, origin needs a develop branch, and our develop branches need to track origin.

To set up git-flow, one person runs this:
git flow init
git push -u origin develop
git branch --set-upstream develop origin/develop
Note: if you have any branches lying around other than master, git flow init will not create the develop branch for you; it wants you to use one of the others. Either delete those branches or create the develop branch manually (git branch develop) before initializing git-flow.
git push -u origin develop creates the develop branch on origin. The "-u" option means "set-upstream," which modifies your develop branch to track the develop branch on origin. After this, git status will tell you when your branch is ahead of origin, behind origin, or both.

Everyone else on the team runs this:
git flow init
Note: if these other guys have a branch lying around other than master, git flow will refuse to create the develop branch. Either delete these branches or create the develop branch manually (git branch develop origin/develop) before initializing git flow. The "origin/develop" at the end of that command sets up the tracking branch.

Multiple Branches + Multiple Team Members

Now that origin has two branches, and each team member is tracking both of them, there are a few other considerations.


If you're used to using git pull, please stop. Please use git fetch. This downloads all the new stuff from origin without modifying your local branches or working directory. Decide which local branches you want to update; when you're ready, rebase.

What used to be "git pull" becomes:

Fetch the updates from origin.
git fetch

Observe whether we have changes.
git status

Rebase to tack my changes on the end of the changes from origin.
git rebase

Why rebase instead of merge? That's a subject for a whole post (or a presentation! come see me at DevLINK), but the short answer is: a non-fast-forward merge creates merge bubbles on the develop branch. Each bubble is evidence that development forked, plus a merge commit with no story to tell. The git-flow model is cleaner if merges are for features. This is my opinion.


I recommend this magic spell:
git config push.default current
Run this once to tell git push that it should (by default) push only the current branch. This way, when you make commits on develop and push them to origin, you won't be confused by error messages that occur because your master branch is behind. 

Sharing branches

Feature and release branches exist only in the repository of the team member who created them. Chances are good you'll want multiple programmers working on these at some point - for instance, when everyone is fixing bugs found in test on the release branch. There's a detailed post on sharing branches by GitGuys, but I'll give you the quick rundown here.

One person starts the release and pushes it to origin:
git flow release start versionName
git flow release publish versionName

Whoever else wants to work on it does this to set up a tracking branch:
git flow release track versionName

Bonus hint: here's how to list branches on origin you're currently tracking:
git remote show origin

Now everyone can push changes to the release branch. When it's complete, one person finishes the release. 
  1. make sure master, develop, and the release branch are all fully up-to-date in your local, because finishing the release will affect all three of these. 
  2. git flow release finish versionName
  3. push both master and develop branches
  4. push the release tag to origin: git push origin versionName
  5. delete the release branch on origin: git push origin --delete release/versionName
Finally, everyone else should delete their release branch.
git branch -d release/versionName


If nothing else, always remember to push the release tag to origin. Anyone looking for that version of the code is going to look for it there. git push origin versionName

Like anything in git, git-flow is easy until it isn't. Remember that you have the full power of git at your disposal; the git-flow commands are only shortcuts. Wrap your head around the commit graph, know what's going on behind the scenes, and keep your chin up. Easy, right?

Choices with def and val in Scala

In Scala, one can define class properties with either "def" or "val." Using "def", one can define the property with or without parentheses, and with or without the equal sign. Why is this? What are the consequences?

For illustrative purposes, we'll use a block of code that prints to the console, then returns an int. This shows when the code runs.

import System.out._
class Carrot {
   val helloTen = { println("hello"); 10 }
}
When a val is initialized to a block of code, the code runs at construction, and the property contains 10. The block only runs once. You can see this in the REPL:

scala> :load carrot.scala
scala> val c = new Carrot()
hello
c: Carrot = Carrot@5fe2b9d1
scala> c.helloTen
res5: Int = 10

If you change val to def, that block of code instead becomes a method. It prints and returns 10 every time it is accessed.

import System.out._
class Carrot {
   def helloTen = { println("hello"); 10 }
}

Use the REPL to observe that the code runs at every access:

scala> :load carrot.scala
scala> val c = new Carrot()
c: Carrot = Carrot@5689a400
scala> c.helloTen
hello
res7: Int = 10
scala> c.helloTen
hello
res8: Int = 10

Now I'm going to stop showing you what the REPL prints out, because that's boring to read. Try it for yourself.

So "def" or "val" determines whether the block of code runs only at construction, or at every access. There's another consequence: in a subclass, you can override a def with a val or a def, but you can only override a val with a val. When Scala has a val, it knows the value of that expression will never change. This is not true with def; therefore declaring something as val says more than declaring a def.

Let's move on with "def" because it's more flexible for extension. Next decision: use parentheses or not? We can choose between
def helloTen = { println("hello"); 10 }
def helloTen() = { println("hello"); 10 }

The consequences of this decision are: if you include the parentheses in the definition, then the property can be accessed with or without parentheses. If you do not include parentheses in the definition, then the property must be accessed without parentheses.

Here's a summary of our options:
        With ()                        No ()
val     n/a                            runs at construction;
                                       override with val;
                                       access with no ()
def     runs at every access;          runs at every access;
        override with val or def;      override with val or def;
        access with or without ()      access with no ()

The idiomatic rule of thumb is: use parentheses if the method changes state; otherwise don't.

Now here's a subtlety. If we use def with or without parentheses, the property can be overridden in a subclass by a def with or without parentheses (or a val without parentheses). This has strange consequences: If I subclass Carrot and override the property, but change whether parentheses follow the property declaration, then the interface of the subclass does not match the superclass.

import System.out._
class Carrot {
    def helloTen = { println("hello"); 10 }
}

class Soybean extends Carrot {
    override def helloTen() = 14
}

On a Carrot, I can access helloTen only without parentheses. On a Soybean, I can access the property with or without parentheses. If I cast a Soybean to a Carrot, then I can access helloTen only without parentheses. Either way, the Soybean's helloTen property evaluates to 14, as a good polymorphic method should.

Stranger still, reverse it: if Carrot defines helloTen with parentheses and Soybean without, then a Carrot (or a Soybean cast to a Carrot) will helloTen with or without parentheses -- but a Soybean will only helloTen without parentheses! Therefore, a method call that works on the superclass errors on the subclass. Does this sound like a violation of LSP to you? Technically instances of the subclass can be substituted for instances of the superclass, but the interface of the subclass is smaller than that of the superclass. Wah? If this makes sense to you, please comment.

For another method-declaration subtlety, consider the equals sign.

I'm running Scala 2.9.2.

Sunday, July 8, 2012

Mental revolution

The book The Structure of Scientific Revolutions (1962, Thomas Kuhn) brought the word paradigm into common use. Ideas from fifty years ago about research science apply today to computer science and the software industry. This is one parallel Kuhn makes, extended to illuminate how we think about programming.

The human mind does not perceive objects incrementally, feature by feature. We grasp images as coherent wholes. Sometimes when we step back, we can see the object in a new way. When we step back even farther, we might perceive it as something more basic and universal.

For instance, consider the classic duck-rabbit drawing:
What do you see first, the duck or the rabbit? Now that I mention them, do you see them both? You can switch back and forth between them at will. If you stare at the picture long enough, you might see it instead as a collection of lines. You can perceive these lines as forming a duck or a rabbit, but in both cases you see the more universal nature of the lines.

This is like the wave-particle duality of light. A millennium ago, light was pictured as a particle. About four hundred years ago, Descartes observed that it behaved like a wave. Newton saw light as particles. Maxwell showed that light is waves. Einstein saw the particle nature of light again, and finally combined the two. Stepping back, it turns out that all particles have wave-like properties. Just as all duck drawings and all rabbit drawings are composed of lines, all matter has a quantized wave nature.

This same process of discovery can apply to software design. Take, for instance, the Strategy Pattern. Say we need to tell our SearchEngine at construction which RankingAlgorithm it should use. In an object-oriented design, such as for Java, it looks like this:
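A minimal sketch of that object-oriented shape, assuming documents are plain strings and that a lower rank sorts first. SearchEngine and RankingAlgorithm are the names from above; the rank method, ShortestFirst, and StrategyDemo are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class StrategyDemo {
    public static void main(String[] args) {
        // The strategy is chosen once, when the engine is constructed.
        SearchEngine engine = new SearchEngine(new ShortestFirst());
        System.out.println(engine.search(Arrays.asList("a long document", "short")));
    }
}

// The strategy interface: one implementation per ranking approach.
interface RankingAlgorithm {
    int rank(String document); // lower rank sorts first (an assumption of this sketch)
}

// One concrete strategy: shorter documents rank ahead of longer ones.
class ShortestFirst implements RankingAlgorithm {
    @Override
    public int rank(String document) {
        return document.length();
    }
}

class SearchEngine {
    private final RankingAlgorithm ranking;

    SearchEngine(RankingAlgorithm ranking) {
        this.ranking = ranking;
    }

    // Returns a sorted copy, ordered by whatever strategy was injected.
    List<String> search(List<String> documents) {
        List<String> results = new ArrayList<>(documents);
        results.sort(Comparator.comparingInt(ranking::rank));
        return results;
    }
}
```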
From a functional programming paradigm, it's simpler. For a first implementation, we might give the search a lambda or a reference to a ranking function that translates a document to something orderable.

search :: (Document -> Ord) -> Doc[] -> String

These are two ways to view the same solution - either we're implementing the Strategy pattern, or we're passing in a function. When both of these views make sense, take a step back and observe that in both cases we're doing the same thing: we're passing a behavior. Look at it as a Strategy or as function-passing, whatever is better suited to our environment. Understanding both views and the common principle behind them gives a broader perspective on design.
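To make the "passing a behavior" view concrete, here is the same idea sketched in Java with the ranking handed in as a plain function. This assumes Java 8 lambdas and string documents; FunctionPassingDemo and its search method are hypothetical names, loosely mirroring the signature above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;

class FunctionPassingDemo {
    // The ranking is just an argument: any function from a document
    // to something orderable will do.
    static <K extends Comparable<K>> List<String> search(
            Function<String, K> rankOf, List<String> documents) {
        List<String> results = new ArrayList<>(documents);
        results.sort(Comparator.comparing(rankOf));
        return results;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("pear", "fig", "banana");
        System.out.println(search(String::length, docs)); // rank by length
        System.out.println(search(doc -> doc, docs));     // rank alphabetically
    }
}
```

No interface and no named implementation class, but the caller is doing the same thing as before: choosing the ranking behavior and handing it over.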

This is a simple example, but the principle applies to many concepts in programming. Playing with Hadoop and then doing list processing in a functional language can show the common nature of map-reduce. Seeing the commonalities between two views of the same concept shows what is core to map-reduce and what is an implementation detail specific to Hadoop.

Get too focused on the duck, and you might never see the lines. In software, as in life, it helps to adopt a new paradigm. After we see a problem in a new light, we can step back and look for the broader concepts. This leads to a new level of understanding we can never see if we keep insisting that light is a wave and that picture has a beak.

Saturday, July 7, 2012

in which I exhibit sexism in the same tweet that I complain about it

The other day, a scientist announced the detection of the Higgs Boson in a presentation using the much-maligned Comic Sans font. Dialog about the discovery focused on the design faux pas. I made this tweet, which felt pretty clever at the time:

Other people found it pretty clever too. 76 retweets, 60k impressions, my most-retweeted ever by a factor of five.

Except.. the new elementary particle wasn't announced by a dude.

My apologies to Fabiola Gianotti. Congratulations, as well.

I made the assumption "prominent scientist = dude." Pardon me while I smack myself in the forehead.

Friday, July 6, 2012

Site search sanity

Search is a component of most web sites. Therefore, it is a problem solved many times before. The solution of choice (at least in the Java sphere) is Lucene. Insert documents in an index, build queries to find the ones you want. Lucene is a library, so there are a bunch of other tools that wrap Lucene for easier interaction.

Use Solr, use elasticsearch, use Lucene directly - you still have to figure out two things: get your documents into the index, and get the relevant ones out. This post is about getting them out in an ordinary site-search sort of scenario.

For our purposes here, the documents have been indexed with default elasticsearch mappings. This means their fields have passed through the default analyzer, which breaks them down into terms (words), puts these in lowercase, and throws out some extremely common words (stop words). The search text will go through the same default analyzer so that we're comparing apples to apples, and not Apples or APPLES or "apples and oranges."

What does a reasonable Lucene-style query for site search look like? There's documentation out there about the query API, all about what to type when you know what kind of query you want - but what kind of query do I want?

Our indexed documents each have two fields: name and description. The search should match some text against both fields. Handle some misspellings or typos. Emphasize exact matches in the name field. This seems pretty straightforward, but it isn't trivial. It involves a compound query combining an exact phrase match, exact term matches, and fuzzy term matches.


The outer query is a BoolQuery. A BoolQuery is more complex than AND/OR logic, because there's more to a query than a "yes" or "no" on each document -- there is an ordering to the results.

There are three ways to construct a BoolQuery:
  • only "must" components. This is like a big AND statement: each record returned matches every "must" query component.
  • only "should" components. This is a lot like an OR statement; each record returned matches at least one of the "should" query components. The more "should" components that match the record, the higher the record's rank.
  • a mix of "must" and "should" components. In this case, the "must" clauses determine which records will be returned. The "should" components contribute to ranking.
For the simplest site search, all the different text queries go in as "should" components. We're taking the giant-OR route.

Phrase Query

The first subquery is a TextQuery with type "phrase." This is elasticsearch parlance; Lucene has a PhraseQuery. The objective here is to find the exact phrase the user typed in. A slop of 1 means there can be one extra word in between the words in the phrase. Increasing the slop to 2 will match two-word phrases with the words out of order. Adding a boost of 4 tells Lucene that this query is 4 times as important as the other queries.

Text Query

The other text queries have a type of boolean (which is the default in elasticsearch). These will find any match on any of the terms in the search text. There is one for each field that I'm searching in.


Fuzzy Like This Query

The intention here is to match terms that are close to the ones entered. It'll match words that are a few characters off from the search text. The details are fuzzy to me, but that's compatible with this objective. Increasing maxQueryTerms increases the flexibility of the match but can slow performance. Minimum similarity can be raised (toward 1) if the users think your matches are too flexible. Prefix length could be set to 1 to require that the first letters of the words matched are the same. Finally, a boost of less than one reduces the ranking of these matches compared to the exact term matches from the text queries.

The result of all this is a search that emphasizes exact matches. It includes near-matches, but puts them toward the end of the results.
If you have any suggestions to make this better, please, leave a comment.

For elasticsearch users, this is the code in the Java API:

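BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();

// exact phrase in the name field: slop of 1, boosted 4x
boolQuery.should(
  new TextQueryBuilder("name", text).type(Type.PHRASE).slop(1).boost(4));
// any-term matches, one per field
boolQuery.should(
  new TextQueryBuilder("name", text).type(Type.BOOLEAN));
boolQuery.should(
  new TextQueryBuilder("description", text).type(Type.BOOLEAN));
// fuzzy matches across both fields; maxQueryTerms, minSimilarity,
// prefixLength, and boost are tuned here as described above
boolQuery.should(
  new FuzzyLikeThisQueryBuilder(
     new String[] {"name", "description"})
     .likeText(text));

final SearchRequest searchRequest =
  new SearchRequestBuilder(client).setIndices("indexName")
    .setQuery(boolQuery).setFrom(0)
    .request();

final SearchResponse response = client.search(searchRequest).actionGet();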

Tuesday, July 3, 2012

the skinny on git-flow

Today we switched over from chaotic use of git to the more systematic git-flow. We made our first release and finally have production code to track, so it is time to get organized. Here's the fifteen-minute rundown for coworkers:

Git-flow is a methodology for using git, and a git extension that streamlines it. Here's what you need to know.

Immortal Branches

There are two branches that last forever. 
master: Every commit on master is a release version. Every commit is tagged with its version number. Before code is merged into master, it's been through all testing and is ready for prod. If you automate production deployment, it'll work off the master branch.  
develop: Work happens on the develop branch. Small changes are direct commits here. Features will be merged in here. develop is always ready for new functionality.

Mortal Branches

There are three species of branches that come and go. First, let's take a look at how code gets from develop to master.

release: When the app is feature-complete and ready to release to test, it's time for a release branch. The release branch is named release/versionNumber. Release branches start from develop. Once a release branch is created, develop represents the next version of the app, while the release branch holds the version under test. Commit any fixes to the release branch. 

When testing is complete, finish the release. The release branch is merged into master (and develop, so the fixes get into the next version). Goodbye, release branch. Your commits will remain in our tree forever, but your name is no longer needed.
The next mortal branch is familiar to git developers, even renegades like us who have been committing to master for a year.
feature: A feature is a piece of new functionality that will include several commits, span more than a few hours of development, or be worked on by multiple people. Branching groups the commits together for posterity. It allows me to switch at will between work on the feature and fixes or investigations on develop. Starting a feature creates a feature/featureName branch off develop:
git flow feature start featureName

Finishing a feature merges these commits back into develop. The feature branch is deleted, but the commits are still grouped. The merge has the feature branch name in its commit message for posterity(1). Note that a feature can start work while one release is under development, and wind up in the next release.

What happens when bug reports come in from the users? How do we get the fixes into production code after develop has moved on with work for the next release?

hotfix: The third species of mortal branch is hotfix. This branches off master, and fixes made here wind up in a new production release. The name of the hotfix branch is hotfix/newVersion.
When the hotfix is done, it is merged back into master, forming a new production version. It also gets merged back into develop, to keep develop up with the latest.

That's It

That's the process of git-flow. Two immortal branches, three mortal branch species. master contains only tagged production releases, while develop is always ready for new functionality. Features come off develop and go back to develop. Releases move code from develop to master, with testing in between. Hotfixes branch off master and go back to master as a new release.

Minor complications occur when this system is implemented on a real team across repositories, but that is another post.

(1) If the feature branch has only one commit, then finishing might do a fast-forward with no merge commit.

Monday, July 2, 2012

Saint Louis, Technology City Extraordinaire

Saint Louis is a sweet spot to be a developer right now because there are tons of jobs.

Saint Louis is a sweet spot to be a passionate developer because we have an amazing user group community. There are active user groups for Java (two), Ruby, JavaScript, and .NET (two). There are user groups for mobile, Apple, Hadoop, vim, and Perl. Then there's Lambda Lounge for the really interesting stuff that you won't use in your day job. If you want to skip the talking and just code, you can't beat Code Til Dawn!

All of these groups are friendly to attendees and speakers. We have many great recruiting firms that sponsor food and space, sometimes even flying speakers in from out of town.

Now here's the real skinny - the user groups that I attend regularly ranked according to alcohol content:

1. STLJS - it's above a bar. The bar has scotch. It closes at midnight, but we're still drinking and talking about code past that. Food: sandwiches
2. Lambda Lounge - we bring beer, and go to the Hive (nearest bar) for more drinks afterward. Food: pizza
3. ALT.NET - some of us bring beer, and at least once the speaker brought vodka. Food: snacks at best
4. JUG - no beer. No bar. Only technical. But - Ted Neward in July! Food: pizza

Special kudos to ALT.NET, because Nick recruits some of the best speakers: Steve F-in Bohlen, Ken Sipe, me. It is a more intimate user group than the others, with a smaller venue and great discussions, and beer. The topics are technical and at a good depth. Now if only I could get those guys to the bar afterward, it'd be perfect.