Line endings in git

Git tries to help translate line endings between operating systems with different standards. This gets sooo frustrating. Here’s what I always want:

On Windows:

git config --global core.autocrlf input
This says, “If I commit a file with the wrong line endings, fix it before other people notice.” Otherwise, leave it alone.

On Linux, Mac, etc:

git config --global core.autocrlf false
This says, “Don’t screw with the line endings.”


git config --global core.autocrlf true
This says, “Screw with the line endings. Make them all include carriage return on my filesystem, but strip the carriage returns when I commit, so the shared repository never sees them.” This is not necessary.

Windows and Linux on the same files:

This happens when you’re running Linux in a Docker container and mounting files that are stored on Windows. Generally, stick with the Windows strategy of core.autocrlf=input, unless you have .bat or .cmd files (Windows batch scripts) in your repository.

The VS Code docs have tips for this case. They suggest setting up the repository with a .gitattributes file that says “mostly use LF as line endings, but .bat and .cmd files need CR+LF”:

* text=auto eol=lf
*.{cmd,[cC][mM][dD]} text eol=crlf
*.{bat,[bB][aA][tT]} text eol=crlf


When git is surprising you:

Check for overrides

Within a repository, the .gitattributes file can override the autocrlf behavior for all files or sets of files. Watch out for the text and eol attributes. It is incredibly complicated.

Check your settings

To find out which one is in effect for new clones:
git config --global --get core.autocrlf

Or in one repository of interest:
git config --local --get core.autocrlf

Why is it set that way? Find out:
git config --list --show-origin
This shows all the places the settings are set. Including duplicates — it’s OK for there to be multiple entries for one setting.
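
When the settings look right and a file still surprises me, it can help to ask git what it sees for each file. Here is a minimal sketch, assuming a git recent enough to support git ls-files --eol (2.8 or newer, as far as I know). The scratch repository and the file names are made up for illustration.

```shell
# Hypothetical scratch repository, only for illustration.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .

# One file with CR+LF endings, one with LF alone.
printf 'a\r\nb\r\n' > crlf.txt
printf 'a\nb\n' > lf.txt

# Stage them with conversion switched off, so the index keeps what is on disk.
git -c core.autocrlf=false add crlf.txt lf.txt

# For each file: line endings in the index (i/), in the working tree (w/),
# and any text/eol attributes that apply (attr/).
eol_report=$(git -c core.autocrlf=false ls-files --eol)
printf '%s\n' "$eol_report"
```

In a real repository, only the last command is needed. A file whose i/ and w/ columns disagree is one that git is converting.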

Why does this even exist?

Historical reasons, of course! (If you have a Ruby Tapas subscription, there’s a great little history lesson on this.)

Back in the day, many Windows programs expected files to have line endings marked with CR+LF characters (carriage return + line feed, or \r\n). These days, these programs work fine with either CR+LF or with LF alone. Meanwhile, Linux/Mac programs expect LF alone.

Use LF alone! There’s no reason to include the CR characters, even if you’re working on Windows.

One danger: new files created in programs like Notepad get CR+LF. Those files look like they have \r on every line when viewed in Linux/Mac programs or (in code) read into strings and split on \n.
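
To see those stray \r characters for yourself, here is a small sketch using only standard Unix tools. The sample file is made up, and stripping with tr is my stand-in for the conversion, not what git does internally.

```shell
# A throwaway file with CR+LF line endings, like Notepad would save it.
sample=$(mktemp)
printf 'first line\r\nsecond line\r\n' > "$sample"

# A literal carriage return character, to use as a search pattern.
cr=$(printf '\r')

# Count the lines that contain a carriage return.
lines_with_cr=$(grep -c "$cr" "$sample")

# Strip every CR, which is roughly what a CR+LF to LF conversion amounts to.
cleaned=$(mktemp)
tr -d '\r' < "$sample" > "$cleaned"

# grep -c exits non-zero when nothing matches, hence the "|| true".
lines_with_cr_after=$(grep -c "$cr" "$cleaned" || true)

echo "before: $lines_with_cr, after: $lines_with_cr_after"
```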

That’s why, on Windows, it makes sense to ask git to change line endings from CR+LF to LF on files that it saves. core.autocrlf=input says, screw with the line endings only in one direction. Don’t add CR, but do take it away before other people see it.


I love ternary booleans like this: true, false, input. Hilarious! This illustrates: don’t use booleans in your interfaces. Use enums instead. Names are useful. autocrlf=ScrewWithLineEndings|GoAway|HideMyCRs

CSS Positioning: a summary

For years, I’ve been frustrated every time I try to grasp CSS. “I just want this on the left and this to be over here!” etc. Now I realize that CSS doesn’t work that way for very good reasons. In most programming, we give instructions for what we want to happen. But in CSS, it’s more like we are describing a situation (relationships) and then letting the browser figure it out. That’s because the browser has to handle many different circumstances: resolutions, interfaces, font sizes. I describe how the parts go together, it figures out how to put them on the screen.

When I get upset about the properties for one part being dependent on the properties of its parent containers and siblings, it’s OK Jess: remember that CSS is about interrelationships, so this is normal.

Having got that, I’m now able to learn about how to put things on the screen without yelling in frustration and confusion.

So far, I’ve looked up the position property and learned that it doesn’t do much. There are other ones like display that seem more important. But meanwhile here’s my executive summary:

The position property determines two things: whether the element participates in document flow, and what the properties top bottom left right do. These are the useful ones:

position: static – the default. Stay in document flow; top bottom left right do nothing.

position: relative – nudge. Stay in document flow. top bottom left right nudge the element away from that edge of where it would have been (top: 10px moves it 10px down). This also affects child element positioning: a relatively-positioned element becomes the reference point for its absolutely-positioned descendants, including ::before pseudo-elements (weird CSS tricks).

position: absolute – override. Remove the element from document flow. top bottom left right tell it exactly where to be within the document (or within the nearest ancestor up the tree whose position is anything other than static).

position: fixed – override and hold weirdly still. Remove the element from document flow. top bottom left right tell it exactly where to be within the viewport. That means within the browser window (or the iframe if it’s in one). When the page scrolls, this element stays in the same place. People use this for menus.

Please lmk if you have corrections.

Migrating some services from AWS to Pivotal Web Services

My objective is to run some services on Pivotal Web Services (PWS; hosted instance of Pivotal Cloud Foundry), and have them respond to requests to `` at various paths. Currently these services run on AWS, along with services that respond at other subdomains of

TL;DR: this is easy enough for HTTP requests and prohibitively difficult for real HTTPS, for only one subdomain.

This post describes some tricky bits in this process, and the bits that leave me stuck.

Prerequisites: I have PWS set up and a few apps deployed. Meanwhile all our existing infrastructure runs on AWS.

First: multiple apps responding at

The instructions tell me how to point my own domain at a single app in PWS, but I want multiple apps to serve paths from my domain. The caller should not know or care which service is responding to its request for a resource.

To do this, I set up a route in Cloud Foundry, with a hostname (which seems to be PCF’s name for the third-from-the-right segment of the domain name, anyone know why?) that doesn’t correspond to any one app.

`cf create-route jessitron --hostname satellite-of-love`

Here, jessitron is my space in PWS. is PWS’s domain, this gets requests into Cloud Foundry for routing. satellite-of-love is a domain name that I like, it matches my github org.

That path is going to 404, but I have called dibs on It’ll route to my jessitron space and no one else’s.

Now I can make routes for each endpoint and tell it which app serves it. For the /vote endpoint on Kitty Survey, I have an app running called london, so I hook that up:

`cf map-route london --path /vote`

Now I can hit and my london app receives a request at path /vote. This is good for testing.

This part totally works with HTTPS. If you don’t mind changing your clients to point to this URL, stop here.

Second: HTTP: pointing to

This is DNS setup. We happen to use AWS Route53 for this. I go into the AWS console to set up a CNAME record for -> There was one tricky bit to this in Route53: I clicked on the existing record (if it didn’t exist I would click Create Record Set), and tried to enter my target BUT NO
It was all “The record set could not be saved because: Alias Target contains an invalid value.”

Here’s the trick: choose Alias: No.

With a regular CNAME (the Alias ones are an internal-to-AWS thing), I can route to an external domain from Route53.

Next, over in Cloud Foundry land, I can tell it about this domain.

 `cf create-domain atomist`

Here, atomist is my PWS org. Then I tell it to send requests to my space please:

`cf create-route jessitron`

And then I create routes for each of the endpoints, but with this new domain. (I’m pretty sure this is necessary.)

`cf map-route london --path /vote`

I’ll need to make these two routes (or at least the last one) for every endpoint I add to my service. Soon I’ll add this to my “add REST endpoint” automation in Rug.

Third: security certificates and https

There are two ways to get an HTTPS:// endpoint on PWS. They recommend using CloudFlare, which can be free. There are two problems with that. 

CloudFlare -> Cloud Foundry

The first is, to route anything at through CloudFlare, I have to route through CloudFlare. I have to change the routing for my entire company. 🙁
Even if I did reroute our whole domain through CloudFlare, the second problem appears: I can get the appearance of security but not actual end-to-end SSL. The easy option to choose is “Flexible”, meaning users get SSL between the browser and CloudFlare and it looks secure to them, but behind the scenes it’s HTTP between CloudFlare and my app. This seems unprofessional to me, letting everyone’s requests happen without SSL behind the scenes while telling them it’s secure.
The other option to choose is “Full SSL,” but then I need SSL on Cloud Foundry anyway, so …

SSL in Cloud Foundry 

There’s a Pivotal SSL service available in the PWS marketplace for SSL termination. For $20/month (they don’t mention that in the documentation), it’ll let you upload one certificate.

Currently, we use AWS Certificate Manager, which provides free certificates that only work on AWS.
Can I get that for separately, while leaving the rest of alone? I’m going to try that, from some other source — but not today.

Therefore, because our security certificates are tied to AWS
and because I decline to change the routing of our entire domain in order to experiment with this subdomain,
I give up. My toy website doesn’t need HTTPS anyway.
The moral is: if you want to experiment with moving part of your infrastructure off of AWS (designated by a subdomain), be prepared to change how requests are routed to the root domain.
Thank you to Glenn Oppegard for information about SSL on PWS, and Simon Brown who is finding this just fine with CloudFlare. 

git: handy alias to find the repository root

To quickly move to the root of the current git repository, I set up this alias:

git config --global alias.home 'rev-parse --show-toplevel'

Now, git home prints the full path to the root directory of the current project.
To go there, type (Mac/Linux only):

cd `git home`

Notice the backticks. They’re not single quotes. This executes the command and then uses its output as the argument to cd.
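
Bash has a second spelling for the same thing, $(...), which nests more easily and is harder to mistake for quotes. Here is a small sketch using the command the alias expands to. The scratch repository and its nested directory are made up for illustration.

```shell
# Hypothetical scratch repository with a nested directory, for illustration.
repo=$(mktemp -d)
git init -q "$repo"
mkdir -p "$repo/src/main/scala"
cd "$repo/src/main/scala" || exit 1

# $(...) runs the command and substitutes its output, just like backticks.
# The double quotes keep a path containing spaces in one piece.
cd "$(git rev-parse --show-toplevel)" || exit 1

# Now the current directory is the repository root.
pwd
```

With the alias in place, cd "$(git home)" does the same job as the backtick version.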

This trick is particularly useful in Scala, where I have to get to the project root to run sbt compile. (Things that make me miss Clojure!)

BONUS: handy alias to find the current branch

git config --global alias.whereami "rev-parse --abbrev-ref HEAD"

As in,

git push -u origin `git whereami`

Cropping a bunch of pictures to the same dimensions

Ah, command line tools, they’re so fast. And so easy to use on a Mac.

Given a bunch of image files with the same dimensions that you want to crop to the same fixed portion of the image:

1) Install imagemagick

brew install imagemagick

2) put all the images in a directory by themselves, and cd to that directory in the terminal

3) check the size of one of them using an imagemagick command-line utility:

identify IMG_1400.jpg
IMG_1400.jpg JPEG 960x1280 960x1280+0+0 8-bit sRGB 434KB 0.000u 0:00.000

Oh look, that one has a width of 960 and a height of 1280.

4) crop one of them, look at it, tweak the numbers, repeat until you get the dimensions right:

convert IMG_1400.jpg -crop 750x590+60+320 +repage test.jpg

Convert takes an input file, some processing instructions, and an output file. Here, I’m telling it to crop the image to this geometry (widthxheight+xoffset+yoffset), and then make the output size match what we just cropped it to.

The geometry works like this: move down by the y offset and to the right by the x offset. From this point, keep the portion below and to the right that is as wide as width and as tall as height.

5) Create an output directory.

mkdir output

6) Figure out how to list all your input files. Mine are all named IMG_xxxx.jpg so I can list them like this:

ls IMG_*.jpg
IMG_1375.jpg IMG_1380.jpg IMG_1385.jpg

7) Tell bash to process them all:[1]

for file in IMG*.jpg
do
  echo "$file"
  convert "$file" -crop 750x590+60+320 +repage "output/$file"
done

8) Find the results in your output directory, with the same names as the originals.

[1] in one line:
for file in IMG*.jpg; do echo "$file"; convert "$file" -crop 750x590+60+320 +repage "output/$file"; done
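
While tweaking the numbers in step 4, it can save a round trip to check that the crop rectangle actually fits inside the image: the width plus the x offset should not exceed the image width, and likewise for the height. Here is a sketch of that arithmetic. crop_fits is a name I made up, and it is not part of imagemagick.

```shell
# Succeeds when a WIDTHxHEIGHT+XOFFSET+YOFFSET crop lies inside the image.
# Arguments: image_width image_height crop_width crop_height x_offset y_offset
crop_fits() {
  [ $(( $3 + $5 )) -le "$1" ] && [ $(( $4 + $6 )) -le "$2" ]
}

# The 960x1280 image and the 750x590+60+320 crop from the steps above.
if crop_fits 960 1280 750 590 60 320; then
  fits=yes
else
  fits=no
fi
echo "crop fits: $fits"
```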

Spring cleaning of git branches

It’s time to clean out some old branches from the team’s git repository. In memory of them, I record useful tricks here.

First, Sharon’s post talks about finding branches that are ripe for deletion, by detecting branches already merged. This post covers those, plus how to find out more about the others. This post is concerned with removing unused branches from origin, not locally.

Here’s a useful hint: start with

git fetch -p

to update your local repository with what’s in origin, including noticing which branches have been deleted from origin.
Also, don’t forget to

git checkout master
git merge --ff-only

so that you’ll be on the master branch, up-to-date with origin (and won’t accidentally create a merge commit if you have local changes).

Next, to find branches already merged to master:

git branch -a --merged

This lists branches, including remote branches (the ones on origin), but only ones already merged to the current branch. Note that the argument order is important; the reverse gives a silly error.  Here’s a one-liner that lists them:

git branch -a --merged | grep -v -e HEAD -e master | grep origin | cut -d '/' -f 3-

This says, find branches already merged; exclude any references to master and HEAD; include only ones from origin (hopefully); cut out the /remotes/origin/ prefix.
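
To try the filters without touching a real repository, here is a sketch that feeds them canned input. The branch list is invented for illustration, and sample_branches stands in for git branch -a --merged.

```shell
# Made-up sample of what `git branch -a --merged` might print.
sample_branches() {
  printf '%s\n' \
    '* master' \
    '  old-local-branch' \
    '  remotes/origin/HEAD -> origin/master' \
    '  remotes/origin/faster-upsert' \
    '  remotes/origin/master' \
    '  remotes/origin/tanya_jess/newrelic'
}

# Same filters as the one-liner: drop HEAD and master, keep origin,
# then keep everything from the third /-separated field onward.
merged=$(sample_branches | grep -v -e HEAD -e master | grep origin | cut -d '/' -f 3-)
printf '%s\n' "$merged"
```

Branch names that contain a slash survive intact, because cut keeps every field from the third one on. Against a real repository, the one-liner lists the merged branches on origin.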

The listed branches are safe to delete. If you’re brave, delete them permanently from origin by adding this to the previous command:

 | xargs git push --delete origin

This says, take all those words and put them at the end of this other command, which says “delete these references on the origin repository.”

OK, those were the easy ones. What about all the branches that haven’t been merged? Who created those things anyway, and how old are they?

git log --date=iso --pretty=format:"%an %ad %d" -1 --decorate

is a lovely command that lists the author, date in ISO format (which is good for sorting), and branches and tags of the last commit (on the current branch, by default).

Use it on all the branches on origin:

git branch -a | grep origin | grep -v HEAD | xargs -n 1 git log --date=iso --pretty=format:"%an %ad %d%n" -1 --decorate | grep -v master | sort

List remote branches; only the ones from origin; exclude the HEAD, we don’t care and that line is formatted oddly; send each one through the handy description; exclude master; sort (by name then date, since that’s the information at the beginning of the line).

This gives me a bunch of lines that look like:

Shashy 2014-08-15 11:07:37 -0400  (origin/faster-upsert)
Shashy 2014-10-23 22:11:40 -0400  (origin/fix_planners)
Shashy 2014-11-30 06:50:57 -0500  (origin/remote-upsert)
Tanya 2014-10-24 11:35:02 -0500  (origin/tanya_jess/newrelic)
Tanya 2014-11-13 10:04:48 -0600  (origin/kafka)
Yves Dorfsman 2014-04-24 14:43:04 -0600  (origin/data_service)
clinton 2014-07-31 16:26:37 -0600  (origin/warrifying)
clinton 2014-09-15 13:29:14 -0600  (origin/tomcat-treats)

Now I am equipped to email those people and ask them to please delete their stray branches, or give me permission to delete them.

HDFS Capacity

How much data can our Hadoop instance hold, and how can I make it hold more?

Architectural Background

Hadoop is a lot of things, and one of those is a distributed, abstracted file system. It’s called HDFS (for “hadoop distributed file system,” maybe), and it has its uses.

HDFS isn’t a file system in the interacts-with-OS sense. It’s more of a file system on top of file systems: the underlying (normal) file systems each run on one computer, while HDFS spans several computers. Within HDFS, files are divided into blocks; blocks are scattered across multiple machines, usually stored on more than one for redundancy.

There’s one NameNode (computer) that knows where everything is, and several core nodes (Amazon’s term) that hold and serve data. You can log in to any of these nodes and do ordinary filesystem commands like ls and df, but those are reflecting the local filesystem. It knows nothing about files in HDFS. The distributed file system is a layer above; to query it, you have to go through hadoop. A whole ‘nother file manager, with its own hierarchy of what’s where.

Why? The main purpose is: stream one file faster. Several machines can read and process one file at the same time, because parts of the file are scattered across machines. Also, HDFS can back up files to multiple machines. This means there is redundancy in storage, and also in access: if one machine is busy it could read from the other. In the end, we use it at Outpace because it can store files that are too big to put all in one place.

Negatives? HDFS files are write-once or append-only. This sounds great: they’re immutable, right? until I do need to make a small change, and copy-on-mod means copying hundreds of gigabytes. We don’t have the space for that!

How much space do we have?

In our case (using Amazon EMR), all the core nodes are the same, and they all use the local drives (instance stores) to keep HDFS files. In this case, the available space is

number of core nodes * space per node / replication factor.

I can find the number of core nodes and the space on each one, along with the total disk space that HDFS finds available, by logging in to the NameNode (master node, in Amazon terms) and running

hadoop dfsadmin -report 

Here, one uses hadoop as a top-level command, then dfsadmin as a subcommand, and then -report to tell dfsadmin what to do. This seems to be typical of dealing with hadoop.

This prints a summary for the whole cluster, and then details for each node. The summary looks like:

Configured Capacity: 757888122880 (705.84 GB)
Present Capacity: 704301940736 (655.93 GB)
DFS Remaining: 363997749248 (339.00 GB)
DFS Used: 340304191488 (316.93 GB)
DFS Used%: 48.32%

It’s evident from 48% Used that I’m going to have problems when I make a copy of my one giant data table. When HDFS is close to full, errors happen.

Here’s the trick though: the DFS Remaining number does not reflect how much data I can store. It does not take into account the replication factor. Find that out by running

hadoop fsck /

This prints, among other things, the default replication factor and the typical replication factor. (It can be overridden for a particular file, it seems.) Divide your remaining space by your default replication factor to see how much new information you can store. Then round down generously – because Hadoop stores files in blocks, and any remainder gets a whole block to itself.
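
As a worked example of that division, here is the arithmetic for the DFS Remaining number in the report above. The replication factor of 3 is my assumption for illustration. Substitute whatever hadoop fsck / reports for your cluster.

```shell
# DFS Remaining, in bytes, from the report above.
dfs_remaining=363997749248

# Assumption for illustration: a default replication factor of 3.
# Use whatever `hadoop fsck /` reports for your cluster.
replication_factor=3

# Every byte is stored replication_factor times, so divide (rounding down).
usable_bytes=$(( dfs_remaining / replication_factor ))
usable_gb=$(( usable_bytes / 1024 / 1024 / 1024 ))

echo "room for roughly $usable_gb GB of new data"
```

Integer division rounds down at each step, which errs on the cautious side.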


The hadoop fs subcommand supports many typical unix filesystem commands, except they have a dash in front of them. For instance, if you’re wondering where your space is going
hadoop fs -du /
will show you the top-level directories inside HDFS and their accumulated sizes. You can then drill down repeatedly into the large directories (with hadoop fs -du ) to find the big fat files that are eating your disk space.
As with any abstraction, try to make friends with the concepts inside HDFS before doing anything interesting with it. Nodes, blocks, replication factors … there’s more to worry about than with a typical filesystem. Great power, great responsibility, and all that.

Logs are like onions

Or, What underlying implementation is using?

Today I want to change the logging configuration of a Clojure program. Where is that configuration located? Changing the obvious resources/ doesn’t seem to change the program’s behavior.

The program uses, but that’s a wrapper around four different underlying implementations. Each of those implementations has its own ideas about configuration. How can I find out which one it uses?

Add a println to your program[1] to output this:


In my case the output is:


This is clojure logging’s first choice of factories. If it can instantiate this, it’ll use it. Now I can google slf4j and find that it… is also a facade on top of multiple logging implementations.
Digging into the slf4j source code reveals this trick:

(class (org.slf4j.LoggerFactory/getILoggerFactory)) 

which prints:


so hey! I am using log4j after all! Now why doesn’t it pick up resources/
Crawling through the log4j 1.2 (slf4j seems to use this version) source code suggests this[2]:

(org.apache.log4j.helpers.Loader/getResource “”)

which gives me


So hey, I finally have a way to trace where logging configuration comes from! 
In the end, my guess of resources/ was correct. I forgot to rebuild the uberjar that I was running. The uberjar found the properties file in itself:


Bet I’d have realized that a few hours earlier if I were pairing today. And then I wouldn’t have made this lovely post.

[1] or run it in the cider REPL in emacs, in your namespace
[2] actually it checks for log4j.xml first; if that’s found it’ll choose the xml file over the .properties.

Repeating commands in bash: per line, per word, and xargs

In bash (the default shell on Mac), today we wanted to execute a command over each line of some output, separately. We wanted to grep (aka search) for some text in a file, then count the characters in each matching line. For future reference…

Perform a command repeatedly, once per line of input:

grep "the" main.log | while read line; do echo $line | wc -c ; done

Here, grep searches the file for lines containing the search phrase, and each line is piped into the while loop, stored each time in variable line. Inside the loop, I used echo to print each line, in order to pipe it as input to word-count. The -c option says “print the number of characters in the input.” Processing each line separately like this prints a series of numbers; each is a count of characters in a line. (here’s my input file in case you want to try it)

      41
      22
      38
      24
      39
      25
      23
      46
      25

That’s good for a histogram, otherwise very boring. Bonus: print the number of characters and then the line, for added context:

grep "the" main.log | while read line; do echo $(echo $line | wc -c) $line ; done

Here, the $(…) construct says “execute this stuff in here and make its output be part of the command line.” My command line starts with another echo, and ends with $line, so that the number of characters becomes just part of the output. 

41 All the breath and the bloom of the year
22 In the bag of one bee
38 All the wonder and wealth of the mine
24 In the heart of one gem
39 In the core of one pearl all the shade
25 And the shine of the sea
23 And how far above them
46 Brightest truth, purest trust in the universe
25 In the kiss of one girl.

This while loop strategy contrasts with other ways of repeating a command at a bash prompt. If I wanted to count the characters in every word, I’d use for.

Perform a command repeatedly, once per whitespace-separated word of input:

for word in $(grep "the" main.log); do echo -n $word | wc -c; done

Here, I named the loop variable word. The $(…) construct executes the grep, and all the lines in main.log containing “the” become input to the for loop. This gets broken up at every whitespace character, so the loop variable is populated with each word. Then each word is printed by echo, and the -n option says “don’t put a newline at the end” (because echo does that by default); the output of echo gets piped into word-count.
This prints a lot of numbers, which are character counts of each word. I can ask, what’s the longest word in the file?

for word in $(grep "the" main.log); do echo $(echo -n $word | wc -c) $word; done | sort -n | tail -1

Here, I’ve used the echo-within-echo trick again to print both the character count and the word. Then I took all the output of the for loop and sent it to sort. This puts it in numeric order, not alphabetical, because I passed it the -n flag. Finally, tail -1 suppresses everything but the 1 last line, which is last in numeric order, where the number is the character count, so I see only the longest word.

9 Brightest

If that’s scary, well, take a moment to appreciate the care modern programming language authors put into usability. Then reflect that this one line integrates six completely separate programs.

These loops, which provide one line of input to each command execution, contrast with executing a command repeatedly with different arguments. For that, it’s xargs.

Perform a command repeatedly, once per line, as an argument

Previously I’ve counted characters in piped input. Word-count can also take a filename as an argument, and then it counts the contents of the file. If what I have are filenames, I can pass them to word-count one at a time.

Count the characters in each of the three smallest files in the current directory, one at a time:

ls -Srp | grep -v '/$' | head -3 | xargs -I WORD wc -c WORD

Here, ls gives me filenames, all the ones in my current directory, including directories, which word-count won’t like. The -p option says “print a / at the end of the name of each directory.” Then grep eliminates the directories from the list, because I told it to exclude (that’s the -v flag) lines that end in slash: in the regular expression '/$', the slash is itself (no special meaning) and $ means “end of the line.” Meanwhile, ls sorts the files by size because I passed it -S. Normally it sorts them biggest-first, but -r says “reverse that order.” Now the smallest files are first. That’s useful because head -3 lets only the first three of those lines through. In my world, the three smallest files are main.log, carrot2.txt, and carrot.txt.
Take those three names, and pipe them to xargs. The purpose of xargs is to take input and stick it on the end of a command line. But -I tells it to repeat the command for each line in the input, separately. And -I also gives xargs (essentially) a loop variable; -I WORD declares WORD as the loop variable, and its value gets substituted in the command.

In effect, this does:
wc -c main.log
wc -c carrot2.txt
wc -c carrot.txt

My output is:

      14 main.log
      98 carrot2.txt
     394 carrot.txt

This style contrasts with using xargs to execute a command once, with all of the piped input as arguments. Word-count can also accept a list of filenames in its arguments, and then it counts the characters in each. The previous task is then simpler:

ls -Srp | grep -v '/$' | head -3 | xargs wc -c

      14 main.log
      98 carrot2.txt
     394 carrot.txt
     506 total

As a bonus, word-count gives us the total characters in all counted files. This is the same as typing
wc -c main.log carrot2.txt carrot.txt

Remember that xargs likes to execute a command once, but you can make it run the thing repeatedly using -I.
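
The two styles are easy to compare side by side by swapping word-count for echo, so that no real files are needed. This is a sketch with made-up input: names stands in for the ls pipeline, and “checking” is just a label.

```shell
# Three made-up names, one per line, standing in for real filenames.
names() {
  printf '%s\n' main.log carrot2.txt carrot.txt
}

# Without -I: xargs runs echo once, with every name as an argument.
once=$(names | xargs echo checking)

# With -I: xargs runs echo once per input line, substituting WORD each time.
per_line=$(names | xargs -I WORD echo checking WORD)

printf '%s\n' "$once" "$per_line"
```

The first command produces a single line and the second produces one line per name.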

This ends today’s edition of Random Unix Tricks. Tonight you can dream about the differences between iterating over lines of input vs words of input vs arguments. And you can wake up knowing that long long ago, in a galaxy still within reach, integration of many small components was fun (and cryptic).