Repeating commands in bash: per line, per word, and xargs

In bash (the default shell on Mac), today we wanted to execute a command over each line of some output, separately. We wanted to grep (aka search) for some text in a file, then count the characters in each matching line. For future reference…

Perform a command repeatedly, once per line of input:

grep "the" main.log | while read line; do echo $line | wc -c ; done

Here, grep searches the file for lines containing the search phrase, and each line is piped into the while loop, stored each time in variable line. Inside the loop, I used echo to print each line, in order to pipe it as input to word-count. The -c option says “print the number of characters in the input.” Processing each line separately like this prints a series of numbers; each is a count of characters in a line. (here’s my input file in case you want to try it)

      41
      22
      38
      24
      39
      25
      23
      46
      25

That’s good for a histogram, otherwise very boring. Bonus: print the number of characters and then the line, for added context:

grep "the" main.log | while read line; do echo $(echo $line | wc -c) $line ; done

Here, the $(…) construct says “execute this stuff in here and make its output be part of the command line.” My command line starts with another echo, and ends with $line, so that the number of characters becomes just part of the output. 

41 All the breath and the bloom of the year
22 In the bag of one bee
38 All the wonder and wealth of the mine
24 In the heart of one gem
39 In the core of one pearl all the shade
25 And the shine of the sea
23 And how far above them
46 Brightest truth, purest trust in the universe
25 In the kiss of one girl.
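By the way, the $(…) construct isn’t specific to this trick: command substitution works anywhere on a command line. A trivial standalone example, safe to run in any directory:

echo this directory contains $(ls | wc -l) entries   # the inner command runs first; its output becomes arguments to echo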

This while loop strategy contrasts with other ways of repeating a command at a bash prompt. If I wanted to count the characters in every word, I’d use for.

Perform a command repeatedly, once per whitespace-separated word of input:

for word in $(grep "the" main.log); do echo -n $word | wc -c; done

Here, I named the loop variable word. The $(…) construct executes the grep, and all the lines in main.log containing “the” become input to the for loop. That input gets broken up at every whitespace character, so the loop variable is populated with each word. Then each word is printed by echo, and the -n option says “don’t put a newline at the end” (because echo does that by default); the output of echo gets piped into word-count.
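That -n matters to the counts. A quick check, safe to paste at any prompt:

echo hi | wc -c      # prints 3: two characters plus the trailing newline
echo -n hi | wc -c   # prints 2: just the characters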
The loop prints a lot of numbers, which are character counts of each word. I can ask: what’s the longest word in the file?

for word in $(grep "the" main.log); do echo $(echo -n $word | wc -c) $word; done | sort -n | tail -1

Here, I’ve used the echo-within-echo trick again to print both the character count and the word. Then I took all the output of the for loop and sent it to sort. This puts it in numeric order, not alphabetical, because I passed it the -n flag. Finally, tail -1 suppresses everything but the last line, which is last in numeric order; since the number is the character count, I see only the longest word.

9 Brightest
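Why the -n flag on sort? Plain sort compares lines as text, so 10 comes before 9. A two-line demonstration:

printf '9\n10\n' | sort     # prints 10 first: text order
printf '9\n10\n' | sort -n  # prints 9 first: numeric order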

If that’s scary, well, take a moment to appreciate the care modern programming language authors put into usability. Then reflect that this one line integrates six completely separate programs.

These loops, which provide one line of input to each command execution, contrast with executing a command repeatedly with different arguments. For that, it’s xargs.

Perform a command repeatedly, once per line, as an argument:

Previously I’ve counted characters in piped input. Word-count can also take a filename as an argument, and then it counts the contents of the file. If what I have are filenames, I can pass them to word-count one at a time.

Count the characters in each of the three smallest files in the current directory, one at a time:

ls -Srp | grep -v '/$' | head -3 | xargs -I WORD wc -c WORD

Here, ls gives me filenames, all the ones in my current directory — including directories, which word-count won’t like. The -p option says “print a / at the end of the name of each directory.” Then grep eliminates the directories from the list, because I told it to exclude (that’s the -v flag) lines that end in slash: in the regular expression '/$', the slash is itself (no special meaning) and $ means “end of the line.” Meanwhile, ls sorts the files by size because I passed it -S. Normally it sorts them biggest-first, but -r says “reverse that order.” Now the smallest files are first. That’s useful because head -3 lets only the first three of those lines through. In my world, the three smallest files are main.log, carrot2.txt, and carrot.txt.
Take those three names and pipe them to xargs. The purpose of xargs is to take input and stick it on the end of a command line. But -I tells it to repeat the command for each line of input, separately. And -I also gives xargs (essentially) a loop variable: -I WORD declares WORD as the loop variable, and its value gets substituted into the command.

In effect, this does:
wc -c main.log
wc -c carrot2.txt
wc -c carrot.txt

My output is:

      14 main.log
      98 carrot2.txt
     394 carrot.txt

This style contrasts with using xargs to execute a command once, with all of the piped input as arguments. Word-count can also accept a list of filenames in its arguments, and then it counts the characters in each. The previous task is then simpler:

ls -Srp | grep -v '/$' | head -3 | xargs wc -c

      14 main.log
      98 carrot2.txt
     394 carrot.txt
     506 total

As a bonus, word-count gives us the total characters in all counted files. This is the same as typing
wc -c main.log carrot2.txt carrot.txt

Remember that xargs likes to execute a command once, but you can make it run the thing repeatedly using -I.
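To see that difference in miniature, feed xargs the same two lines both ways (safe to try anywhere):

printf 'one\ntwo\n' | xargs echo         # one execution: echo one two
printf 'one\ntwo\n' | xargs -I X echo X  # two executions: echo one, then echo two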

This ends today’s edition of Random Unix Tricks. Tonight you can dream about the differences between iterating over lines of input vs words of input vs arguments. And you can wake up knowing that long long ago, in a galaxy still within reach, integration of many small components was fun (and cryptic).

Sniffing traffic between my app and local CouchDB

This turns out to be a lot harder than watching browser traffic.
Forgive the crudity of this post; I’m on a tablet. Gotta get this info out before I forget, because this problem was a sticky one. Couch returned a 400 Bad Request with the error message “invalid json” and I needed to see the JSON it was receiving.
To watch the HTTP traffic on my local machine (a Mac) moving between my application and CouchDB on port 5984:

  • download and install Wireshark
  • capture traffic on the loopback interface, lo0
  • set up a capture filter for only this port: tcp port 5984 (not entirely sure I got this working, and the capture filter is not strictly necessary)
  • enable HTTP packet reconstruction on this port: Edit, Preferences, Protocols, HTTP, add 5984 to the list of TCP ports (this is the critical secret step that makes it all work!)
  • run the failing test, to cause the traffic I’m trying to capture (one way to trigger traffic by hand: see the curl sketch after this list)
  • enter a display filter: http.response.code == 400
  • find the 400 response, click on it, then clear the display filter
  • now I can look a few rows up the list to see the POST that triggered the error. The body is included in the breakdown. (Or it would be, if it had one. In my case the POST has no body; no wonder Couch complained.)
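If you’d rather trigger the traffic by hand than rerun a whole test, any HTTP client pointed at the Couch port will show up in the capture. A hypothetical curl example (the database name mydb and the document body are made up; substitute your own):

curl -X POST http://localhost:5984/mydb \
     -H 'Content-Type: application/json' \
     -d '{"hello": "world"}'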

Useful tips:

  • change the coloring rules (the pretty button at the top) to move the “Checksum Errors” rule down to the bottom of the list (this rule is triggered by all of these local packets because of checksum offloading). Then create a coloring rule to turn http.response.code == 400 an obnoxious color. This makes it much easier to see what’s going on.
  • In this X11 window on a Mac, the Command key doesn’t behave normally. Use Ctrl instead.

Services on a Mac

Say there’s a program that needs to run on your Mac all the time forever and ever. This post describes how to set that up. The example here sets up the script that runs elasticsearch.*

On a Mac, services are controlled by launchd, which is configured using launchctl. This example uses launchctl to set up a service that starts as soon as it’s configured, starts up automatically at system startup, and gets restarted every time the job dies.

1) Create a configuration (plist) file for it. Name the file like: com.yourorganization.whateverpackage.jobtitle.plist. In the example, mine is org.elasticsearch.test.plist


<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN"
 "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
    <dict>
        <key>Label</key>
        <string>org.elasticsearch.test</string>
        <key>ProgramArguments</key>
        <array>
            <string>/Volumes/Projects/elasticsearch/bin/elasticsearch</string>
            <string>-f</string>
        </array>
        <key>WorkingDirectory</key>
        <string>/Volumes/Projects/elasticsearch</string>
        <key>StandardErrorPath</key>
        <string>logs/service-err.log</string>
        <key>StandardOutPath</key>
        <string>logs/service-out.log</string>
        <key>OnDemand</key>
        <false/>
        <key>KeepAlive</key>
        <true/>
        <key>RunAtLoad</key>
        <true/>
    </dict>
</plist>

Some important bits explained:
org.elasticsearch.test – the Label is the job’s name. After the job is loaded into launchctl, this is how you can refer to it for start, stop, list.
/Volumes/Projects/elasticsearch/bin/elasticsearch – the ProgramArguments array contains the name of the program you want to run, and then any arguments to pass to it. (There’s an alternative that puts the program name in a separate Program key, but this way is easy: it keeps the program right next to its arguments.)
-f – this argument is specific to elasticsearch, but its meaning is important. Your program or script should not fork and then exit; it should run everything in the foreground and never exit. launchd’s job is to keep your program running, so your program needs to keep running.
/Volumes/Projects/elasticsearch – launchd will cd to the WorkingDirectory before running your program. Your script should not cd.
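If what you’re running is a script of your own, a minimal launchd-friendly wrapper might look like the sketch below. The path and flag are hypothetical placeholders; the point is exec and staying in the foreground:

#!/bin/bash
# No cd here: launchd already set the WorkingDirectory.
# exec replaces this shell with the real program, so launchd tracks
# the program's PID directly and the job never forks-and-exits.
exec /path/to/your/program --foreground   # hypothetical program and flag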

2) load your plist configuration:

  • copy your plist file to /Library/LaunchDaemons
  • Load it: launchctl load /Library/LaunchDaemons/org.elasticsearch.test.plist (your plist filename)
    • note that whatever user you are when you run this launchctl load will, by default, be the user the job runs as. If you don’t like that, you can configure the UserName in your plist (man launchd.plist for more info). If you sudo launchctl load, your job will run as root.
  • Check it: launchctl list org.elasticsearch.test (your job label)
    • if you just do launchctl list, it’ll list all the jobs set up by your user. sudo launchctl list to see everything. Grep for yours.
    • listing the specific job gives you some status information (there’s a sample after this list). Look for a PID – that means your script is running.
    • if you see no PID and a LastExitStatus of 256, then your job might be forking and exiting. Don’t do that.
  • Now ps -ef | grep for your program, and see that it’s running.
    • Kill it. See whether it gets started again with no action on your part. It should.
    • Check your program’s logs.
  • If you need your program to NOT run constantly, you’ll need to unload it.
    • launchctl unload /Library/LaunchDaemons/org.elasticsearch.test.plist
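For reference, here’s the shape of launchctl list output for a single job. The exact fields vary by OS version, and the values below are made up for illustration:

{
	"Label" = "org.elasticsearch.test";
	"LastExitStatus" = 0;
	"PID" = 4242;
};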

Great! That’s the whole thing. For way more options, check the man pages. Also, here’s a poorly organized but nonetheless useful reference page.
* there is a service wrapper for elasticsearch, but it didn’t work for us. It doesn’t set the job up to run continually.