Monday, October 13, 2014

Repeating commands in bash: per line, per word, and xargs

In bash (the default shell on Mac), today we wanted to execute a command over each line of some output, separately. We wanted to grep (aka search) for some text in a file, then count the characters in each matching line. For future reference...

Perform a command repeatedly, once per line of input:

grep "the" main.log | while read line; do echo $line | wc -c ; done

Here, grep searches the file for lines containing the search phrase, and each line is piped into the while loop, stored each time in the variable line. Inside the loop, I used echo to print each line, in order to pipe it as input to word-count (wc). The -c option says "print the number of characters in the input" (strictly, the number of bytes, which is the same thing for plain ASCII text). The count includes the newline that echo adds. Processing each line separately like this prints a series of numbers; each is a count of characters in a line. (here's my input file in case you want to try it)
      41
      22
      38
      24
      39
      25
      23
      46
      25
That's good for a histogram, otherwise very boring. Bonus: print the number of characters and then the line, for added context:

grep "the" main.log | while read line; do echo $(echo $line | wc -c) $line ; done

Here, the $(...) construct says "execute this stuff in here and make its output be part of the command line." My command line starts with another echo, and ends with $line, so that the number of characters becomes just part of the output. 
41 All the breath and the bloom of the year
22 In the bag of one bee
38 All the wonder and wealth of the mine
24 In the heart of one gem
39 In the core of one pearl all the shade
25 And the shine of the sea
23 And how far above them
46 Brightest truth, purest trust in the universe
25 In the kiss of one girl.
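A slightly more defensive version of the per-line loop, as a sketch: IFS= and -r keep read from trimming leading whitespace or interpreting backslashes, and quoting "$line" stops the shell from collapsing runs of spaces. The sample text and the variable names sample and counts are made up for illustration; this is not the main.log from above.

```shell
# Hypothetical sample input standing in for the grep output.
sample=$(printf '%s\n' 'In the bag of one bee' '  two  spaces   kept')

# One iteration per line. printf '%s\n' prints the line plus a newline,
# so each count includes that newline, just as with echo.
# tr strips the padding that some versions of wc put before the number.
counts=$(printf '%s\n' "$sample" | while IFS= read -r line; do
  printf '%s\n' "$line" | wc -c | tr -d ' '
done)

echo "$counts"
```

With the unquoted echo $line from the original command, the second sample line would be squeezed down to single spaces before being counted.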
This while loop strategy contrasts with other ways of repeating a command at a bash prompt. If I wanted to count the characters in every word, I'd use for.

Perform a command repeatedly, once per whitespace-separated word of input:

for word in $(grep "the" main.log); do echo -n $word | wc -c; done

Here, I named the loop variable word. The $(...) construct executes the grep, and all the lines in main.log containing "the" become input to the for loop. This gets broken up at every whitespace character, so the loop variable is populated with each word. Then each word is printed by echo, and the -n option says "don't put a newline at the end" (because echo does that by default); the output of echo gets piped into word-count.
This prints a lot of numbers, which are character counts of each word. I can ask, what's the longest word in the file?

for word in $(grep "the" main.log); do echo $(echo -n $word | wc -c) $word; done | sort -n | tail -1

Here, I've used the echo-within-echo trick again to print both the character count and the word. Then I took all the output of the for loop and sent it to sort. This puts it in numeric order, not alphabetical, because I passed it the -n flag. Finally, tail -1 suppresses everything but the last line. That line is last in numeric order, and the number is the character count, so I see only the longest word.

9 Brightest

If that's scary, well, take a moment to appreciate the care modern programming language authors put into usability. Then reflect that this one line integrates six completely separate programs.
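For comparison, a sketch that answers the same longest-word question with fewer processes: the ${#word} expansion gives the length of a variable directly, so echo -n and wc drop out of the loop. The input is one hard-coded line of the poem instead of the grep output, and text and longest are names introduced for the example. tail -n 1 is the same as tail -1.

```shell
# Hypothetical input: one line of the poem, in place of $(grep "the" main.log).
text='Brightest truth, purest trust in the universe'

# $text is deliberately left unquoted so the shell splits it into words.
# ${#word} expands to the number of characters in $word.
longest=$(for word in $text; do
  echo "${#word} $word"
done | sort -n | tail -n 1)

echo "$longest"
```

Punctuation counts as part of a word here, exactly as in the original loop, because the shell only splits on whitespace.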

These loops, which provide one line of input to each command execution, contrast with executing a command repeatedly with different arguments. For that, it's xargs.

Perform a command repeatedly, once per line, as an argument:

Previously I've counted characters in piped input. Word-count can also take a filename as an argument, and then it counts the contents of the file. If what I have are filenames, I can pass them to word-count one at a time.

Count the characters in each of the three smallest files in the current directory, one at a time:

ls -Srp | grep -v '/$' | head -3 | xargs -I WORD wc -c WORD

Here, ls gives me filenames, all the ones in my current directory -- including directories, which word-count won't like. The -p option says "print a / at the end of the name of each directory." Then grep eliminates the directories from the list, because I told it to exclude (that's the -v flag) lines that end in slash: in the regular expression '/$', the slash is itself (no special meaning) and $ means "end of the line." Meanwhile, ls sorts the entries by size because I passed it -S. Normally it sorts them biggest-first, but -r says "reverse that order." Now the smallest files are first. That's useful because head -3 lets only the first three of those lines through. In my world, the three smallest files are main.log, carrot2.txt, and carrot.txt.
Take those three names, and pipe them to xargs. The purpose of xargs is to take input and stick it on the end of a command line. But -I tells it to repeat the command for each line in the input, separately. And -I also gives xargs (essentially) a loop variable; -I WORD declares WORD as the loop variable, and its value gets substituted in the command.

In effect, this does:
wc -c main.log
wc -c carrot2.txt
wc -c carrot.txt

My output is:
      14 main.log
      98 carrot2.txt
     394 carrot.txt
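To try the -I form without depending on what happens to be in the current directory, here is a sketch that builds three throwaway files in a scratch directory first. The filenames, their contents, and the placeholder NAME are all made up for the example; mktemp -d is assumed to be available, as it is on Mac and Linux.

```shell
# Scratch directory with three made-up files, so nothing real is touched.
dir=$(mktemp -d)
printf 'a\n'      > "$dir/one.txt"
printf 'bb\nbb\n' > "$dir/two.txt"
printf 'ccc\n'    > "$dir/three.txt"

# -I NAME runs wc once per input line, substituting the line for NAME.
per_file=$(cd "$dir" && printf '%s\n' one.txt two.txt three.txt | xargs -I NAME wc -c NAME)

echo "$per_file"
rm -r "$dir"
```

Because wc runs three separate times, each run sees only one file and there is no "total" line at the end.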
This style contrasts with using xargs to execute a command once, with all of the piped input as arguments. Word-count can also accept a list of filenames in its arguments, and then it counts the characters in each. The previous task is then simpler:

ls -Srp | grep -v '/$' | head -3 | xargs wc -c

      14 main.log
      98 carrot2.txt
     394 carrot.txt
     506 total

As a bonus, word-count gives us the total characters in all counted files. This is the same as typing
wc -c main.log carrot2.txt carrot.txt

Remember that xargs likes to execute a command once, but you can make it run the thing repeatedly using -I.
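The difference is easy to see with echo as the command, since echo just prints whatever arguments it was handed. A small sketch, where the input a, b, c and the placeholder ITEM are arbitrary choices for the example:

```shell
# Without -I: echo runs once, with all three items appended as arguments.
once=$(printf '%s\n' a b c | xargs echo)

# With -I: echo runs once per input line, with ITEM replaced by that line.
each=$(printf '%s\n' a b c | xargs -I ITEM echo ITEM)

echo "$once"   # all three items on a single line
echo "$each"   # one item per line
```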

This ends today's edition of Random Unix Tricks. Tonight you can dream about the differences between iterating over lines of input vs words of input vs arguments. And you can wake up knowing that long long ago, in a galaxy still within reach, integration of many small components was fun (and cryptic).
