Functors: What the funk for?

For all the programmers who don’t deeply grok the lambda calculus terminology —

Say you are about to call a method on a container, and that container can give you something back of type Tweet. What you really want isn’t a Tweet, but some part of it, say Tweet.getId(). What if, instead of getting the Tweet back from the container and then calling getId() on it, you prefer to tell the container to get the id for you, and return only what you wanted? Then you need a functor.

So you make a quick function literal (tweet => tweet.getId()) that pulls the id out of a Tweet, and pass that in to the functor. The output is the same kind of container, only it holds IDs instead of Tweets. A functor isn’t a special type; it’s a function with a particular purpose.

Why would you want to do this? I used to think that too. Then…

I have an Iterable tweets, and I want to get the ID of the first element. In this example, I’m using Java with the Guava library. The function to call is Iterables.getFirst, which requires a default value in case the iterable is empty. As it happens, I have a default value for the ID, but no Tweet that contains this default ID. Short of constructing a stupid dummy Tweet that holds my default ID, I’m stuck with:

Tweet firstTweet = Iterables.getFirst(tweets, null) 
return firstTweet == null ? defaultId : firstTweet.getId() 

I have to create an identifier and then check stuff on it. This is not declarative enough or simple enough for my taste. I’d rather do this (using Java 8 lambda syntax for brevity):

return Iterables.getFirst(tweets, t => t.getId(), defaultId)

In this hypothetical example, I’m telling that getFirst function to run my function on the tweet before returning. If tweets is empty, then the function can return defaultId. So I’m supplying a default that is meaningful to me, the function is going to return the same type whether there’s a tweet or not, and everyone is happy.

But, that method doesn’t exist. And it would be a pain to an optional function argument to the interface of all the functions in Iterables. Thinking in terms of functors, I can solve this problem with functions that do exist in Iterables:

Iterables.getFirst(Iterables.transform(tweets, t => t.getId()), defaultId)

Here, transform is a functor: it applies the supplied function to the elements inside the iterable, returning an iterable of the function-output type. This means a tweet gets turned into a tweetId before getFirst sees it.

To a long-time Java dev like me, this looks inefficient. Do work on every tweet in tweets just to get the first one out differently? Ahhh, but Guava iterables are lazy! That function is not applied until somebody calls next() on the iterable returned by transform(tweetst => t.getId()). Iterables.getFirst calls next() at most once, so only the first tweet is transformed. Therefore, I have exactly what I wanted: turn that first tweet (if there is one) into an id before giving it back to me. The type of defaultId matches the element type of transform(tweetst => t.getId()).

The lesson of functors is: you don’t have to take something out of a container in order to operate on it.

In OO design, there’s a principle of “Tell, don’t ask.” Classes are supposed to tell each other what to do, rather than asking for internal data. This is an example of that — the Iterable has tweets, and I want the ID of the first one. Using the functor, I tell it “give me the result of this function applied to your tweet.” This is better encapsulation than pulling out the whole tweet and operating on it.

In this example, a functor is a function that does a transformation on data inside a context. In this example the context is an Iterable and its contents are Tweets. The transformation is a functor from Tweet -> ID. This way, an Iterable can give me back an ID, exactly what I wanted in the first place, without me ever having to see its Tweet.

Look! A functional trick can make my Java more OO than ever.


Caveat: there are many definitions for Functor out there, and different types of functors. This is one.

F# made Java easier today

Today I had a sticky problem to solve, the kind business app programmers drool over because it’s nontrivial and contained. Halfway through the TDD process, my logic got all jumbled and I couldn’t see the solution. Stepping back, I took ten minutes to write pseudocode in an F# style, then translated that into Java. Bam! green all over. Turns out this problem is more suited to recursion and pattern matching than loops and variables. The same style works just fine in Java, but thinking in F# brought it to light.

As an illustrative example, here’s the code in both languages.

The problem:
Money comes in through deposits. Money goes out through dispensations. Take an ordered list of each and pair them up into Utilizations, which identify which deposits were used to satisfy which dispensations. If any dispensation cannot be satisfied with the provided deposits, do not satisfy the dispensation at all.

The F#:

type Deposit = { amount : int }
type Dispensation { amount : int }


type Utilization = 
  | DepUtil of Deposit * Dispensation * int


let rec utilizeInternal outputSoFar currentDispOutput (deposits:Deposit list) (dispensations:Dispensation list) depUsed dispUsed : Utilization list = 
 match (deposits, dispensations) with
  | [] , _ -> outputSoFar
  | _ , [] -> outputSoFar
  | dep :: depTail, disp:: dispTail ->
     match(dep.amount – depUsed, disp.amount – dispUsed) with 
       | (depAmt, dispAmt) when depAmt = dispAmt ->
           utilizeInternal (DepUtil(dep, disp, depAmt) :: currentDispOutput @ outputSoFar) [] depTail dispTail 0 0
       | (depAmt, dispAmt) when depAmt  
           utilizeInternal outputSoFar (DepUtil(dep, disp, depAmt) ::  currentDispOutput) depTail (disp :: dispTail) 0 (dispUsed + depAmt)
       | (depAmt, dispAmt) when dispAmt  
           utilizeInternal( DepUtil(dep, disp, dispAmt) :: currentDispOutput @ outputSoFar) [] (dep :: depTail) dispTail (depUsed + dispAmt) 0


let utilize deposits dispensations = utilizeInternal [] [] deposits dispensations 0 0

The key portion, the recursive function definition, works out to nine long lines. It divides the problem up neatly into: the deposit and dispensation match exactly; the deposit is more than the dispensation; the dispensation is more than the deposit. Those are the three non-trivial cases.

Translating this into Java simply puts these cases into if-then statements.

The Java:

public class DepositUtilizer {
    public static class Utilization {
        public final Deposit deposit;
        public final Dispensation dispensation;
        public final long amountInPennies;
        public Utilization(Deposit deposit, Dispensation dispensation, long amountInPennies) {
            this.deposit = deposit;
            this.dispensation = dispensation;
            this.amountInPennies = amountInPennies;
        }
    }
    public Iterable utilize(Iterable deposits, Iterable dispensations) {
        return utilizeInternal(new ArrayList(), new ArrayList(),
                peekingIterator(deposits.iterator()), peekingIterator(dispensations.iterator()), 0l, 0l);
    }
    private Iterable utilizeInternal(Iterable outputSoFar,
            List currentDispensationOutput, PeekingIterator deposits,
            PeekingIterator dispensations, long amountUsedSoFarInCurrentDeposit,
            long amountUsedSoFarInCurrentDispensation) {
        if (!deposits.hasNext() || !dispensations.hasNext()) {
            return outputSoFar;
        }
        long dispensationAmount = dispensations.peek().getAmountForJPA() – amountUsedSoFarInCurrentDispensation;
        long depositAmount = deposits.peek().getUnutilizedAmount().getPennies() – amountUsedSoFarInCurrentDeposit;
        if (depositAmount == dispensationAmount) {
            // great, use them both up
            Deposit usedDeposit = deposits.next();
            Dispensation usedDispensation = dispensations.next();
            currentDispensationOutput.add(new Utilization(usedDeposit, usedDispensation, depositAmount));
            return utilizeInternal(concat(outputSoFar, currentDispensationOutput), new ArrayList(), deposits,
                    dispensations, 0l, 0l);
        }
        if (depositAmount < dispensationAmount) {
            // use all of the deposit
            Deposit usedDeposit = deposits.next();
            currentDispensationOutput.add(new Utilization(usedDeposit, dispensations.peek(), depositAmount));
            return utilizeInternal(outputSoFar, currentDispensationOutput, deposits, dispensations, 0l,
                    amountUsedSoFarInCurrentDispensation + depositAmount);
        }
        if (dispensationAmount < depositAmount) {
            // use all of the dispensation
            Dispensation usedDispensation = dispensations.next();
            currentDispensationOutput.add(new Utilization(deposits.peek(), usedDispensation, dispensationAmount));
            return utilizeInternal(concat(outputSoFar, currentDispensationOutput), new ArrayList(), deposits,
                    dispensations, amountUsedSoFarInCurrentDeposit + dispensationAmount, 0l);
        }
        throw new IllegalStateException(“The sky is falling!”);
    }
}

The Java is twice as long as the F#, sure, but it works beautifully. What matters is that knowing a different language, and thinking about the problem in terms of the better-suited language, led me to a good solution in the language of my day job.

Functional programming! It’s not just for math geeks!

Keeping Your Cache Down to Size

Guava’s CacheBuilder provides several options for keeping a cache down to a reasonable size. Entries can expire for multiple reasons, entries can be limited to a certain number, and entries can be made available for garbage collection. These options can be used individually or together. Each is simply one more method call during cache creation.

When should we use each alternative? It depends on our objective.

General Memory Conservation

If the objective is to avoid running out of memory JVM-wide, then the garbage collector is the right guy for the job. There are three ways to make our entries available for garbage collection. One of them is right, and two are wrong.

Soft Values: this is the right way.

CacheBuilder.newBuilder().softValues().build(cacheLoader)

This sets up a cache which wraps any value it stores in a SoftReference. If no other object holds a strong reference to the value, the garbage collector has the option of freeing this memory. As Brian Goetz put it, “soft references allow the application to enlist the aid of the garbage collector by designating some objects as ‘expendable.'” The key here is that the garbage collector can choose which (if any) soft-referenced objects to collect. The JVM can choose the least recently or least frequently used objects, and it can perform this ranking across all soft-referenced objects, not only the ones in your cache. This makes soft references the optimal choice for overall memory conservation.

When a value is collected, that entry is invalidated in the cache. It may still show up in the total cache.size(), but when cache.get(key) is called, a new value will be loaded and a new entry created.

Now, take my advice: if your goal is to prevent the cache from using all your memory, use softValues() and skip the rest of this blog post. If you are also concerned about frequency of garbage collection, consider maximumSize(), discussed later.

There’s one small caution with soft values: it causes values to be compared with == instead of .equals(). This only impacts the rare person who conditionally removes entries from the cache like this:

cache.asMap().remove(key, value); 

Weak Values: If being in the cache is no reason to keep an object around, you can request weakValues(). This wraps the values in a WeakReference. This means that the next round of garbage collection will sweep up any entries whose value is not referenced elsewhere. These entries get invalidated even if memory is not in short supply. That’s why soft references are better.

The same == comparison caution applies with weakValues as with softValues.

Weak Keys: This means that any entry whose key is no longer referenced elsewhere should be garbage collected and its entry evicted from the cache. If your keys are transient objects expected to fall out of use after a period of time, never to be used again, then weak keys could make sense. I’m not sure why you would base a cache on something like that.

There’s another caveat to weakKeys() — it changes the way the Cache retrieves values. Weak keys are compared with == instead of .equals(). That could have a huge impact on how your map works. When == is the comparison, you can never construct a key; you need a reference to the same key object. With the cache, if you mess this up, you won’t get an error. Instead, the value will be loaded again and a new entry created. The result could be far more entries in the cache than if you hadn’t used weakKeys()!

Note that if your keys are Strings, this == comparison can be even more insidious. Identical hard-coded Strings (like those in your tests) will all reference the same object, and == will work. Strings parsed from input text or files will not be the same reference as any other String, so == will fail and more entries will be loaded into the Cache. What’s more, any hard-coded String will be allocated in PermGen space and not subject to regular garbage collection, so why wrap it in a weak reference? Just don’t do it, people!

Unless your values are smaller than your keys, there’s one more reason to prefer softValues() over weakKeys(). When a key is garbage collected, its entry is invalidated in the cache. However, the entry isn’t fully cleared until the cache does its cleanup activities. That won’t happen until (at the earliest) some call or other is made to the cache. This could delay freeing of the larger memory consumed by the value. Of course, if you’re enough of a memory miser to try using weakKeys(), you probably used weakValues() too, so the relative size of keys and values is not a concern.

Restrict Memory Consumption by the Cache

If your goal is to keep this particular cache down to size, then limiting the maximum size of the cache could make sense. Soft values will prevent the cache from running you out of memory, but it will let your cache take up as much memory as you have, which means more frequent garbage collection and slower overall performance. It is therefore useful to limit the overall size of the cache.

CacheBuilder.newBuilder().maximumSize(100).build(cacheLoader)

That’s all it takes. Your cache.size() will never exceed 100. Entries may be evicted from the cache before the size limit is reached; the cache will start clearing out old, less-used entries as soon as it gets close to the limit.

Get Old Data Out

If your values are perishable, then expire them. Expiration occurs with a fixed time limit, either from last write (when the value was loaded) or last access (when get(key) was called). For instance, if your reference data changes rarely but not never, you might want to fetch it every hour or so.

CacheBuilder.newBuilder().expireAfterWrite(1, TimeUnit.HOURS).build(cacheLoader)

I’m not sure when you would prefer expireAfterAccess. Perhaps your data only changes when you’re not looking?
Remember that if your objective is to keep less-recently-used objects from taking up space, softValues() is superior.

To programmatically expire objects from the cache, call cache.invalidate(key) or cache.invalidateAll(). This will trigger a new load at the next cache.get(key).

When values expire or are invalidated, it doesn’t mean they’re immediately removed. It means that any cache.get(key) will result in a miss and a reload. However, cache.size() might show those entries hanging around. The cache will remove these entries thoroughly during periodic maintenance, which can occur at any cache access. To trigger the maintenance explicitly, call cache.cleanUp().

There are a few useful diagnostic tools to evaluate performance of your cache. This is useful for tuning your maximum size and expiration times. Check out Cache.stats() and removalListener if you want to test or optimize.

No Really, Just Use softValues()

CacheBuilder gives you all kinds of ways to keep your cache under control. It is smarter than your average Map.

Goodbye, MapMaker. Hello, CacheBuilder.

Google has released a new version of Guava, and it’s bad news for one of my favorite classes, MapMaker. The exciting feature of MapMaker is its ability to produce an insta-cache, a key-value mapping that calculates values on the fly, expires them, and limits the quantity stored — all while maintaining thread safety.

In release ten, Guava added CacheBuilder. CacheBuilder produces a Cache: a key-value mapping that calculates values on the fly, expires them, and limits the quantity stored — all while maintaining thread safety.

Wait, that sounds exactly like MapMaker. Poor MapMaker — its methods are now deprecated. The king is dead, long live the king!

Here’s a quick look at what CacheBuilder does:


final Cache cache =
CacheBuilder.newBuilder()
.expireAfterWrite(10, TimeUnit.MINUTES)
.maximumSize(50)
.build(CacheLoader.fromFunction(retrieveInterestingInfoByKey);

Here, retrieveInterestingInfoByKey is a Function that goes to the database, calls a service, or performs some calculation to get the value for a key.

What does this cache do?

The first time cache.get(key) is called, the cacheLoader loads the value, stores the entry in its map, then returns the value. If two threads make this call at the same time, one of them loads the value, and the other one blocks until the value is available. Values are never loaded more times than necessary.

If cache.get(key) is called more than ten minutes after that entry was created, the value will be loaded again. When the size of the cache gets close to the maximum, then the older, less-used entries are evicted.

What is this cache good for? Reference data, certainly. Any time you have data that’s loaded over and over again, but doesn’t change nearly as often as it’s loaded. The whole process can share a single cache. It’s a good candidate for a singleton Spring bean.

What is this cache not good for? Anything you want to cache across processes or machines. Anything you want to update. (You can evict entries from the cache, but not put them in.) Anything you want persisted.

I could write whole posts about exception handling in the load, testing and tuning hooks, and memory use optimization of the Cache. But those are decorations on the cake. The real point of CacheBuilder is how quickly and easily you can get a map that gets you the information you want, when you want it, without hitting the database every single time.

MapMaker, your name was fun to say, but we won’t miss you now that we have CacheBuilder.

Guava: pidgin functional programming in Java

Google’s guava library provides a few constructs that let us use functional style in Java. Personally, I enjoy the slightly more declarative style that results, and have a new game of eliminating loops. Unfortunately, it’s Java. Attempting to do anything functionally is rather ugly. What do you think – worth it? or unreadable?

Before:


public List getUsefulThings() {
List usefulThings = newArrayList(); // this is also a use of Guava
for(Thing t : listOfThings) {
if (t.isUseful()) {
usefulThings.add(t);
}
}
return usefulThings;
}

After:


private static final Predicate THING_IS_USEFUL_PREDICATE = new Predicate() {
public boolean apply(Thing input) {
return input.isUseful();
}
};

public Iterable getUsefulThings() {
return filter(listOfThings, THING_IS_USEFUL_PREDICATE);
}

The goal is to make the code in getUsefulThings() state what the method is doing, rather than how it goes about it. “Look through this list and if something is useful, put it in the output” is a “how.” “Filter this list on this predicate” is a “what.”

I like to put the predicate in a static variable largely because it’s hideous (thanks to Java syntax). The name of the predicate is more readable than the predicate itself.

There is one additional subtle advantage to the Guava method of returning useful things: the filtering doesn’t happen until someone uses the output. The Iterable returned is lazy. If the caller only wants the first usefulThing:


{
Thing usefulThing = getUsefulThings().iterator().next();
doSomething(usefulThing);
}

… then with the Guava solution, the predicate checked against items in listOfThings only until the first useful Thing is found. The rest of listOfThings won’t be processed at all. Lazy evaluation in Java!

Then again, this lazy evaluation can have an even more subtle effect: if the contents of listOfThings are modified after the call to getUsefulThings() but before that Iterable output is used, the Iterable will reflect the updated listOfThings. This goes to prove: mutability is evil.