Isolate Failure Domains with PFWF

Failure is not an option: failure is inevitable. As our software systems grow, it gets harder to deny this.

One technique that helps us cope is to isolate failure domains.[1] This means: for each component, understand the scope of impact when it fails, and limit that impact to a defined portion of the system, its failure domain. The concept applies in Erlang’s actor model, and really in every other system too. Much of the power of the internet comes from this: no matter what I change on my personal site, I can’t break any other web application. I can’t screw up anyone’s bank account, anyone’s Twitter stream, or anything else important. Failure is isolated to my site.[2]
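To make that concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: `run_isolated` and the three toy components are hypothetical names, and a try/except around each call is only one simple way to draw a boundary (separate processes or machines draw a much stronger one). The point is just that one component blowing up gets recorded and contained while the others carry on.

```python
from typing import Callable, Dict


def run_isolated(components: Dict[str, Callable[[], str]]) -> Dict[str, str]:
    """Run each component inside its own failure domain.

    Hypothetical helper: an exception raised by one component is
    recorded against that component only, and the rest still run.
    """
    results: Dict[str, str] = {}
    for name, component in components.items():
        try:
            results[name] = component()
        except Exception as exc:
            # The failure stops here; it does not propagate to siblings.
            results[name] = f"failed: {exc}"
    return results


# Toy components (illustrative assumptions, not real services).
def my_site() -> str:
    raise RuntimeError("bad deploy")


def bank() -> str:
    return "ok"


def twitter_stream() -> str:
    return "ok"


results = run_isolated(
    {"my_site": my_site, "bank": bank, "twitter_stream": twitter_stream}
)
print(results)
```

My site fails, and the bank and the stream never notice.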

Here’s a real-life example of failure domain isolation. It’s a useful policy for life: Poop, flush, wipe, flush.[3]

See, when one takes a big dump, and then wipes with lots of toilet paper, it’s the TP that’s most likely to clog the toilet. But it’s the poop that makes the clogged toilet extremely gross. Flushing after each step isolates the failure domains: the TP-flush is one failure domain, the poop-flush another. So the consequences of the common failure (TP clogging) are isolated from the more severe consequences of another failure (poop in clogged toilet).
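For those who prefer their analogies executable, here is a toy model of the same argument. It assumes, purely for illustration, that any flush containing TP clogs and that a clog leaves everything in the bowl; `flush`, `pwf`, and `pfwf` are hypothetical names. Both policies end in a clog under that assumption, and the difference is what you are left looking at.

```python
from typing import List, Tuple


def flush(bowl: List[str]) -> Tuple[bool, List[str]]:
    """Return (clogged, what is left in the bowl).

    Illustrative assumption: TP always clogs, and a clog leaves
    the whole contents of the bowl behind.
    """
    if "tp" in bowl:
        return True, list(bowl)
    return False, []


def pwf() -> Tuple[bool, List[str]]:
    """Poop, wipe, flush: one flush, one shared failure domain."""
    return flush(["poop", "tp"])


def pfwf() -> Tuple[bool, List[str]]:
    """Poop, flush, wipe, flush: each step gets its own flush."""
    _, bowl = flush(["poop"])      # first flush clears the poop
    return flush(bowl + ["tp"])    # second flush carries only TP


print(pwf())   # clogged, with poop and TP in the bowl
print(pfwf())  # clogged, with only TP in the bowl
```

The common failure still happens in both cases, but in the second its consequences are much easier to live with.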

Sometimes the grossest analogies are the most effective.

[1] Michael Nygard, Architecture Without an End State 

[2] Mary Poppendieck, The New New Software Development Model

[3] Something Mario says all the time.