On-call is broken: Kahneman & Tversky told me so

On-call is broken. It's used to paper over the cracks in systems, instead of as a reliable, reluctantly-used failsafe. The whole way we do it as an industry is just wrong.

Perhaps you disagree, but I think I have some pretty compelling evidence that it's true. I also think we can fix it -- or at least give it a good try.


How? Well, bear with me for a bit. First, let me tell you about two wonderful non-fiction books I devoured recently.

Like the best fiction, great non-fiction has the capacity to make the reader rethink what they do, and see everything anew. Better still, non-fiction has a higher chance of actually being true; or, at the very least, containing something the reader can use in the real world.

So, I have been rethinking things recently, on foot of the following two excellent non-fiction works:

For those who haven't yet read these -- even fans of fiction rather than non-fiction -- I highly recommend doing so. It really helps to hammer home how weirdly broken our minds actually are. Don't believe our minds have persistent flaws? Well, you've probably seen optical illusions at some point, or other illustrations of the difference between reality and our perception of reality. It turns out that our minds are generally like this, not just the visual processing components; indeed, the actual details of how exactly our mental architectures are shaped, and how they can be confused and misled, are fascinating.

So you might read the above and think something different, but I read the above and thought: on-call is broken. The whole concept is just wrong. Industry practice fundamentally pretends these problems don't exist.

(Yes, you knew that, or perhaps you felt that, but let me elaborate.)

Mental Architectures

The list of cognitive biases -- i.e. the list of ways in which our way of perceiving reality is broken -- investigated by Kahneman and Tversky is in itself fascinatingly baroque. The big picture, and hence the name of the book, is that human cognition appears to be divided into two separate buckets:

  • a quick-acting, emotional and intuitive, but easily misled “System 1”; and
  • a more deliberate, nuanced and logical, but much slower “System 2”.

There are various ways for the focus of cognition to pass between these systems, but in brief, System 1 is the avoid-the-sabre-toothed-tiger system, and System 2 is the tool-developing-and-cave-paintings system.

I must admit I find it a little odd to regard my consciousness in this way -- i.e. split into two halves which have different behaviours -- but the evidence that this is in fact true is overwhelming. Without spoiling the book, which again I strongly recommend you read, it turns out that System 1 gets a number of things systemically wrong, which can only be (partially) corrected by deliberate and conscious reasoning on the part of System 2 -- and sometimes not even consistently then, because some of the effects are so powerful as to mislead even deliberate reasoning.

Let's start off with my favourite example.

Anchoring

Anchoring is great. Anchoring is the effect whereby something I mention to you before you make a judgement primes you, or "anchors" you, pulling your subsequent estimates towards it -- and it does so in a way that is persistent and extremely difficult to compensate for, even if you're consciously aware that it can in theory happen, and even if you've been specifically told to beware of this effect!

For example, Kahneman and Tversky performed an experiment in which they asked one group of people to guess, within five seconds, the result of the following product:

1 * 2 * 3 * 4 ... * 8

and another group to guess the result of:

8 * 7 * 6 * 5 ... * 1

Bizarrely, although the answer is obviously mathematically the same in both cases, the size of the guesses differs depending on the order of the numbers presented; the median of the responses in the first case was 512, and the median in the second was 2250.
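
Both medians, incidentally, fall far short of the true value. A quick check of the arithmetic (a trivial Python sketch, just to make the number concrete):

    import math

    # The true value of 1 * 2 * 3 * ... * 8 (equivalently 8 * 7 * ... * 1).
    print(math.factorial(8))  # 40320 -- far above either group's median guess

Both groups were pulled down by the small partial products they could compute in the few seconds available -- the ascending group more so, since its first few numbers are smaller.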

Okay, that's System 1, you might say -- it's doing its best given it has five seconds to respond; that doesn't have much to do with Sensible Decisions in the Real World, does it? Sadly, no: there are studies showing that even after being explicitly told that anchoring exists, and that it will influence a decision, participants exposed to an anchor will still perform less accurately on estimation tasks than those who were simply never anchored in the first place.

In some ways, this is not a new result -- used-car salespeople take advantage of this, as does any skilled negotiator, and history is full of examples of setting the parameters of the debate beforehand being more powerful than the debate itself. What was new to me was the sheer power of this effect, even when the participants were aware of it.

Availability

The availability heuristic is also fun. The experiment in this case asked participants whether a random English word is more likely to begin with K or to have K as its third letter. Most participants were better at coming up with words that began with K than words with K as the third letter (retrieval by first letter being itself a cognitive quirk), and because those examples were more available to them, they took that ease of recall as a proxy for frequency and concluded that words beginning with K were more common.

Sadly, of course, the typical English text has twice as many words with K in the third position as words beginning with K (e.g. "ask", "acknowledge").
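
If you'd like to check the asymmetry against an actual word list, a rough sketch along these lines will do (the path /usr/share/dict/words is an assumption -- substitute whatever word list you have; note too that a raw dictionary count ignores word frequency, so it only approximates the "typical English text" claim):

    # Count dictionary words beginning with 'k' versus those with 'k' in third position.
    with open("/usr/share/dict/words") as f:
        words = {w.strip().lower() for w in f if w.strip()}

    starts_with_k = sum(1 for w in words if w.startswith("k"))
    k_third = sum(1 for w in words if len(w) >= 3 and w[2] == "k")
    print(f"begin with k: {starts_with_k}, k third: {k_third}")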

This "because I can think of it" subconscious strategy leads to incorrect estimations of probability in a very wide range of cases; typically overestimations of the probability of things that make the news, and underestimations of things that don't. For example, according to figures compiled by the (American) National Safety Council, you're ten times more likely to die as a result of some accidental, non-deliberate event (e.g. falling, hitting your head) than by being assaulted with a firearm, although vivid descriptions of firearm-related incidents are much easier to bring to mind.

Substitution

Another doozy. The famous example here is "the Linda problem", whereby participants are "told about an imaginary Linda, young, single, outspoken, and very bright, who, as a student, was deeply concerned with discrimination and social justice. Then they [were] asked whether it was more probable that Linda is a bank teller or that she is a bank teller and an active feminist."

Those of you who have been following along will probably not be surprised to hear that overwhelmingly, the participants declared that "feminist bank teller" was more likely than "bank teller". It might even seem like the right thing to you right now, reading it! I know it still sometimes does to me, despite the fact that it is absolutely, categorically wrong -- because, of course, every feminist bank teller is also just a bank teller. It cannot be more probable.

(As a side note, I guess there are perhaps some circumstances where it might be equally probable, but given a sufficiently large population, it is basically certain that you will have some dropouts from the feminist classification who nonetheless exhibit the other properties. This doesn't change the basic assertion that it cannot be more probable.)
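
The rule being violated is just the conjunction rule of probability: P(A and B) = P(A) x P(B given A), which can never exceed P(A). A toy check, with made-up numbers chosen purely for illustration:

    # Entirely hypothetical probabilities, for illustration only.
    p_teller = 0.02                  # P(Linda is a bank teller)
    p_feminist_given_teller = 0.30   # P(active feminist | bank teller)
    p_both = p_teller * p_feminist_given_teller   # 0.006
    assert p_both <= p_teller        # the conjunction can never be the more probable one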

Do take a moment, if you're not fully on board. It's not quite the Full Monty Hall in its unintuitiveness, but it's up there. By the way, the term "substitution" comes from the theory that System 1 is substituting the easier question, "Is Linda a feminist?", for the harder probability question it was actually asked, perhaps because it is distracted by the vivid details and story attributed to the person.

Implications

So, I earlier said that I read those books and immediately saw on-call in an entirely new light.

The obvious and necessary question is: why on earth do we get human beings to do on-call at all?

It seems absolutely clear that every single thing about that kind of work (i.e. urgent, complicated, and visceral) and the conditions under which it must take place (i.e. quickly, full of interrupts, often late at night or early in the morning) essentially guarantees mistakes and serious errors of judgement. The adrenaline flow and the necessity for a quick response will obviously push us into System 1 thinking, and everything we know about anchoring, availability, probability estimation and so on appears simply incompatible with correctly figuring out what is actually going on during a production incident, and how best to resolve it.

With that lens in mind, I've gone back through production incidents in my past. I can think of multiple incidents where I was anchored, either on a number range (e.g. for failure rates) or on a potential root cause, and as a result took more time to resolve the incident than I should have. I also know of several occasions where I presumed the failure that had just paged was another example of a type that had been paging recently, because those failures were more available to my mind -- presumed incorrectly, of course.

I wonder, dear Reader, if you went through your post-mortems, and you looked at the timeline of decision-making and root-cause analysis, could you be fully confident that the cognitive effects described above were not involved?

Almost certainly not, I suspect. Relatedly, I think it is almost certainly true that MTTRs in our industry would generally be shorter if those pernicious effects were avoided. (But perhaps this is just the availability heuristic playing havoc with my thinking; surely the only way to be definite is to examine the relevant postmortems in detail.)

Proposals

So, with all that in mind, what to do? Should we just ignore this, and continue to do what we're doing today? Or, conversely, if the above really is true, what would a useful approach be?

I think there are two possible approaches. The first is to try to mitigate the effects; the second is to do something more fundamental.

Mitigating the effects

One common way of compensating for bias is to involve someone else with different biases. In this context, one obvious technique would be "pair system administration", as per pair programming, where on-callers would get someone else in once an incident crossed some arbitrary threshold. The other person could then potentially provide an unbiased look at the problem, or at least a potentially different set of biases, which might end up usefully cancelling each other out. (Tanya reports similar techniques are used at Heroku, plus a specific timeout for handoff.)

Perhaps more useful, the other person could be delegated to work on background problems -- for example, assembling an accurate timeline (crucial, and difficult to reconstruct) -- in more of a System 2 way, without the pressure that comes from fronting a production incident. (Indeed, the Incident Management training at Google, known as IMAG, explicitly recommends splitting roles in just this way, although perhaps not for those reasons.)

Another approach would be to keep historical dashboards on constant display; rather than showing up-to-the-minute information, they would show the base rates of particular classes of behaviour, perhaps even the proportional distributions of certain kinds of failures, RPC rates, and so on. This might help to somewhat offset anchoring, and would help to supply the availability heuristic with more representative data.
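
As a sketch of what such a panel might compute (the incident classes and counts here are entirely hypothetical; the point is only that it displays long-run proportions rather than the last few minutes):

    # Hypothetical historical incident counts, e.g. over the last year.
    history = {
        "bad_data_push": 40,
        "overload": 25,
        "network_partition": 5,
        "disk_failure": 30,
    }

    total = sum(history.values())
    base_rates = {cls: count / total for cls, count in history.items()}

    # Render as the kind of static panel that would sit next to the live graphs.
    for cls, rate in sorted(base_rates.items(), key=lambda kv: -kv[1]):
        print(f"{cls:20s} {rate:6.1%}")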

Unfortunately, as we can see from the above, whatever techniques we use are likely to be limited in their effectiveness.

Software is eating your incidents

The second approach is a little more fundamental. Let's talk a little about why humans are on-call for things in the first place.

To my way of thinking, the largest value that a human provides to an on-call system is the ability to provide context and put the system into states it would not naturally reach itself. The ability to, as Douglas Hofstadter once put it, "jump out of the system". The human can realise more quickly than the software -- or, as is more likely, realise at all -- that something is in an exceptional condition, and quarantine the bad data or drain the datacentre, or otherwise perform the required exceptional act. In many production environments, it's simply necessary to have this, since there's no other way to provide that realisation that exceptional conditions hold true, and trigger the right responses.

And so we put people on-call for systems. (Systems, on-call for systems, as it were.)

However, there's another, perhaps more ideological way of looking at it, which is that every time a system gets into an exceptional condition, the software should have known. That is, the actual problem with conventional setups is not that the humans are performing poorly, but that the software isn't reacting correctly.

Think about this for a moment. Not every production incident you encounter is new and unique, right? Indeed, you've probably developed a few canned responses that handle the majority of cases -- quarantining the bad data, draining away from the affected location, and so on. That right there is a signal that the software should have realised what was going on and invoked those well-understood mechanisms itself.
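
A minimal sketch of that idea -- a table mapping well-understood symptoms to canned responses, with anything unrecognised escalating to a human. (The symptom and action names are hypothetical, and a real system would need rate limits, safety checks and an audit trail.)

    # Hypothetical symptoms and canned responses, for illustration only.
    def quarantine_bad_data(incident):
        print(f"quarantining dataset {incident['dataset']}")

    def drain_datacentre(incident):
        print(f"draining traffic away from {incident['location']}")

    def page_oncall(incident):
        print(f"paging a human for {incident['symptom']}")

    # The well-understood cases: where the symptom is definite, so is the response.
    CANNED_RESPONSES = {
        "corrupt_input_detected": quarantine_bad_data,
        "datacentre_unhealthy": drain_datacentre,
    }

    def handle(incident):
        action = CANNED_RESPONSES.get(incident["symptom"])
        if action:
            action(incident)       # no page, no 3am System 1 thinking
        else:
            page_oncall(incident)  # genuinely novel: this is where human judgement belongs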

I would ask, therefore: if we have a platonically pure example of something going wrong, where the problem is definite and the response absolutely correct, what value is there in a human doing this rather than a system? Very little, I would suggest. The element of human judgement, flawed as it is, should be kept for problems at the next level up. Perhaps, as someone asked me at a previous SREcon, we'd also benefit from applying machine learning to production incidents and behaviour generally, taking most of the first-order effects out of the domain of humans altogether.

You might well say that this is -- yet again -- denigrating and undermining the value that competent operations professionals bring to an organisation. I don't intend that. I've been an operations professional, and in most ways I still am one. But on-call is really hard. People have left the profession because of it. A number of presenters at LISA 2016 talked about how awful the effects are, and you know, I believe them.

But actually, that might not be enough.

I myself have been a member of teams, thankfully not at Google, where the very existence of an operations team was explicitly used as a workaround for bad software design. It seems to me that the disconnect between incident response being perceived as one of the primary values of an operations team, and the -- to me unquestionable -- reality that we are not very good at it as a species, is a problem that must be fixed for the good of the profession.

An important objection

Many of you are probably thinking to yourselves that it's unrealistic to say on-call will ever go away, so why bother trying to push more towards that goal? Internet weather will mess things up even if your own software is perfectly engineered, all kinds of meta problems or system interaction issues will still mess things up, and so on.

Well, it's true. Stuff will still mess up. A perfectly environmentally-aware piece of software is, as yet, unavailable.

But that isn't an excuse for continuing to do things this way, with its known (and sometimes unacknowledged) problems. Something that breaks twice a year is better than something that breaks twice a day. There is real value in incremental improvement, roofshots rather than moonshots. (After all, enough 10% improvements compounded together eventually amount to a 200% improvement…)

Ultimately, we continue to need better tools, as a profession, to help System 2 behaviour come to the fore during incidents, rather than System 1. We need better monitoring, display of relevant information in a non-misleading way, easier ways of precisely understanding what's changed, good ways to apply new changes, and low impact ways to test hypotheses.

We also need to be constantly improving the operational posture of our software systems -- in the broadest sense -- so that we are iteratively removing whole classes of error from the system. Relying on "infrastructure" to do that beneath "application" is of course another way of making that happen.

The real benefit of the SRE model

To really attack this problem, though, we can't ignore the incentives that lead to the situation we're in. To wit, that product development folks primarily put their time into feature development because the business wants them to, whereas putting the reaction intelligence -- to coin a phrase -- into the software is very rarely a priority for them.

But this is where the SRE model shines: software got us into this, software can get us out of it! The promise of SRE -- engineers with the full capability and authority to change what's running in production -- is actually a useful response to the problems above. Even if the SRE team is small, with nowhere near enough bandwidth to do everything itself, the pressure from a peer team to consider good operable design for software can often tip a pure product development team into actually doing that work anyway, because they know it's best practice.

But maybe you don't have SRE where you are. If you don't have an SRE team or it's unrealistic to start one, perhaps the only hope might be to move towards an accepted "default cloud stack", where these kinds of problems are either not present in the first place (e.g. App Engine) or are handled in some well understood and consistent way, because all the product developers start off with those building blocks rather than building things on raw POSIX. Changing the work habits of product developers, and their incentive structure, seems hard unless better building blocks can be supplied.

Ultimately, my answer is to put manners on the software, one way or another. Are there other, better ones?

(Acknowledgements: Cian Synnott, Betsy Beyer, and Léan Ní Chuilleanáin for comments, questions and insights.)

Comments

  1. Increasing levels of automation handling routine incidents is how SRE grows in scope. Things that, a layer lower, would have required paging the on-call engineer (e.g. a disk breaking on a machine running your server) are handled automatically, and are thus outside the humans' responsibility.

    One issue that may occur when this is taken to the extreme is that the operations team becomes afraid of not using the automation, and does not know what to do manually if the automation fails. We see this in aviation, for example, where career pilots cannot access their training in an emergency (see e.g. Air France 447).

  2. Nit: s/AppEngine/App Engine/ or s/AppEngine/Google App Engine/.
