1 Jul 2021

Noise: The unwanted variability in human judgement

From Nine To Noon, 10:06 am on 1 July 2021

Two doctors treating identical patients can give different diagnoses, and two judges in the same court can give different sentences to people who have committed the same crime.

Interviewers of job candidates can make widely different assessments of the same people.

Even fingerprint examiners sometimes differ in deciding whether a print found at a crime scene matches a suspect.


These same doctors, judges, interviewers or forensic examiners can make different decisions depending on whether it's morning or night, Monday or Friday.

They're examples of noise: variability in judgments that should be identical, Olivier Sibony, a professor and writer who specialises in the quality of strategic thinking, told Kathryn Ryan.

“We sometimes think that it's fine to have variability. If we're talking about creative endeavours, if we're trying to generate ideas, we want divergence, we want people to disagree.

“But when we are looking at medical diagnosis, or when we're looking at hiring decisions, or when we're looking at professors grading essays, we assume that there is a truth and any variability from that presumed truth can only be error,” Sibony says.

One study gave 200 judges the same case and asked for their judgements on it, he says. The average sentence across all the judges was seven years.

“Now, if you picked two judges randomly in that sample of 200 judges, the average difference between them, or the median difference between them, would be about four years. Which basically means that the moment you're assigned to judge A or judge B, your sentence has already been set to five years or to nine years, when the just sentence would be seven years.

“That's a very large difference and bear in mind, this is the median difference, which means that in half of the cases, for half of the pairs of judges, the difference would be even larger than that.”
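The figure Sibony describes, the median gap between two randomly chosen judges, can be sketched in a few lines of Python. This is only an illustration: the sentences below are invented, not the study's data, and the names `sentences` and `pairwise_differences` are hypothetical.

```python
from itertools import combinations
from statistics import mean, median

# Hypothetical sentences (in years) given by different judges for one
# identical case. These numbers are invented for illustration only.
sentences = [3, 5, 6, 7, 7, 8, 9, 11]


def pairwise_differences(values):
    """Return the absolute difference for every unordered pair of values."""
    return [abs(a - b) for a, b in combinations(values, 2)]


diffs = pairwise_differences(sentences)
average_sentence = mean(sentences)  # the "average sentence" across judges
median_gap = median(diffs)          # typical gap between two random judges

print(f"average sentence: {average_sentence} years")
print(f"median difference between two judges: {median_gap} years")
```

By definition of the median, no more than half of the pairs can show a gap larger than `median_gap`, which is the point Sibony makes about the judges.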

Even larger divergences were found among insurance assessors, he says.

“When we looked at an insurance company where underwriters have to set a price for an insurance, have to set the premium that you will have to pay, the average difference between two underwriters looking at the same case and the same exact request for a proposal was 55 percent.”

More troubling, he says, these differences can be seen in medical diagnosis - even with radiology.

“If you show the same X-ray image to two doctors, theoretically, they should see the same thing. If one doctor says this is benign, and the other says this is malignant, you've got a serious problem.

“And that is, in fact, quite often the case. In fact, something even more striking happens when you show the same X-ray or the same CAT scan image to the same doctor a few weeks or a few months later. Of course, they don't remember the particular image that they're looking at, because they've been looking at a lot of images in the meantime. They sometimes disagree with themselves, they sometimes change their judgement from one time to the next.”

When they looked at the sources of this variability, it turned out it could be anything from the day of the week to the time of day or fatigue.

“One striking example is when you look at American doctors, they prescribe a lot more opioids in the afternoon than they do in the morning.

“Presumably, because in the afternoon they are tired, they're running late and it's a bit more difficult to resist a patient who is asking for opioids, or maybe it's a bit more difficult to explain to someone who is in pain that something less strong might in fact be a good choice.”

Conversely, he says, screening is more likely to be prescribed in the morning.

“Because again, conversely, you can imagine that it takes time to explain why this is important. If it's not urgent, you can always think you'll do it the next time.

“So, when you're running late, you do it less; when you still have time in the morning, you do it more.”

The way to mitigate this problem is to break the task down into smaller assignments, he says.

“The basic piece of advice we can give to anyone making a complicated judgment is try to break it down into components, try to structure it, and be as fact-based as possible in making each of the component judgments.

“And be as independent in each of those judgments from the other judgments as you can.”

The Apgar score for assessing newborn babies uses just such an approach, he says.

“When you're looking at the baby and you're measuring the baby's pulse, you're not looking at the baby's colour. And so if you think the colour is fine, you then measure the pulse and you say, oh, actually we have a problem with the pulse, you're not going to be influenced by the colour when you're looking at the pulse.”
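The structure Sibony describes can be sketched in code. The Apgar score rates five components (appearance, pulse, grimace, activity and respiration) from 0 to 2 each and adds them up. The function below is a minimal illustration of that "judge each part separately, then combine" idea, not a clinical tool; the name `apgar_total` and the example ratings are hypothetical.

```python
# The five components of the Apgar score, each rated 0, 1 or 2.
APGAR_COMPONENTS = ("appearance", "pulse", "grimace", "activity", "respiration")


def apgar_total(ratings):
    """Add up five component ratings that were each assessed on their own.

    `ratings` maps every component name to 0, 1 or 2. Each rating is
    validated separately, so a good score on one component cannot stand in
    for a missing or invalid score on another.
    """
    for name in APGAR_COMPONENTS:
        if name not in ratings:
            raise ValueError(f"missing rating for {name}")
        if ratings[name] not in (0, 1, 2):
            raise ValueError(f"rating for {name} must be 0, 1 or 2")
    return sum(ratings[name] for name in APGAR_COMPONENTS)


# Hypothetical example: colour looks fine, but the pulse is rated lower.
example = {
    "appearance": 2,
    "pulse": 1,
    "grimace": 2,
    "activity": 2,
    "respiration": 2,
}
print(apgar_total(example))
```

Because the total is only computed after every component has been rated, the overall impression is an output of the process rather than an input to it.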

This step-by-step technique reduces not only confirmation bias but also the halo effect, he says.

“When we have a positive impression of someone, we will look at all the skills or the competencies of that person in the halo of the first good impression, which is why if you are doing hiring interviews, and in the first five minutes of the interview you form a good impression of the candidate, you will tend to say that that candidate checks all the boxes that you had on the form you're looking at.

“And if you had a bad impression at the beginning, you will tend to say, no, this candidate is not very good on technical skills and so on, when in fact those judgments should be independent of each other.”

This kind of halo effect and noise is very present in the ubiquitous performance review, Sibony says.

“When we think of all the time and all the effort that organisations dedicate to performance reviews, it's quite concerning that so much of the outcome of these performance reviews really has nothing to do with the performance of the person being evaluated. About three quarters of that performance review variance is noise.”

Olivier Sibony is the co-author of Noise, along with Daniel Kahneman and Cass R. Sunstein.