Shallow Review of Consistency in Statement Evaluation

by Elizabeth, 9th Sep 2019




Most existing forecasting or evaluation platform questions are for very clearly verifiable questions:

  • "Who will win the next election?"
  • "How many cars will Tesla sell in 2030?"
  • “How many jelly beans are in this jar?”

But many of the questions we care about do not look like this. They might…

  • Be severely underspecified, e.g. “How much should we charge this customer for this vague feature request?”
  • Involve value judgements, e.g. “What is the optimum prison sentence for this convict?”, “How much does this plaintiff deserve for pain and suffering?”
  • Not have a clear stopping point, e.g. "What is the relative effectiveness of AI safety research vs. bio risk research?"
  • Require multiple steps instead of a yes/no or numerical answer, e.g. “What treatment is appropriate for this patient with precancerous cells?”
  • Not have good referents, e.g. “What is the market size for this completely new tech?”

An entity who could answer these questions well would be a very valuable asset. But what does well even mean here? We want people to be accurate, of course, but in many cases we also need their predictions/evaluations to be consistent to be actionable. This is especially true when fairness norms are in play, such as pricing[1] and prison sentencing.

There is a lot of research showing that people make inconsistent evaluations (with each other and themselves across time) across a wide variety of fields, even those that more closely resemble the “easy” questions above (valuing stocks, appraising real estate, sentencing criminals, evaluating job performance, auditing financial statements)[2]. It is even more difficult to consistently evaluate or predict novel questions or low-frequency events, like “Will India use a nuclear weapon on Pakistan by 1/1/20” or “How much counterfactual value has this organization created?”.

This paper is a shallow review of the literature around how to get entities to make consistent judgements. I want to note up front that a major limitation of this write-up and of shallow reviews in general is that I mostly relied on authors’ descriptions of their work and conclusions, rather than verifying their experimental design and conclusions for myself, or looking up others’ opinions of papers. As such, this post should be taken as a description of the state of the literature, not the state of the world.

Speaking of underspecified questions, “how to get consistent answers to complicated questions?” sure is one. I started this research project with a vague sense of an area from Ozzie Gooen; as I iterated, we came up with more specific questions. The following is a list of questions or hooks that came up as we discussed the research:

  1. Overall, what literature is available to answer the question “how to get people to answer messy questions consistently?”
  2. What are the costs of consistency?
  3. How often are evaluations / sentences simply misunderstood by people? Is there a decent science behind understanding and expecting levels of misunderstandings in different cases?
  4. How often are the evaluations doing the wrong things? What are the failure modes? For instance, one failure mode is that they have misjudged the value of some intermediate variables. Maybe that’s all there is?
  5. In what domains are subjective measures likely to be informative, especially about things other than subjective states? (For instance, the subjective measure “I think this work was done at an 8/10” is very different from “I’m feeling an 8/10 now”: both require an intuitive judgement, but in one case the intuitive judgement **is** the measure.)
  6. What are the main problems that come up for nonprofit evaluations? Have they found any methods that would be useful to us?
  7. How difficult is/can it be to come up with these composite indexes/linear models? What should we know when attempting them?
  8. Can we have any clever models where evaluators are really just predicting what other evaluators would say?
  9. What are good areas for follow-ups?

Some of these questions were answered in more detail than others; some were not answerable at all in the time available. Here is what I found.

Methods to Improve Consistency in Evaluations

  • Hold Keynesian beauty contests, in which the goal is to guess what other people will guess, not what you think is true.
    • A single study suggested this improves recall and precision.
    • “What do other people think?” is also a well known trick for getting people to be honest about opinions over which they expect to receive censure.
  • Use groups instead of individuals (Zhitomirsky-Geffet, Bar-Ilan, and Levene)
    • Configuring groups such that each group has the same variety of expertise allows you to use some non-common knowledge in your estimates (personal guess).
    • For procedures with many iterations (e.g., image labeling), combine multiple predictors with a mathematical model that incorporates varying skill, expertise, and task difficulty level (Welinder et al, Bachrach et al)
  • Remove extraneous information. Individuals’ estimates are widely affected by extraneous information even when they themselves view it as extraneous (Grimstad and Jørgensen, Stewart). In the real world this may be a lengthy process of determining what information is extraneous.
  • Force participants to write up models of their thinking (using variables for unknowns), and then evaluate the variables separately (Kahneman, Lovallo, and Sibony).
    • Kahneman suggests 5-6 variables, and absolutely no more than 8 (Knowledge@Wharton).
    • To preserve independence, have individuals write up their models before sharing with the group and coming to consensus.
    • See “Creating Composite Models” below.
  • Let participants know you’ll be asking about their reasoning afterwards (Kahneman, Lovallo, and Sibony).
  • Create reference guides that forecasters can refer to while making an estimate (e.g. “this is what Level 4 teaching looks like, this is what Level 5 teaching looks like”). Better, after they’ve made their estimate, show them the nearest reference and ask how they compare (Penny, Johnson, and Gordon).
    • In the case of novel questions, I speculate that it would be useful to make an imaginary reference chart (“this is what a country that’s 20% likely to launch a nuclear missile in the next year would look like…”).
  • Some evaluations can be broken down into sub-evaluations, in which people tend to agree on the first step but disagree on the second. E.g., they’ll agree on the ordering of the severity of personal injury cases, but translate the severity into wildly different dollar amounts (Sunstein, Kahneman, and Schkade). Or doctors will agree on the severity of a case but not the patient’s future outcome (Dwyer et al).
  • Training and retraining. With e.g. educational assessment, this means giving people reference evaluations and then practicing on a second set of evaluations until they get the right result (Wikipedia, Polin et al). Even after this was done, evaluators benefited from periodic retraining (Polin et al).
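
As a rough illustration of the "use groups instead of individuals" item above, the sketch below compares the spread of individual estimates with the spread of group averages. The estimates, the group size, and the choice of a plain average as the aggregation rule are all assumptions made for illustration; none of them come from the cited papers.

```python
from statistics import mean, pstdev

# Invented estimates from twelve individuals answering the same question.
individual_estimates = [12, 30, 18, 25, 9, 27, 21, 15, 33, 10, 24, 16]

GROUP_SIZE = 4  # hypothetical group size

# Split the individuals into consecutive groups and average within each group.
groups = [
    individual_estimates[i:i + GROUP_SIZE]
    for i in range(0, len(individual_estimates), GROUP_SIZE)
]
group_estimates = [mean(group) for group in groups]

# Population standard deviation is used here as a crude measure of inconsistency.
individual_spread = pstdev(individual_estimates)
group_spread = pstdev(group_estimates)

print(f"spread across individuals: {individual_spread:.2f}")
print(f"spread across groups:      {group_spread:.2f}")
```

With these made-up numbers the group averages sit much closer together than the individual estimates do. This only shows that averaging reduces noise; it says nothing about whether the groups are closer to the truth.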

Creating Composite Models

One idea that came up repeatedly in business literature was forcing predictors to build (potentially very crude) mathematical models.

Kahneman recommends the following procedure, which he calls creating a “reasoned rule” (summary from Jason Collins):

  1. Select six to eight variables that are distinct and obviously related to the predicted outcome. Assets and revenues (weighted positively) and liabilities (weighted negatively) would surely be included, along with a few other features of loan applications.
  2. Take the data from your set of cases (all the loan applications from the past year) and compute the mean and standard deviation of each variable in that set.
  3. For every case in the set, compute a “standard score” for each variable: the difference between the value in the case and the mean of the whole set, divided by the standard deviation. With standard scores, all variables are expressed on the same scale and can be compared and averaged.
  4. Compute a “summary score” for each case―the average of its variables’ standard scores. This is the output of the reasoned rule. The same formula will be used for new cases, using the mean and standard deviation of the original set and updating periodically.
  5. Order the cases in the set from high to low summary scores, and determine the appropriate actions for different ranges of scores. With loan applications, for instance, the actions might be “the top 10% of applicants will receive a discount” and “the bottom 30% will be turned down.”
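
The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not Kahneman's own implementation: the loan data, the three variables (fewer than the six to eight he recommends, for brevity), the equal weighting, and the use of the population standard deviation are all assumptions.

```python
from statistics import mean, pstdev

# Step 1: chosen variables and their direction (+1 helps, -1 hurts).
# Only three are used here to keep the example short.
SIGNS = {"assets": 1, "revenue": 1, "liabilities": -1}

# Hypothetical past loan applications; all figures are invented.
HISTORY = {
    "A": {"assets": 100, "revenue": 50, "liabilities": 20},
    "B": {"assets": 200, "revenue": 90, "liabilities": 10},
    "C": {"assets": 50, "revenue": 20, "liabilities": 60},
    "D": {"assets": 150, "revenue": 70, "liabilities": 30},
    "E": {"assets": 80, "revenue": 40, "liabilities": 50},
}


def fit(cases):
    """Step 2: mean and standard deviation of each variable in the reference set."""
    stats = {}
    for var in SIGNS:
        values = [case[var] for case in cases]
        stats[var] = (mean(values), pstdev(values))
    return stats


def summary_score(case, stats):
    """Steps 3 and 4: average of the signed standard scores for one case."""
    z_scores = [
        SIGNS[var] * (case[var] - mu) / sigma
        for var, (mu, sigma) in stats.items()
    ]
    return mean(z_scores)


stats = fit(list(HISTORY.values()))
scores = {name: summary_score(case, stats) for name, case in HISTORY.items()}

# Step 5: order cases from high to low; actions would be attached to score ranges.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

A new application would be scored with the same `stats` from the reference set, as step 4 describes, rather than refitting on every case.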

Richard H. Moss recommends a similar procedure in his paper on estimating climate change:

  1. For each of the major findings you expect to be developed in your chapter, identify the most important factors and uncertainties that are likely to affect the conclusions. Also specify which important factors/variables are being treated exogenously or fixed, as it will almost always be the case that some important components will be treated in this way when addressing the complex phenomena examined in the TAR.
  2. Document ranges and distributions in the literature, including sources of information on the key causes of uncertainty. Note that it is important to consider the types of evidence available to support a finding (e.g., distinguish findings that are well established through observations and tested theory from those that are not so established).
  3. Given the nature of the uncertainties and state of science, make an initial determination of the appropriate level of precision—is the state of science such that only qualitative estimates are possible, or is quantification possible, and if so, to how many significant digits? As the assessment proceeds, recalibrate level of precision in response to your assessment of new information.
  4. Quantitatively or qualitatively characterize the distribution of values that a parameter, variable, or outcome may take. First identify the end points of the range that the writing team establishes, and/or any high consequence, low probability outcomes or “outliers.” Particular care needs to be taken to specify what portion of the range is included in the estimate (e.g., this is a 90% confidence interval) and what the range is based on. Then provide an assessment of the general shape (e.g., uniform, bell, bimodal, skewed, symmetric) of the distribution. Finally, provide your assessment of the central tendency of the distribution (if appropriate).
  5. Using the terms described below, rate and describe the state of scientific information on which the conclusions and/or estimates (i.e. from step 4) are based.
  6. Prepare a “traceable account” of how the estimates were constructed that describes the writing team’s reasons for adopting a particular probability distribution, including important lines of evidence used, standards of evidence applied, approaches to combining/reconciling multiple lines of evidence, explicit explanations of methods for aggregation, and critical uncertainties.
  7. OPTIONAL: Use formal probabilistic frameworks for assessing expert judgment (i.e. decision analytic techniques), as appropriate for each writing team.

Costs of Consistency

It is trivial to get 100% consistency: just have everyone guess 0 every time. If you’re feeling fancy they could guess base rate. Obviously this would be pointless because you would learn nothing.

If two individuals are to come up with the same answer to a problem, they can only use information both of them have. This should on average damage the accuracy of the work (if it doesn’t, you have more problems). This can be okay in certain circumstances; the penal system sometimes values predictability over getting exactly the right answer, and customers get irate if quoted widely varying prices. But often it is not okay, and false precision is harmful.
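
A toy example makes the trade-off concrete. In the sketch below, two forecasters who always guess the base rate agree perfectly but score worse than two forecasters who use their own (different) information and therefore disagree. All outcomes and forecasts are invented, and mean absolute difference and the Brier score are used as stand-ins for "consistency" and "accuracy".

```python
from statistics import mean

# Invented binary outcomes for ten questions (1 = event happened).
outcomes = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
base_rate = mean(outcomes)

# Two forecasters who both guess the base rate every time.
constant_a = [base_rate] * len(outcomes)
constant_b = [base_rate] * len(outcomes)

# Two forecasters who each use private information (invented probabilities).
informed_a = [0.8, 0.2, 0.1, 0.7, 0.3, 0.1, 0.2, 0.9, 0.2, 0.1]
informed_b = [0.6, 0.1, 0.3, 0.9, 0.1, 0.2, 0.4, 0.7, 0.1, 0.3]


def brier(forecasts, actual):
    """Mean squared error of probability forecasts; lower is better."""
    return mean([(f - o) ** 2 for f, o in zip(forecasts, actual)])


def disagreement(first, second):
    """Mean absolute difference between two forecasters; 0 means perfectly consistent."""
    return mean([abs(x - y) for x, y in zip(first, second)])


print("base-rate guessers: disagreement", disagreement(constant_a, constant_b),
      "Brier", round(brier(constant_a, outcomes), 3))
print("informed guessers:  disagreement", round(disagreement(informed_a, informed_b), 3),
      "Brier", round(brier(informed_a, outcomes), 3))
```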

Measures of Noise in Answers

There’s a robust field of inter-rater reliability statistics, of which The Handbook of Inter-Rater Reliability appears to be the best single source. Due to time constraints and the density of the subject I did not follow up on this further.
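
For readers who want a concrete starting point, Cohen's kappa is one commonly used inter-rater statistic for two raters assigning categorical labels. It compares observed agreement with the agreement expected by chance given each rater's label frequencies. The sketch below is a minimal implementation under those standard definitions; the ratings are invented, and the source does not single out this statistic.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_observed - p_expected) / (1 - p_expected)
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")

    n = len(rater_a)

    # Fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: probability both pick the same label independently,
    # using each rater's own label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / n ** 2

    if expected == 1:
        # Both raters used one identical label throughout; kappa is undefined.
        raise ValueError("kappa is undefined when chance agreement is 1")

    return (observed - expected) / (1 - expected)


# Invented ratings of four items by two raters (1 = acceptable, 0 = not).
ratings_a = [1, 1, 0, 0]
ratings_b = [1, 0, 0, 0]
kappa = cohens_kappa(ratings_a, ratings_b)
print(kappa)
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance. For more than two raters or for ordinal scales, other statistics from the same literature are more appropriate.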

Measures of Ambiguity in Questions

I found no data on ambiguity in predictions or statement evaluation. The richest source of related data was on ambiguity in product requirement specifications. There are several systems for measuring ambiguity in natural language, the most prominent of which is LOLITA. Other systems include:

I found no data on the cost that ambiguous requirements exact, or on how much of this cost could be avoided with NLP systems. These systems had major types of ambiguity they could not detect and were not a substitute for human evaluation.

Subjective Judgements

I found very mixed results on whether subjective judgements could replace objective composite measurements, and no obvious trends in which areas were robust to subjective predictions: negative, positive, negative, positive, negative.

Papers tended to assume the objective measurements were more accurate, without considering how they could be tampered with. E.g., in this study of the Denver police, crime rates were not found to be heavily correlated with resident satisfaction. The paper seemed to think this was a deficit in the residents’ understanding, as opposed to the police department interfering with crime statistics. So perhaps one area where subjective measurements are preferable is where nominally objective measurements are controlled by the institution being measured.

Limitations of This Paper and Future Work

  • Due to time constraints, I had to take papers’ word for their findings. I did not have time to look for replicability or statistical errors, and could only do quick checks of methodology. A future deep dive in any subject covered should include a more skeptical reading of my sources.
  • Most work done on inter-rater reliability is in fields like medicine, teacher evaluations, and image labeling. These involve estimating fairly known things with lots of reference instances. This is a fundamentally different kind of problem than predicting novel, low-probability events: among other differences, it’s harder to generate reference charts and training data.
  • There are many, many studies on inter-rater reliability in narrow fields. Sometimes they contain suggestions for mitigations; usually they do not. Additionally, an overwhelming majority of these studies are on cancer-diagnosis-type problems, not low-frequency-global-event-type problems. I read a few of these and got some value out of them (mostly mitigation techniques, such as asking why someone believed something), but hit diminishing returns after a few papers. A more thorough reading of the genre of “Humans are unreliable” would probably find more mitigations.
  • There are also many, many studies on using multiple human labelers to do image labeling or NLP tasks, often using mathematical models. I did not have time to dig into the actual models and took the papers’ word for their power. This paper on bootstrapping from 0 to known question answer, question difficulty, and IQ assessment of participants looks especially interesting.

Edit 9/16: This review paper, found by DanielFilan, looks even better.

  • A more thorough understanding of the statistics would be useful, perhaps starting with The Handbook of Inter-Rater Reliability or
  • How to get the best work out of groups working together? This is a social psychology research project in its own right.
  • There is a lot of information about how to make crowds more accurate, but not more consistent.
  • Investigate the bias-variance tradeoff more, especially for human decision making.
  • Books that would be relevant to the questions:
    • Protocol Analysis includes sections on coding verbal reports reliably.
    • Daniel Kahneman is writing a tantalizingly relevant book (Noise) that will not be available for at least a year, possibly more.
    • Emerging Trends in the Development and Application of Composite Indicators
    • Superforecasting
    • The Power of Mathematical Thinking
    • The Power of Intuition (least sure about that one)
    • WISER: Getting Beyond Groupthink to Make Groups Smarter
    • Psychology of Intelligence Analysis

Edit 9/16: Raemon describes this as "Thinking Fast and Slow" for CIA agents.

    • Collective Wisdom: Principles and Mechanisms
    • Dialogue Mapping: Building Shared Understanding of Wicked Problems
    • How to Measure Anything
    • Uncertain Judgements: Eliciting Experts' Probabilities

Edit 9/16: on skimming, Ruby did not find anything specifically related to consistency.

    • Cambridge Handbook on Expertise

This report was funded by a forecasting infrastructure project managed by Ozzie Gooen, which is itself funded by a grant from the Effective Altruism Long Term Future Fund.

My raw notes are available here.

[1] While companies are typically trying to maximize profits, customers are often extremely sensitive to perceived injustices in pricing, and inconsistencies are perceived as injustices.

[2] List courtesy

9/16/2019: Made various updates based on other people's research, seen in the comments of this post, related questions, and privately shared write ups. Thanks to everyone for coming out.