This post was co-authored by the Forecasting Research Institute and Avital Morris. Thanks to Josh Rosenberg for managing this work, Kayla Gamin and Bridget Williams for fact-checking and copy-editing, and the whole FRI XPT team for all their work on this project.
Summary
- We think the XPT has provided some useful insights into the way that sophisticated specialists and generalists conduct debates about the greatest risks facing humanity and how they translate their beliefs about those risks into quantitative forecasts. But we think there are many questions left unanswered, which we want to explore in future research.
- We currently have five topics that we’re interested in exploring in future research, including the next iteration of the XPT:
- Explore the value of early warning indicators from XPT-1
- Produce better early warning indicators
- Develop better methods of facilitating productive adversarial collaborations between disagreeing schools of thought
- Identify and validate better methods of eliciting low-probability forecasts
- Make these forecasts more relevant to policymakers
- We’d love to hear other ideas for questions to explore and any other feedback.
- We’re also currently hiring!
- You can provide feedback or apply to work with us here.
Our thoughts on next steps for XPT
As we at FRI look back on this first version of the XPT[1] (let’s call it XPT-1), we are thinking ahead to future research, including an updated version of the XPT. We see the XPT as an iterative process, and we have learned a lot from this first round. Some of these lessons are practical, down-to-earth insights into how to recruit busy professionals and keep them engaged through a demanding multi-month marathon. Other lessons are more substantive. XPT-1 yielded first-of-their-kind, in-depth assessments of how sophisticated specialists and generalists conduct debates about the greatest risks facing humanity and how they translate their beliefs about those risks into quantitative forecasts.
We are also acutely aware that XPT-1 leaves an array of key questions unanswered and we plan to use future iterations of the XPT to produce answers. In our upcoming XPT research, we have five topics we want to study in more depth:
- Explore the value of early warning indicators from XPT-1
- Produce better early warning indicators
- Develop better methods of facilitating productive adversarial collaborations between disagreeing schools of thought
- Identify and validate better methods of eliciting low-probability forecasts
- Make these forecasts more relevant to policymakers
Explore the value of early warning indicators from XPT-1
In XPT-1, we asked numerous questions about how the risk landscape would evolve in the short term, including several “early warning” AI-risk indicators, such as “How much will be spent on compute in the largest AI experiment by 2024?” The XPT also made a major effort to identify crux questions whose resolutions in 2024 will tell forecasters more about longer-run futures in 2030 or even 2100.
In our future work, we want to ask participants to update their longer-run forecasts based on the outcome of the short-run questions. For example, if spending on AI R&D turns out to be lower or higher by 2024 than a forecaster predicted it would be, will they choose to update their longer-run predictions of AI-related risk?
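As a rough illustration of what such updating could look like (our own sketch, not FRI’s procedure), the snippet below shows how a forecaster who has stated long-run forecasts conditional on a 2024 crux could mechanically revise their 2100 estimate once the crux resolves; all probabilities here are invented.

```python
# Hypothetical illustration: updating a long-run risk forecast once a
# short-run crux question resolves. All probabilities are invented.

# Forecaster's prior on the 2024 crux ("compute spending exceeds $X"):
p_crux = 0.40

# Long-run (2100) risk forecasts stated conditional on the crux outcome:
p_risk_if_crux_true = 0.10
p_risk_if_crux_false = 0.02

# Marginal (pre-resolution) long-run forecast, by the law of total probability:
p_risk_prior = p_crux * p_risk_if_crux_true + (1 - p_crux) * p_risk_if_crux_false

# Once the crux resolves, the long-run forecast should move to the
# corresponding conditional value (absent other new information).
crux_resolved_true = True
p_risk_posterior = p_risk_if_crux_true if crux_resolved_true else p_risk_if_crux_false

print(f"Pre-resolution 2100 forecast:  {p_risk_prior:.3f}")
print(f"Post-resolution 2100 forecast: {p_risk_posterior:.3f}")
```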
When we evaluate forecasters’ accuracy on short-run questions in 2024, we expect the XPT to also provide new evidence on the relationship between reasoning quality and forecasting accuracy. We captured more than five million words of reasoning, deliberation, and synthesis across the XPT, and we can analyze that data to determine which types of arguments are related to forecast accuracy across all the areas the XPT covered. For example, we’ll be able to compare the reasoning “quality” of optimists and pessimists: do people who expect things to go well, or those who expect things to go badly, tend to give more convincing arguments for their beliefs? This type of analysis could be particularly valuable since it can provide insights beyond forecasting. By studying the properties of arguments associated with forecasting accuracy, we expect to make novel contributions that can be applied to debate more generally.
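For a sense of what this kind of analysis might involve, here is a minimal sketch with an invented toy dataset and a deliberately crude text feature, correlating properties of written rationales with Brier scores; it is illustrative only, not the analysis pipeline we will actually use.

```python
# Hypothetical sketch: do simple properties of a forecaster's written
# rationale track forecasting accuracy? Data and features are invented.
from statistics import correlation  # Pearson r (Python 3.10+)

rationales = [
    ("Base rates suggest ... however, three counterarguments ...", 0.12),
    ("This is obviously going to happen.",                         0.41),
    ("On one hand ..., on the other hand ..., weighing these ...", 0.09),
    ("Experts say so.",                                            0.35),
]  # (rationale text, Brier score on resolved short-run questions)

def n_considerations(text: str) -> int:
    """Crude proxy for argument richness: count of contrastive markers."""
    markers = ("however", "on the other hand", "counterargument", "weighing")
    return sum(text.lower().count(m) for m in markers)

features = [n_considerations(t) for t, _ in rationales]
briers = [b for _, b in rationales]

# A negative correlation would (weakly) suggest that richer rationales go
# with lower Brier scores, i.e. better accuracy.
print("Pearson r(features, Brier):", round(correlation(features, briers), 2))
```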
Produce better early warning indicators
If we want to put as much emphasis on the quality of questions as on the accuracy of forecasts, we need to incentivize high-quality questions as rigorously as we do forecasting accuracy.[2] For many of the short-run questions in the XPT, experts and superforecasters agreed, while disagreeing strongly about the existential risks that those questions were supposed to predict. That tells us that at least some of the questions we asked did not succeed at finding short-run predictors of existential risks: if people who are very concerned about an existential risk by 2100 and people who are not concerned expect the same results in 2030, then finding out what happens in 2030 will not give us additional information about who is right about 2100.
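One way to make this concrete (a toy sketch with invented numbers, not a metric we have settled on) is to score a candidate question by how far apart the concerned and unconcerned camps’ short-run forecasts sit, for example via a symmetrized KL divergence:

```python
# Hypothetical sketch: score candidate early-warning questions by how much
# the "concerned" and "unconcerned" camps disagree about their short-run
# resolution. A question both camps answer identically cannot tell us who
# is right about 2100. All numbers are invented.
from math import log2

def kl(p, q):
    """KL divergence (bits) between two discrete distributions."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

candidate_questions = {
    # question: (concerned camp's forecast, unconcerned camp's forecast)
    # over outcome bins for the 2030 resolution.
    "compute spending > $10B by 2030":     ([0.70, 0.20, 0.10], [0.60, 0.25, 0.15]),
    "autonomous replication demo by 2030": ([0.60, 0.30, 0.10], [0.10, 0.30, 0.60]),
}

for q, (concerned, unconcerned) in candidate_questions.items():
    score = 0.5 * (kl(concerned, unconcerned) + kl(unconcerned, concerned))
    print(f"{q}: divergence ≈ {score:.2f} bits")
```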
We have already begun to develop formal metrics for what constitutes a useful forecast question and we plan to build a database of strong candidate questions. For our future projects, we hope to create a longitudinal panel of forecasters and question-generators (often but not always the same people) that works together over the coming years to shed more light on the feasibility of early warning indicators for various topics.
Develop better methods of facilitating productive adversarial collaborations between disagreeing schools of thought
Before XPT-1, we thought that deliberation and argument would lead forecasters to update on one another’s beliefs and arguments and eventually converge on similar probabilities. But on the questions where participants originally disagreed most, such as the likelihood of human extinction due to AI by 2100, this did not occur: participants disagreed nearly as much at the end of the tournament as they had at the beginning.
We want to understand why that happened and under what circumstances discussion and debate lead to consensus rather than stalemate. To check that XPT forecasters’ failure to converge was not due to fluky, method-specific factors, we are developing more tightly choreographed forms of “adversarial collaboration” with two novel properties. First, each side must demonstrate that it fully grasps each major argument of the other side before offering rebuttals, using, for example, ideological Turing tests; the other side must then demonstrate that it fully grasps the rebuttals before responding. Second, adversarial collaborators will agree ex ante on which shorter-range cruxes would, once objectively resolved, move their judgments of existential risk probability, and participants will focus on generating those questions.
In the past few months, we have completed some work along those lines and are looking forward to building on it in the future.
Identify and validate better methods of eliciting low-probability forecasts
Researchers have long known about the instability of tiny probability estimates.[3] In most situations, people underweight or even completely ignore the possibility of a very unlikely event, treating small probabilities as if they were 0%. On the other hand, when prompted to think carefully about those possibilities, people overweight them, treating a very small probability as much bigger than it actually is. This “Heisenberg effect” of forecasting makes studying very improbable events even harder.
In addition, not knowing how to tell very small probabilities apart would make forecasting much less useful in many practical applications. Some advocates of the Precautionary Principle[4] have argued that if forecasters cannot reliably reason about order-of-magnitude differences between very small probabilities, we should adopt an extremely risk-averse threshold for any technology: if we are not sure whether forecasters can tell 0.001% apart from 0.000001% (a 1,000-fold difference), then we should treat a 0.000001% forecast of a catastrophic risk as if it were 0.001% and be much more cautious about potential dangers.
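To see why that 1,000-fold ambiguity matters, here is a toy expected-harm calculation; the harm figure and probabilities are arbitrary placeholders, not XPT estimates.

```python
# Toy illustration of why order-of-magnitude ambiguity matters. If a
# catastrophe would cost H (in arbitrary units), the expected harm scales
# linearly with the probability, so conflating 0.000001% with 0.001%
# changes the implied expected harm by a factor of 1,000. Numbers invented.
harm = 1e9  # hypothetical cost of the catastrophe, arbitrary units

p_low  = 0.000001 / 100   # 0.000001% as a probability
p_high = 0.001 / 100      # 0.001% as a probability

print("Expected harm at 0.000001%:", harm * p_low)    # 10.0
print("Expected harm at 0.001%:   ", harm * p_high)   # 10000.0
print("Ratio:", (harm * p_high) / (harm * p_low))     # 1000.0
```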
We are excited to treat forecasters’ ability to compare very small probabilities as an empirical question. To resolve it, we will assess the skill of well-incentivized top forecasters at (a) making reliable (non-contradictory) judgments about micro-probability events and (b) making accurate judgments in simulated worlds that permit ground-truth determinations. We are also now testing new methods of eliciting micro-probabilities. For example, we could try a method often used in psychophysical scaling that relies on comparative judgments: giving participants anchor or comparison values (e.g., the risk of being struck by lightning) and asking them whether their risk estimate is higher or lower.
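As a rough sketch of how anchor-based comparison could work in practice (an illustration with approximate anchor values, not a method we have committed to), a respondent’s tiny probability could be bracketed by a ladder of familiar risks and reported as the log-scale midpoint of the bracketing anchors:

```python
# Rough sketch of anchor-based elicitation: instead of asking for a tiny
# probability directly, ask "is your risk higher or lower than X?" for a
# ladder of familiar anchors, then report the log-space midpoint of the
# bracketing anchors. Anchor values are approximate and illustrative.
from math import sqrt

ANCHORS = [  # (description, rough annual probability) - illustrative only
    ("struck by lightning this year",         1e-6),
    ("injured in a road accident this year",  1e-3),
    ("catching a cold this year",             5e-1),
]

def elicit(is_higher_than) -> float:
    """is_higher_than(anchor_p) -> True if the respondent's risk exceeds anchor_p.
    Returns a point estimate: the geometric midpoint of the bracketing anchors."""
    lower, upper = 1e-9, 1.0  # hypothetical floor and ceiling
    for _, p in ANCHORS:
        if is_higher_than(p):
            lower = max(lower, p)
        else:
            upper = min(upper, p)
    return sqrt(lower * upper)  # geometric mean = midpoint on a log scale

# Example: a respondent who says "higher than lightning, lower than the rest"
estimate = elicit(lambda p: p <= 1e-6)
print(f"Implied estimate: {estimate:.2e}")  # ~3.2e-05, between 1e-6 and 1e-3
```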
Make these forecasts more relevant to policymakers
Participants in XPT-1 addressed many topics that have huge potential policy implications. However, it is not always obvious how policymakers could or should incorporate these forecasts into their work.
The most direct solution for ensuring the policy relevance of high-stakes forecasts is to shift the focus of XPT-style elicitation from event-focused forecasting to policy-conditional forecasting. Instead of just asking “How likely is Y?” we can ask “How likely is Y if society goes down this or that policy path?” We can then feed those probabilities into a preferred cost-benefit framework. We also recommend experimenting with a new format, Risk Mitigation tournaments,[5] designed to accelerate convergence on good policy options using intersubjective incentives: asking each of two teams of forecasters with strong track records to do their best at predicting the other team’s rank-ordered policy preferences.
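As a toy illustration (with invented policies, probabilities, and costs) of how policy-conditional probabilities could feed into a cost-benefit comparison, and of how rank-order predictions might be scored intersubjectively:

```python
# Hypothetical sketch of (a) policy-conditional forecasts fed into a simple
# cost-benefit comparison and (b) an intersubjective score for how well one
# team predicted the other team's policy ranking. All numbers, policies,
# and costs are invented for illustration.

# (a) P(catastrophe by 2100 | policy path), elicited per path:
p_given_policy = {"status quo": 0.030, "compute governance": 0.012, "moratorium": 0.008}
policy_cost    = {"status quo": 0.0,   "compute governance": 1.0,   "moratorium": 5.0}
CATASTROPHE_COST = 1000.0  # arbitrary units, same scale as policy_cost

for policy, p in p_given_policy.items():
    expected_total = policy_cost[policy] + p * CATASTROPHE_COST
    print(f"{policy}: expected cost = {expected_total:.1f}")

# (b) Team A's prediction of Team B's ranking, scored by Spearman's footrule
# (sum of absolute rank differences; 0 = perfect prediction).
team_b_actual   = ["compute governance", "moratorium", "status quo"]
team_a_predicts = ["compute governance", "status quo", "moratorium"]

footrule = sum(abs(team_b_actual.index(p) - team_a_predicts.index(p))
               for p in team_b_actual)
print("Rank-prediction error (footrule):", footrule)
```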
More ideas from you?
As we get ready for XPT-2, we want to hear feedback and suggestions. Which of these ideas for future research sound interesting? What problems do you see? What questions are we missing, and what methods could we be using to answer them? Please leave your suggestions in the comments.
We are eager to make progress toward answering our biggest questions about how to predict future events and how to assess good forecasts so that policymakers can use them to make better decisions.
If you’d like to participate in our future studies or apply to work with us on future projects, please register your interest or apply here.
- ^
FRI’s Existential Risk Persuasion Tournament, held from June through October 2022, asked 89 superforecasters and 80 experts to develop forecasts on questions related to existential and catastrophic risk, as well as other questions on humanity’s future. Further details on the tournament and the overall results are available in this report.
- ^
This is the question equivalent of proper scoring rules for forecasts.
- ^
Daniel Kahneman and Amos Tversky, “Prospect Theory: An Analysis of Decision under Risk,” Econometrica 47, no. 2 (March 1979): 263-292, https://doi.org/10.2307/1914185.
- ^
H. Orri Stefánsson, “On the Limits of the Precautionary Principle,” Risk Analysis 39, no. 6 (2019): 1204-1222, https://doi.org/10.1111/risa.13265.
- ^
Ezra Karger, Pavel D. Atanasov, and Philip Tetlock, “Improving Judgments of Existential Risk: Better Forecasts, Questions, Explanations, Policies,” SSRN Working Paper (2022), https://doi.org/10.2139/ssrn.4001628.