Authors of linked report: Josh Rosenberg, Ezra Karger, Avital Morris, Molly Hickman, Rose Hadshar, Zachary Jacobs, Philip Tetlock
Today, the Forecasting Research Institute (FRI) released “Roots of Disagreement on AI Risk: Exploring the Potential and Pitfalls of Adversarial Collaboration,” which discusses the results of an adversarial collaboration focused on forecasting risks from AI.
In this post, we provide a brief overview of the methods, findings, and directions for further research. For much more analysis and discussion, see the full report: https://forecastingresearch.org/s/AIcollaboration.pdf
Abstract
We brought together generalist forecasters and domain experts (n=22) who disagreed about the risk AI poses to humanity in the next century. The “concerned” participants (all of whom were domain experts) predicted a 20% chance of an AI-caused existential catastrophe by 2100, while the “skeptical” group (mainly “superforecasters”) predicted a 0.12% chance. Participants worked together to find the strongest near-term cruxes: forecasting questions resolving by 2030 that would lead to the largest change in their beliefs (in expectation) about the risk of existential catastrophe by 2100. Neither the concerned nor the skeptics substantially updated toward the other’s views during our study, though one of the top short-term cruxes we identified is expected to close the gap in beliefs about AI existential catastrophe by about 5%: approximately 1 percentage point out of the roughly 20 percentage point gap in existential catastrophe forecasts. We find greater agreement about a broader set of risks from AI over the next thousand years: the two groups gave median forecasts of 30% (skeptics) and 40% (concerned) that AI will have severe negative effects on humanity by causing major declines in population, very low self-reported well-being, or extinction.
Extended Executive Summary
In July 2023, we released our Existential Risk Persuasion Tournament (XPT) report, which identified large disagreements between domain experts and generalist forecasters about key risks to humanity (Karger et al. 2023). This new project—a structured adversarial collaboration run in April and May 2023—is a follow-up to the XPT focused on better understanding the drivers of disagreement about AI risk.
Methods
We recruited participants to join “AI skeptic” (n=11) and “AI concerned” (n=11) groups that disagree strongly about the probability that AI will cause an existential catastrophe by 2100. The skeptic group included nine superforecasters and two domain experts. The concerned group consisted of domain experts referred to us by staff members at Open Philanthropy (the funder of this project) and the broader Effective Altruism community.
Participants spent 8 weeks (skeptic median: 80 hours of work on the project; concerned median: 31 hours) reading background materials, developing forecasts, and engaging in online discussion and video calls. We asked participants to work toward a better understanding of their sources of agreement and disagreement, and to propose and investigate “cruxes”: short-term indicators, usually resolving by 2030, that would cause the largest updates in expectation to each group’s view on the probability of existential catastrophe due to AI by 2100.
Results: What drives (and doesn’t drive) disagreement over AI risk
At the beginning of the project, the median “skeptic” forecasted a 0.10% chance of existential catastrophe due to AI by 2100, and the median “concerned” participant forecasted a 25% chance. By the end, these numbers were 0.12% and 20% respectively, though many participants did not attribute their updates to arguments made during the project.
We organize our findings as responses to four hypotheses about what drives disagreement:
Hypothesis #1 - Disagreements about AI risk persist due to lack of engagement among participants, low quality of participants, or because the skeptic and concerned groups did not understand each other's arguments
We found moderate evidence against these possibilities. Participants engaged for 25-100 hours each (skeptic median: 80 hours; concerned median: 31 hours), this project included a selective group of superforecasters and domain experts, and the groups were able to summarize each other's arguments well during the project and in follow-up surveys.
Hypothesis #2 - Disagreements about AI risk are explained by different short-term expectations (e.g. about AI capabilities, AI policy, or other factors that could be observed by 2030)
Most of the disagreement about AI risk by 2100 is not explained by indicators resolving by 2030 that we examined in this project. According to our metrics of crux quality, one of the top cruxes we identified is expected to close the gap in beliefs about AI existential catastrophe by about 5% (approximately 1.2 percentage points out of the 22.7 percentage point gap in forecasts for the median pair) when it resolves in 2030. For at least half of participants in each group, there was a question that was at least 5-10% as informative as being told by an oracle whether AI in fact caused an existential catastrophe or not. It is difficult to contextualize the size of these effects because this is the first project applying question metrics to AI forecasting questions that we are aware of.
However, near-term cruxes shed light on what the groups believe, where they disagree, and why:
- Evaluations of dangerous AI capabilities are relevant to both groups. One of the strongest cruxes that will resolve by 2030 is about whether METR (formerly known as ARC Evals) or a similar group will find that AI has developed dangerous capabilities such as autonomously replicating and avoiding shutdown. This crux illustrates a theme in the disagreement: the skeptic group typically did not find theoretical arguments for AI risk persuasive but would update their views based on real-world demonstrations of dangerous AI capabilities that verify existing theoretical arguments. If this question resolves negatively then the concerned group would be less worried, because it would mean that we have had years of progress from today’s models without this plausible set of dangerous capabilities becoming apparent.
- Generally, the questions that would be most informative to each of the two groups are fairly distinct. The concerned group’s highest-ranked cruxes tended to relate to AI alignment and alignment research. The skeptic group’s highest-ranked cruxes tended to relate to the development of lethal technologies and demonstrations of harmful AI power-seeking behavior. This suggests that many of the two groups’ biggest sources of uncertainty are different, and in many cases further investigation of one group’s uncertainties would not persuade the other.
- Commonly-discussed topics – such as near-term economic effects of AI and progress in many AI capabilities – did not seem like strong cruxes.
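The "about 5%" figure quoted under Hypothesis #2 is a ratio: the expected convergence in percentage points divided by the initial gap between the two forecasts. The short sketch below reproduces that arithmetic from the numbers given in this post. The function name `share_of_gap_closed` is a hypothetical helper introduced here for illustration and does not come from the report.

```python
def share_of_gap_closed(expected_convergence_pp: float, initial_gap_pp: float) -> float:
    """Fraction of the initial disagreement that a crux is expected to close.

    Both arguments are in percentage points. This helper is illustrative only.
    """
    if initial_gap_pp <= 0:
        raise ValueError("initial gap must be positive")
    return expected_convergence_pp / initial_gap_pp


# Figures quoted in the summary for the median pair of participants.
median_pair_share = share_of_gap_closed(1.2, 22.7)

# Rounded figures quoted in the abstract.
abstract_share = share_of_gap_closed(1.0, 20.0)

print(f"median pair: {median_pair_share:.1%}")  # median pair: 5.3%
print(f"abstract:    {abstract_share:.1%}")     # abstract:    5.0%
```

Both versions land near 5%, which shows that the two sets of figures in the post are consistent with each other.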
Hypothesis #3 - Disagreements about AI risk are explained by different long-term expectations
We found substantial evidence that disagreements about AI risk decreased between the groups when considering longer time horizons (the next thousand years) and a broader swath of severe negative outcomes from AI beyond extinction or civilizational collapse, such as large decreases in human well-being or total population.
Some of the key drivers of disagreement about AI risk are that the groups have different expectations about: (1) how long it will take until AIs have capabilities far beyond those of humans in all relevant domains; (2) how common it will be for AI systems to develop goals that might lead to human extinction; (3) whether killing all living humans would remain difficult for an advanced AI; and (4) how adequately they expect society to respond to dangers from advanced AI.
Supportive evidence for these claims includes:
- Both groups strongly expected that powerful AI (defined as “AI that exceeds the cognitive performance of humans in >95% of economically relevant domains”) would be developed by 2100 (skeptic median: 90%; concerned median: 88%). However, some skeptics argued that (i) strong physical capabilities (in addition to cognitive ones) would be important for causing severe negative effects in the world, and (ii) even if AI can do most cognitive tasks, there will likely be a “long tail” of tasks that require humans.
- The two groups also put similar total probabilities on at least one of a cluster of bad outcomes from AI happening over the next 1,000 years (median 40% and 30% for concerned and skeptic groups respectively). But they distribute their probabilities differently over time: the concerned group concentrates their probability mass before 2100, and the skeptics spread their probability mass more evenly over the next 1,000 years.
- We asked participants when AI will displace humans as the primary force that determines what happens in the future. The concerned group’s median date is 2045 and the skeptic group’s median date is 2450—405 years later.
Overall, many skeptics regarded their forecasts on AI existential risk as worryingly high, although those forecasts were low in absolute terms and much lower than the concerned group's.
Despite their large disagreements about AI outcomes over the long term, many participants in each group expressed a sense of humility about long-term forecasting and emphasized that they are not claiming to have confident predictions of distant events.
Hypothesis #4 - These groups have fundamental worldview disagreements that go beyond the discussion about AI
Disagreements about AI risk in this project often connected to more fundamental worldview differences between the groups. For example, the skeptics were somewhat anchored on the assumption that the world usually changes slowly, making the rapid extinction of humanity unlikely. The concerned group worked from a different starting point: namely, that the arrival of a higher-intelligence species, such as humans, has often led to the extinction of lower-intelligence species, such as large mammals on most continents. In this view, humanity’s prospects are grim as soon as AI is much more capable than we are. The concerned group also was more willing to place weight on theoretical arguments with multiple steps of logic, while the skeptics tended to doubt the usefulness of such arguments for forecasting the future.
Results: Forecasting methodology
This project establishes stronger metrics than have existed previously for evaluating the quality of AI forecasting questions. We view this project as an ongoing one, so we invite readers to try to generate cruxes that outperform the top cruxes from our project thus far, an exercise that underscores the value of establishing comparative benchmarks for new forecasting questions. See the “Value of Information” (VOI) and “Value of Discrimination” (VOD) calculators to inform intuitions about how these question metrics work. Please reach out to the authors with suggestions for high-quality cruxes.
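To build intuition for what a question metric of this kind measures, here is a minimal sketch of one simple way to score a crux: the expected absolute change in a forecaster's risk estimate once the crux resolves. This is an assumption made for illustration and is not necessarily the VOI or VOD formula used in the report or its calculators. The function name and all input probabilities are hypothetical.

```python
def expected_update(p_crux: float, p_risk_if_yes: float, p_risk_if_no: float) -> float:
    """Expected absolute change in a risk estimate after a yes/no crux resolves.

    p_crux:        forecaster's probability that the crux resolves "yes"
    p_risk_if_yes: forecaster's risk estimate conditional on "yes"
    p_risk_if_no:  forecaster's risk estimate conditional on "no"
    """
    # Law of total probability: the current estimate implied by the conditionals.
    prior = p_crux * p_risk_if_yes + (1 - p_crux) * p_risk_if_no
    # Weight the size of each possible update by how likely that outcome is.
    return (p_crux * abs(p_risk_if_yes - prior)
            + (1 - p_crux) * abs(p_risk_if_no - prior))


# Hypothetical forecaster: 50% on the crux, 30% risk if yes, 10% risk if no.
informative = expected_update(0.5, 0.30, 0.10)

# A crux whose outcome would not change the forecaster's estimate at all.
uninformative = expected_update(0.5, 0.20, 0.20)

print(round(informative, 3))
print(round(uninformative, 3))
```

Under this simplified measure, a question scores zero when both outcomes leave the forecaster's estimate unchanged, and it scores higher when the outcomes are both plausible and would move the estimate in different directions.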
Broader scientific implications
This project has implications for how much we can expect rational debate to shift people’s views on AI risk. Thoughtful groups of people engaged each other for a long time but converged very little. This raises questions about the belief formation process and how much is driven by explicit rational arguments vs. difficult-to-articulate worldviews vs. other, potentially non-epistemic factors (see research literature on motivated cognition, such as Gilovich et al. 2002; Kunda, 1990; Mercier and Sperber, 2011).
One notable finding is that a highly informative crux for both groups was whether their peers would update on AI risk over time. This highlights how social and epistemic groups can be important predictors of beliefs about AI risk.
Directions for further research
We see many other projects that could extend the research begun here to improve dialogue about AI risk and inform policy responses to AI.
Examples of remaining questions and future research projects include:
- Are there high-value 2030 cruxes that others can identify?
  - We were hoping to identify cruxes that would, in expectation, lead to a greater reduction in disagreement than the ones we ultimately discovered. We are interested to see whether readers of this report can propose higher value cruxes.
  - If people disagree a lot, it is likely that no single question would significantly reduce their disagreement in expectation. If such a question existed, they would already disagree less. However, there might still be better crux questions than the ones we have identified so far.
- What explains the gap in skeptics’ timelines between “powerful AI” and AI that replaces humanity as the driving force of the future? In other words, what are the skeptics’ views on timelines until superintelligent AI (suitably defined)? A preliminary answer is here, but more research is needed.
- To what extent are different “stories” of how AI development goes well or poorly important within each group?
  - The skeptic and concerned groups are not monoliths – within each group, people disagree about what the most likely AI dangers are, in addition to how likely those dangers are to happen.
  - Future work could try to find these schools of thought and see how their stories do or do not affect their forecasts.
- Would future adversarial collaborations be more successful if they focused on a smaller number of participants who work particularly well together and provided them with teams of researchers and other aids to support them?
- Would future adversarial collaborations be more successful if participants invested more time in an ongoing way, did additional background research, and spent time with each other in person, among other ways of increasing the intensity of engagement?
- How can we better understand what social and personality factors may be driving views on AI risk?
- Some evidence from this project suggests that there may be personality differences between skeptics and concerned participants. In particular, skeptics tended to spend more time on each question, were more likely to complete tasks by requested deadlines, and were highly communicative by email, suggesting they may be more conscientious. Some early reviewers of this report have hypothesized that the concerned group may be higher on openness to experience. We would be interested in studying the influence of conscientiousness, openness, or other personality traits on forecasting preferences and accuracy.
- We are also interested in investigating whether the differences between the skeptics and concerned group regarding how much weight to place on theoretical arguments with multiple steps of logic would persist in other debates, and whether it is related to professional training, personality traits, or any other factors, as well as whether there is any correlation between trust in theoretical arguments and forecasting accuracy.
- How could we have asked about the correlations between various potential crux questions? Presumably these events are not independent: a world where METR finds evidence of power-seeking traits is more likely to be one where AI can independently write and deploy AI. But we do not know how correlated each question is, so we do not know how people would update in 2030 based on different possible conjunctions.
- How typical or unusual is the AI risk debate? If we did a similar project with a different topic about which people have similarly large disagreements, would we see similar results?
- How much would improved questions or definitions change our results? In particular:
  - As better benchmarks for AI progress are developed, forecasts on when AIs will achieve those benchmarks may be better cruxes than those in this project.
  - Our definition of “AI takeover” may not match people’s intuitions about what AI futures are good or bad, and improving our operationalization may make forecasts on that question more useful.
- What other metrics might be useful for understanding how each group will update if the other group is right about how likely different cruxes are to resolve positively?
  - For example, we are exploring “counterpart credences” that would look at how much the concerned group will update in expectation if the skeptics are right about how likely a crux is, and vice versa.
  - Relatedly, it might be useful to look for additional “red and green flags,” or events that would be large updates to one side if they happened, even if they are very unlikely to happen.
- This project shares some goals and methods with FRI’s AI Conditional Trees project (report forthcoming), which works on using forecasts from AI experts to build a tree of conditional probabilities that is maximally informative about AI risk. Future work will bring each of these projects to bear on the other as we continue to find new ways to understand conditional forecasting and the AI risk debate.
In 2030, most of the questions we asked will resolve, and at that point, we will know much more about which side’s short-run forecasts were accurate. This may provide early clues into whether one group's methods and inclinations make them more accurate at AI forecasting over a several-year period. The question of how much we should update on AI risk by 2100 based on those results remains open. If the skeptics or the concerned group turn out to be mostly right about what 2030’s AI will be like, should we then trust their risk assessment for 2100 as well, and if so, how much?
We are also eager to see how readers of this report respond. We welcome suggestions for better cruxes, discussion about which parts of the report were more or less valuable, and suggestions for future research.
For the full report, see https://forecastingresearch.org/s/AIcollaboration.pdf
'The concerned group also was more willing to place weight on theoretical arguments with multiple steps of logic, while the skeptics tended to doubt the usefulness of such arguments for forecasting the future.'
Seems to me like it's wrong to say that this is a general "difference in worldview", until we know whether "the concerned group" (i.e. the people who think X-risk from AI is high) think this is the right approach to all/most/many questions, or just apply it to AI X-risk in particular. If the latter, there's a risk it's just special pleading for an idea they are attached to, whereas if the former is true, they might (or might not) be wrong, but it's not necessarily bias.
The first bullet point of the concerned group summarizing their own position was "non-extinction requires many things to go right, some of which seem unlikely".
This point was notably absent from the sceptics' summary of the concerned position.
Both sceptics and concerned agreed that a different important point on the concerned side was that it's harder to use base rates for unprecedented events with unclear reference classes.
I think these both provide a much better characterisation of the difference than the quote you're responding to.
"The concerned group also was more willing to place weight on theoretical arguments with multiple steps of logic, while the skeptics tended to doubt the usefulness of such arguments for forecasting the future."
Assuming the "concerned group" are likely to be more EA aligned (uncertain about this), I'm surprised they place more weight on multi-stage theory than the forecasters. I'm aware it's hard to use evidence for a problem as novel as AI progression, but it makes sense to me to try and I'm happy the forecasters did.
Here's a hypothesis:
The base case / historical precedent for existential AI risk is:
- AGI has never been developed
- ASI has never been developed
- Existentially deadly technology has never been developed (I don't count nuclear war or engineered pandemics, as they'll likely leave survivors)
- Highly deadly technology (>1M deaths) has never been cheap and easily copied
- We've never had supply chains so fully automated end-to-end that they could become self-sufficient with enough intelligence
- We've never had technology so networked that it could all be taken over by a strong enough hacker
Therefore, if you're in the skeptic camp, you don't have to make as much of an argument about specific scenarios where many things happen. You can just wave your arms and say it's never happened before because it's really hard and rare, as supported by the historical record.
In contrast, if you're in the concerned camp, you're making more of a positive claim about an imminent departure from historical precedent, so the burden of proof is on you. You have to present some compelling model or principles for explaining why the future is going to be different from the past.
Therefore, I think the concerned camp relying on theoretical arguments with multiple steps of logic might be a structural side effect of them having to argue against the historical precedent, rather than any innate preference for that type of argument.
I think that is probably the explanation, yes. But I don't think it gets rid of the problem for the concerned camp that usually, long complex arguments about how the future will go are wrong. This is not a sporting contest, where the concerned camp are doing well if they take a position that's harder to argue for and make a good go of it. It's closer to the mark to say that if you want to track truth you should (usually, mostly) avoid the positions that are hard to argue for.
I'm not saying no one should ever be moved by a big long complicated argument*. But I think that if your argument fails to move a bunch of smart people, selected for a good predictive track record, to anything like your view of the matter, that is an extremely strong signal that your complicated argument is nowhere near good enough to escape the general sensible prior that long complicated arguments about how the future will go are wrong. This is particularly the case when your assessment of the argument might be biased, which I think is true for AI safety people: if they are right, then they are some of the most important people, maybe even THE most important people in history, not to mention the quasi-religious sense of meaning people always draw from apocalyptic salvation v. damnation type stories. Meanwhile the GJ superforecasters don't really have much to lose if they decide "oh, I am wrong, looking at the arguments, the risk is more like 2-3% than 1 in 1000". (I am not claiming that there is zero reason for the supers to be biased against the hypothesis, but just that the situation is not very symmetric.) I think I would feel quite different about what this exercise (probably) shows, if the supers had all gone up to 1-2%, even though that is a lot lower than the concerned group.
I do wonder (though I think other factors are more important in explaining the opinions of the concerned group) whether familiarity with academic philosophy helps people be less persuaded by long complicated arguments. Philosophy is absolutely full of arguments that have plausible premises and are very convincing to their proponents, but which nonetheless fail to produce convergence amongst the community. After seeing a lot of that, I got used to not putting that much faith in argument. (Though plenty of philosophers remain dogmatic, and there are controversial philosophical views I hold with a reasonable amount of confidence.) I wonder if LessWrong functions a bit like a version of academic philosophy where there is, like philosophy, a strong culture of taking arguments seriously and trying to have them shape your views, but where consensus actually is reached on some big picture stuff. That might make people who were shaped by LW intellectually rather more optimistic about the power of argument (even as many of them would insist LW is not "philosophy".) But it could just be an effect of homogeneity of personalities among LW users, rather than a sign that LW was converging on truth.
*(Although personally, I am much more moved by "hmmm, creating a new class of agents more powerful than us could end with them on top; probably very bad from our perspective" than I am by anything more complicated. This is, I think a kind of base rate argument, based off of things like the history of colonialism and empire; but of course the analogy is quite weak, given that we get to create the new agents ourselves.)
The smart people were selected for having a good predictive track record on geopolitical questions with resolution times measured in months, a track record equaled or bettered by several* members of the concerned group. I think this is much less strong evidence of forecasting ability on the kinds of question discussed than you do.
*For what it's worth, I'd expect the skeptical group to do slightly better overall on e.g. non-AI GJP questions over the next 2 years; they do have better forecasting track records as a group on this kind of question, it's just not a stark difference.
I agree this is quite different from the standard GJ forecasting problem. And that GJ forecasters* are primarily selected for and experienced with forecasting quite different sorts of questions.
But my claim is not "trust them, they are well-calibrated on this". It's more "if your reason for thinking X will happen is a complex multi-stage argument, and a bunch of smart people with no particular reason to be biased, who are also selected for being careful and rational on at least some complicated emotive stuff, spend hours and hours on your argument and come away with a very different opinion on its strength, you probably shouldn't trust the argument much (though this is less clear if the argument depends on technical scientific or mathematical knowledge they lack**)". That is, I am not saying "supers are well-calibrated, so the risk probably is about 1 in 1000". I agree the case for that is not all that strong. I am saying "if the concerned group's credences are based in a multi-step, non-formal argument whose persuasiveness the supers feel very differently about, that is a bad sign for how well-justified those credences are."
Actually, in some ways, it might look better for AI X-risk work being a good use of money if the supers were obviously well-calibrated on this. A 1 in 1000 chance of an outcome as bad as extinction is likely worth spending some small portion of world GDP on preventing. And AI safety spending so far is a drop in the bucket compared to world GDP. (Yeah, I know technically the D stands for domestic so "world GDP" can't be quite the right term, but I forget the right one!). Indeed "AI risk is at least 1 in 1000" is how Greaves and MacAskill justify the "we can make a big difference to the long-term future in expectation" in 'The Case for Strong Longtermism'. (If a 1 in 1000 estimate is relatively robust, I think it is a big mistake to call this "Pascal's Mugging".)
*(of whom I'm one as it happens, though I didn't work on this: did work on the original X-risk forecasting tournament.)
**I am open to argument that this actually is the case here.
Why do you think superforecasters who were selected specifically for assigning a low probability to AI x-risk are well described as "a bunch of smart people with no particular reason to be biased"?
For the avoidance of doubt, I'm not upset that the supers were selected in this way, it's the whole point of the study, made very clear in the write-up, and was clear to me as a participant. It's just that "your arguments failed to convince randomly selected superforecasters" and "your arguments failed to convince a group of superforecasters who were specifically selected for confidently disagreeing with you" are very different pieces of evidence.
One small clarification: the skeptical group was not all superforecasters. There were two domain experts as well. I was one of them.
I'm sympathetic to David's point here. Even though the skeptic camp was selected for their skepticism, I think we still get some information from the fact that many hours of research and debate didn't move their opinions. I think there are plausible alternative worlds where the skeptics come in with low probabilities (by construction), but update upward by a few points after deeper engagement reveals holes in their early thinking.
Ok, I slightly overstated the point. This time, the supers selected were not a (mostly) random draw from the set of supers. But they were in the original X-risk tournament, and in that case too, they were not persuaded to change their credences via further interaction with the concerned (that is, the X-risk experts). Then, when we took the more skeptical of them and gave them yet more exposure to AI safety arguments, that still failed to move the skeptics. I think taken together, these two results show that AI safety arguments are not all that persuasive to the average super. (More precisely, that no amount of exposure to them will persuade all supers as a group to the point where they get a median significantly above 0.75% in X-risk by the century's end.)
TL;DR Lots of things are believed by some smart, informed, mostly well calibrated people. It's when your arguments are persuasive to (roughly) randomly selected smart, informed, well-calibrated people that we should start being really confident in them. (As a rough heuristic, not an exceptionless rule.)
They weren't randomly selected, they were selected specifically for scepticism!
Ok yes, in this case they were.
But this is a follow-up to the original X-risk tournament, where the selection really was fairly random (obviously not perfectly so, but it's not clear in what direction selection effects in which supers participated biased things.) And in the original tournament, the supers were also fairly unpersuaded (mostly) by the case for AI X-risk. Or rather, to avoid putting it in too binary a way, they didn't move their credences further on hearing more argument after the initial round of forecasting. (I do think the supers' level of concern was enough to motivate worrying about AI given how bad extinction is, so "unpersuaded" is a little misleading.) At that point, people then said 'they didn't spend enough time on it, and they didn't get the right experts'. Now, we have tried further with different experts, more time and effort, lots of back and forth etc., and those who participated in the second round are still not moved. Now, it is possible that the only reason the participants were not moved the second time round was because they were more skeptical than some other supers the first time round. (Though the difference between medians of 0.1% and 0.3% in X-risk by 2100 is not that great.) But I think if you get 'in imperfect conditions, a random smart crowd were not moved at all, then we tried the more skeptical ones in much better conditions and they still weren't moved at all', the most likely conclusion is that even people from the less skeptical half of the distribution from the first go round would not have moved their credences either had they participated in the second round. Of course, the evidence would be even stronger if the people had been randomly selected the second time as well as the first.