The Forecasting Research Institute conducted a survey asking different kinds of experts (including technical and non-technical) many questions about AI progress. The report, which was just published, is here. I've only looked at the report briefly and there is a lot that could be examined and discussed.
The major flaw I want to point out is in the framing of a question where survey respondents are presented with three different scenarios for AI progress: 1) the slow progress scenario, 2) the moderate progress scenario, and 3) the rapid progress scenario.
All three scenarios describe what will happen by the end of 2030, and respondents must choose one of the three; there is no option to choose none.
First, two important qualifications. Here’s qualification #1, outlined on page 104:
In the following scenarios, we consider the development of AI capabilities, not adoption. Regulation, social norms, or extended integration processes could all prevent the application of AI to all tasks of which it is capable.
Qualification #2, also on page 104:
We consider a capability to have been achieved if there exists an AI system that can do it:
- Inexpensively: with a computational cost not exceeding the salary of an appropriate 2025 human professional using the same amount of time to attempt the task.
- Reliably: what this means is context-dependent, but typically we mean as reliably as, or more reliably than, a human or humans who do the same tasks professionally in 2025.
With that said, here is the scenario that stipulates the least amount of AI progress, the slow progress scenario (on page 105):
Slow Progress
By the end of 2030 in this slower-progress future, AI is a capable assisting technology for humans; it can automate basic research tasks, generate mediocre creative content, assist in vacation planning, and conduct relatively standard tasks that are currently (2025) performed by humans in homes and factories.
Researchers can benefit from literature reviews on almost any topic, written at the level of a capable PhD student, yet AI systems rarely produce novel and feasible solutions to difficult problems. As a result, genuine scientific breakthroughs remain almost entirely the result of human-run labs and grant cycles. Nevertheless, AI tools can support other research tasks (e.g., copy editing and data cleaning and analysis), freeing up time for researchers to focus on higher-impact tasks. AI can handle roughly half of all freelance software-engineering jobs that would take an experienced human approximately 8 hours to complete in 2025, and if a company augments its customer service team with AI, it can expect the model to be able to resolve most complaints.
Writers enjoy a small productivity boost; models can turn out respectable short stories, but full-length novels still need heavy human rewriting to avoid plot holes or stylistic drift. AI can make a 3-minute song that humans would blindly judge to be of equal quality to a song released by a current (2025) major record label. At home, an AI system can draft emails, top up your online grocery cart, or collate news articles, and—so long as the task would take a human an hour or less and is well-scoped—it performs on par with a competent human assistant. With a few prompts, AI can create an itinerary and make bookings for a weeklong family vacation that feels curated by a discerning travel agent.
Self-driving car capabilities have advanced, but none have achieved true level-5 autonomy. Meanwhile, household robots can make a cup of coffee and unload and load a dishwasher in some modern homes—but they can’t do it as fast as most humans and they require a consistent environment and occasional human guidance. In advanced factories, autonomous systems can perform specific, repetitive tasks that require precision but little adaptability (e.g., wafer handling in semiconductor fabrication facilities).
So, in the slowest progress scenario, by the end of 2030, respondents are to imagine what is either nearly AGI or AGI outright.
This is a really strange way to frame the question. The slowest scenario is extremely aggressive and the moderate and rapid scenarios are even more aggressive. What was the Forecasting Research Institute hoping to learn here?
Edited on Friday, November 14, 2025 at 9:30 AM Eastern to add the following.
Here’s the question respondents are asked with regard to these scenarios (on page 104):
At the end of 2030, what percent of LEAP panelists will choose “slow progress,” “moderate progress,” or “rapid progress” as best matching the general level of AI progress?
The percentage of panelists forecast by the respondents is then presented in the report as the probability the respondents assign to each scenario (e.g., on page 38).
Edit #2 on Friday, November 14, 2025 at 6:25 PM Eastern.
I just noticed that the Forecasting Research Institute made a post on the EA Forum a few days ago that presents the question results as probabilities:
By 2030, the average expert thinks there is a 23% chance of a “rapid” AI progress scenario, where AI writes Pulitzer Prize-worthy novels, collapses years-long research into days and weeks, outcompetes any human software engineer, and independently develops new cures for cancer. Conversely, they give a 28% chance of a slow-progress scenario, in which AI is a useful assisting technology but falls short of transformative impact.
If the results are going to be presented this way, it seems particularly important to consider the wording and framing of the question.
Edit #3 on Saturday, November 15, 2025 at 10:10 PM Eastern.
I just learned that, in 2023, the Forecasting Research Institute published a survey on existential risk where the wording of a question changed the respondents’ estimated probability by 750,000x. When asked to estimate the odds of human extinction by 2100 in terms of a percentage, the median response was 5% or 1 in 20. When asked to estimate in terms of a 1-in-X chance with some examples of probabilities of events (e.g. being struck by lightning), the median response was 1 in 15 million or 0.00000667%. Details here.
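Purely for concreteness, here is the arithmetic behind the 750,000x figure, using only the two medians quoted above (a trivial sketch; no new data):

```python
# Ratio between the two median responses reported for the 2023 survey.
pct_framing = 0.05                 # median when asked as a percentage: 5%
one_in_x_framing = 1 / 15_000_000  # median when asked as a 1-in-X chance

print(f"{one_in_x_framing:.8%}")              # ~0.00000667%
print(round(pct_framing / one_in_x_framing))  # 750000
```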
Since there seems to be some doubt as to whether the way a question is worded or presented can actually bias the responses that much and whether this is really such a big deal — let there be no more doubt!
Edit #4 on Thursday, November 20, 2025 at 3:00 PM Eastern.
There is an additional concern — which is entirely separate and distinct from anything mentioned in the post above or any of the edits — with the intersubjective resolution/metaprediction framing of the question. See the volcano analogy in my comment here. (See also my Singularity example in my subsequent comment here.) I may be mistaken, but I don’t see how we can derive what the respondents’ probability for each scenario is from a question that doesn’t ask anything directly about probability.
Respondents are asked to predict "what percent of LEAP panelists will choose" each scenario at the end of 2030 (the question never asks directly for a probability). This implies that if they think there's, say, a 51% chance that 30% of LEAP panelists will choose the slow scenario, they should respond to the question by saying 30% will choose the slow scenario. If they think there's a 99% chance that 30% of LEAP panelists will choose the slow scenario, they should also respond by saying 30% will choose the slow scenario. In either case, the number in their answer is exactly the same, despite a 48-point difference in the probability they assign to this outcome. The report says that 30% is the probability respondents assign to the slow scenario, but it's not clear that the respondents' probability is 30%.
The Forecasting Research Institute only asks for the predicted "vote share" for each scenario and not the estimated probabilities behind those vote share predictions. It doesn't seem possible to derive the respondents' probability estimates from the vote share predictions alone. By analogy, if FiveThirtyEight's 2020 election forecast had predicted that Joe Biden would win a 55% share of the national vote, this wouldn't tell you what probability the model assigned to Biden winning the election (whether it's, say, 70%, 80%, or 90%). The model's probability is certainly not 55%. To know the model's probability, or even guess at it, you would need information other than just the predicted vote share.
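To make the analogy concrete, here is a minimal sketch with made-up numbers (nothing here comes from FiveThirtyEight's actual model): two forecasts with the same predicted vote share can imply very different win probabilities, depending only on how uncertain each forecast is about that share.

```python
import random

def win_probability(mean_share, sd, n=100_000, seed=0):
    """Monte Carlo: chance the realized vote share exceeds 50%, assuming a
    (hypothetical) normally distributed forecast centred on mean_share."""
    rng = random.Random(seed)
    wins = sum(rng.gauss(mean_share, sd) > 0.50 for _ in range(n))
    return wins / n

# Two made-up models, both predicting a 55% vote share, differing only in
# how uncertain they are about that share.
print(win_probability(0.55, 0.02))  # tight forecast -> win probability ~0.99
print(win_probability(0.55, 0.08))  # wide forecast  -> win probability ~0.73
```

The point prediction is identical in both cases; the win probability depends on the uncertainty around it, which the vote-share number alone doesn't convey.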

(An author of the report here.) Thanks for engaging with this question and providing your feedback! I'll provide a few of my thoughts. But, I will first note that EA forum posts by individuals affiliated with FRI do not constitute official positions.
I do think the following qualification we provided to forecasters (also noted by Benjamin) is important: Reasonable people may disagree with our characterization of what constitutes slow, moderate, or rapid AI progress. Or they may expect to see slow progress observed with some AI capabilities and moderate or fast progress in others. Nevertheless, we ask you to select which scenario, in sum, you feel best represents your views.
I would also agree with Benjamin that "best matching" covers scenarios with slower and faster progress than the slow and fast progress scenarios, respectively. And I believe our panel is sophisticated and capable of understanding this feature. Additionally, this question was not designed to capture the extreme possibilities for AI progress, and I personally wouldn't use it to inform my views on these extreme possibilities (I think the mid-probability space is interesting and underexplored, and we want LEAP to fill this gap). That said, you are correct that we ought to include the "best matching" qualification when we present these results, and I've added this to our paper revision to-do list. Thanks for pointing that out.
I think other questions in the survey do a better job of covering the full range of possibilities, both in scenarios questions (i.e., TRS) and our more traditional, easily resolvable forecasting questions. The latter group comprises the vast majority of our surveys. I think it's impossible to write a single forecasting question that satisfies any reasonable and comprehensive set of desiderata, so I'd view LEAP as a portfolio of questions.
On edit #2, I would first note that it is challenging to write a set of scenarios for AI progress without an explosion of scenarios (and an associated increase in survey burden, which would itself degrade response quality); we face a tradeoff between parsimony and completeness. This specific question in the first survey is uniquely focused on parsimony, and we attempted to include questions that take other stances on that tradeoff. However, we'd love to hear any suggestions you have for writing these types of questions, as we could certainly improve on this front. I think you've identified many of the shortcomings in this specific question already. Second, I would defend our choice to present as probabilities (but we should add the "best matching" qualifier). We're making an appeal to intersubjective resolution. Witkowski et al. (2017) is one example, and some people at FRI have done similar work (Karger et al. 2021). These two metrics rely on wisdom-of-the-crowd effects. Again, however, I don't think it's clear that we're making this appeal, so I've added a note to clarify this in the paper. We use a resolution criterion (metaprediction) that some find unintuitive, but it allows us to incentivize this question. But, others might argue that incentives are less important.
While I think framing effects obviously matter in surveys, I do think that your edit #3 is conflating an elicitation/measurement/instrumentation issue in low-probability forecasting with the broader phenomenon of framing, which I view as being primarily but not exclusively about question phrasing. We're including tests on framing and the elicitation environment in LEAP itself to make sure our results aren't too sensitive to any framing effects, and we'll be sharing more on those in the future. I'd love to hear any ideas for experiments we should run there.
In sum, I largely defend the choices we made in writing this question. LEAP includes many different types of questions, because consumers of the research will have different views of the types of questions they will find informative and convincing. I will note that even within FRI some people personally find the scenarios questions much less compelling than the other questions in the survey. Nevertheless, I think you identified issues with our framing of the results, and we will make some changes. I appreciate you laying out your criticisms of the paper clearly so that we can dig into them, and I'd welcome any additional feedback!
Thank you very much for your reply. I especially want to give you my profound appreciation for being willing to revise how your results are described in the report. (I hope you will make the same revision in public communications as well, such as blog posts or posts on this forum.) A few responses which I tried to keep as succinct as possible, but failed to keep succinct:
Thanks again for a helpful, cooperative, and open reply.
Thanks for following up!
I am using 'extreme' in a very narrow sense, meaning anything above or below the scale provided for this specific question, rather than in any normative sense or as a statement about probabilities. I think people interpret this word differently. I additionally think we have some questions that represent a broader swath of possible outcomes (e.g., TRS), taking a different position on the parsimony and completeness frontier. I suspect we have different goals in mind for this question.
I think others would argue that the slow progress scenario is barely an improvement over current capabilities. Given the disagreement people have over current capabilities, this disagreement on how much progress a certain scenario represents will always exist. We notably had some people who take the opposite stance you do, that the slow progress scenario has already been achieved.
I would maintain that we can express these results as the probability that reality best matches a certain scenario, hence the needed addition of the “best matches” qualifier. So, I’m not following your points here, apologies.
And for what it’s worth, I think the view that tasks = occupations is reasonably disputed. Again, I still grant the point that framing matters, and absolutely could be at play here. In fact, I’d argue that it’s always at play everywhere, and we can and should do our best to limit its influence.
This is a really great exchange, and thank you for responding to the post.
I just wanted to leave a quick comment to say: It seems crazy to me that someone would say the "slow" scenario has "already been achieved"!
Unless I'm missing something, the "slow" scenario says that half of all freelance software engineering jobs taking <8 hours can be fully automated, that any task a competent human assistant can do in <1 hour can be fully automated with no drop in quality (what if I ask my human assistant to solve some ARC-2 problems for me?), that the majority of customer complaints in a typical business will be fully resolved by AI in those businesses that use it, and that AI will be capable of writing hit songs (at least if humans aren't made aware that it is AI-generated)?
I suppose the scenario is framed only to say that AI is capable of all of the above, rather than that it is being used like this in practice. That still seems like an incorrect summary of current capability to me, but is slightly more understandable. But in that case, it seems the scenario should have just been framed that way: "Slow progress: No significant improvement in AI capabilities from 2025, though possibly a significant increase in adoption". There could then be a separate question on what people think about the level that current capabilities are at?
Otherwise disagreements about current capabilities and progress are getting blurred in the single question. Describing the "slow" scenario as "slow" and putting it at the extreme end of the spectrum is inevitably priming people to think about current capabilities in a certain way. Still struggling to understand the point of view that says this is an acceptable way to frame this question.
Thanks for the thoughts! The question is indeed framed as being about capabilities and not adoption, and this is absolutely central.
Second, people have a wide range of views on any given topic, and surveys reflect this distribution. I think this is a feature, not a bug. Additionally, if you take any noisy measurement (which all surveys are), reading too much into the tails can lead one astray (I don't think that's happening in this specific instance, but I want to guard against the view that the existence of noise implies the nonexistence of signal). Nevertheless, I do appreciate the careful read.
Your comments here are part of why I think the third disclaimer we added, which allows for jagged capabilities, is important. Additionally, we don't require that all capabilities are achieved, hence the "best matching" qualifier, rather than looking at the minimum across the capabilities space.
We indeed developed/tested versions of this question which included a section on current capabilities. Survey burden is another source of noise/bias in surveys, so such modifications are not costless. I absolutely agree that current views of progress will impact responses to this question.
I'll reiterate that LEAP is a portfolio of questions, and I think we have other questions where disagreement about current capabilities is less of an issue because the target is much less dependent on subjective assessment, though those questions sacrifice some completeness as pictures of AI capabilities. Lastly, any expectation of the future necessarily includes some model of the present.
Always happy to hear suggestions for a new question or revised version of this question!
Thanks for replying again. This is helpful. (I am strongly upvoting your comments because I'm grateful for your contribution to the conversation and I think you deserve to have that little plant icon next to your name go away.)
Apologies for the word count of this comment. I'm really struggling to compress what I'm trying to say to something shorter.
On "extreme": Thank you for clarifying that non-standard/technical use of the word "extreme". I was confused because I just interpreted it in the typical, colloquial way.
On the content of the three scenarios: I have a hard time understanding how someone could say the slow progress scenario has already been achieved (or that it represents barely an improvement over existing capabilities), but the more I have these kinds of discussions, the more I realize people interpret exactly the same descriptions of hypothetical future AI systems in wildly different ways.
This seems like a problem for forecasting surveys — different respondents may mean completely different things yet, on paper, their responses are exactly the same. (I don't fault you or your co-authors for this, though, because you didn't create this problem and I don't think that I could do any better at writing unambiguous scenarios.)
But, more importantly, it's also a problem that goes far beyond the scope of just forecasting surveys. It's a problem for the whole community of people who want to have discussions about AI progress, which we have a shared responsibility to address. I am not sure quite what to do yet, but I've been thinking about it a bit over the last few weeks.[1]
On intersubjective resolution/metaprediction: My confusion about the intersubjective resolution or metaprediction for the three scenarios question is that I don't know how respondents are supposed to express their probability of a scenario being best matching versus how ambiguous or unambiguous they think the resolution of the prediction will be. If I think there's a 51% chance that before the end of 2030 the Singularity will happen, in which case the prediction would resolve completely unambiguously for the rapid progress scenario, what should my response to the survey be?
Should I predict 100% of respondents will agree, retrospectively, that the rapid progress scenario is the best matching one, since that is what will happen in the scenario I think is 51% probable? Or should I predict 51% of respondents will pick the rapid progress scenario, even though that's not what the question is literally asking, because 51% is my probability? (Let's say for simplicity I think there's a 51% chance of an unambiguous Singularity of the sort described by futurists like Ray Kurzweil or Vernor Vinge before December 2030 and a 49% chance AI will make no meaningful progress between now and December 2030. And nothing in between.)
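To make the two candidate answers concrete, here is a toy calculation under the two-outcome model just described. Everything in it is my own illustrative assumption (in particular, that a respondent would report the expected panel vote share); the survey doesn't instruct respondents to answer this way.

```python
# Toy model from the paragraph above (my framing, not the survey's):
# a 51% chance of an unambiguous Singularity, in which case ~100% of panelists
# pick "rapid"; a 49% chance of no meaningful progress, in which case ~0% do.
p_singularity = 0.51
share_if_singularity = 1.00   # assumed panel vote share for "rapid"
share_if_no_progress = 0.00   # assumed panel vote share for "rapid"

expected_share = (p_singularity * share_if_singularity
                  + (1 - p_singularity) * share_if_no_progress)
print(expected_share)  # 0.51
```

In this special case the expected vote share happens to equal my probability, but only because the two hypothesized outcomes sit at 100% and 0% panel agreement; with less extreme outcomes the two numbers would come apart.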
It's possible I just have no idea how intersubjective resolution/metaprediction is supposed to work, but then, was this explained to the respondents? Can you count on them understanding how it works?
On "tasks" vs. "occupations": I agree that, once you think about it, you can understand why people would think automating all "tasks" and automating all "occupations" wouldn't mean the same thing. However, this is not obvious (at least, not to everyone) in advance of asking two variants of the question and noticing the difference in the responses. The reasoning is that, logically, an occupation is just a set of tasks, so an AI that can do all tasks can also do all occupations. The authors of the AI Impacts survey were themselves surprised by the framing effect here. On page 7 of their pre-print about the survey, they say (emphasis added by me):
The broader problem with Benjamin Tereick's reply is that he seems to be saying (if I'm understanding correctly) that you can conclude there is no significant framing effect just by looking at the responses to one variant of one question. But if the AI Impacts survey had only asked about HLMI and not FAOL, and just assumed the two were logically equivalent and equivalent in the eyes of respondents, how would they know, just from that information, whether the HLMI question was susceptible to a significant framing effect? They wouldn't.
I don't see how someone could argue that the authors of the AI Impacts survey would be able to infer from the results of just the HLMI question, without comparing it to anything else, whether or not the framing of the question introduced significant bias. They wouldn't know. You have to run the experiment to know — that's the whole point. Benjamin's argument, which I may just be misunderstanding, seems analogous to the argument that a clinical trial of a drug doesn't need a control group because you can tell how effective the drug is just from the experimental group. (Benjamin, what am I missing here?)
That's why I brought up the AI Impacts survey example and the 2023 Forecasting Research Institute survey example: to drive home the point that framing effects/question wording bias/anchoring effects can be extremely significant, and we don't necessarily know that until we run two versions of the same question. So, I'm glad that you at least agree with the general point that this is an important topic to consider.
I think, unfortunately, it's not a problem that's easily or quickly resolved, but will most likely involve a lot of reading and writing to get everyone on the same page about some core concepts. I've tried to do a little bit of this work already in posts like this one, but that's just a tiny step in the right direction. Concepts like data efficiency, generalization, continual learning, and fluid intelligence are helpful and much under-discussed. Open technical challenges like learning efficiently from video data (a topic the AI researcher Yann LeCun has talked a lot about) and complex, long-term hierarchical planning (a longstanding problem in reinforcement learning) are also helpful for understanding what the disagreements are about and are also much under-discussed.
One of the distinctions that seems to be causing trouble is understanding intelligence as the ability to complete tasks vs. intelligence as the ability to learn to complete tasks.
Another problem is people interpreting (sometimes despite instructions or despite what's stipulated in the scenario) an AI system's ability to complete a task in a minimal, technical sense vs. in a robust, meaningful sense, e.g., an LLM writing a terrible, incoherent novel that nobody reads or likes vs. a good, commercially successful, critically well-received novel (or a novel at that quality level).
A third problem is (again, sometimes despite warnings or qualifications that were meant to forestall this) around reliability: the distinction between an AI system being able to successfully complete a task sometimes, e.g., 50% or 80% or 95% of the time, vs. being able to successfully complete it at the same rate as humans, e.g. 99.9% or 99.999% of the time.
I suspect, but don't know, that another interpretive difficulty for scenarios like the ones in your survey is around people filling in the gaps (or not). If we say in a scenario that an AI system can do the five things we describe, like make a good song, write a good novel, load a dishwasher, and so on, some people will interpret that to mean the AI system can only do those five things. Other people will interpret those tasks as just representative of the overall set of tasks the AI system can do, such that there are a hundred or a thousand or a million other things it can do, and these are just a few examples.
A little discouragingly, similar problems have persisted in discussions around philosophy of mind, cognitive science, and AI for decades — for example, in debates around the Turing test — despite the masterful interventions of brilliant writers who have tried to clear up the ambiguity and confusion (e.g. the philosopher Daniel Dennett's wonderful essay on the Turing test "Can machines think?" in the anthology Brainchildren).
This post is getting some significant downvotes. I would be interested if someone who has downvoted could explain the reason for that.
There's plenty of room for disagreement on how serious a mistake this is, whether it has introduced a 'framing' bias into other results or not, and what it means for the report as a whole. But it just seems straightforwardly true that this particular question is phrased extremely poorly (it seems disingenuous to suggest that the question using the phrasing "best matching" covers you for not even attempting to include the full range of possibilities in your list).
I assume that people downvoting are objecting to the way that this post is using this mistake to call the entire report into question, with language like "major flaw". They may have a point there. But I think you should have a very high bar for downvoting someone who is politely highlighting a legitimate mistake in a piece of research.
'Disagree' react to the 'major flaw' language if you like, and certainly comment your disagreements, but silently downvoting someone for finding a legitimate methodological problem in some EA research seems like bad EA forum behaviour to me!
Yes, please do not downvote Yarrow's post just because its style is a bit abrasive and it goes against EA consensus. She has changed my mind quite a lot, as the person who kicked off the dispute, and Connacher, who worked on the survey, is clearly taking her criticisms seriously.
God bless!
Thank you very much. I really appreciate your helpful and cooperative approach.
The "best matching" wording of the question doesn’t, in my view, change the underlying problem of presenting these as the only three options.
It’s also a problem, in my view, that the "best matching" wording is dropped on page 38 and the report simply talks about the probability respondents assign to the scenario. I looked at the report in the first place because a Forecasting Research Institute employee just said (on the EA Forum) what the probability assigned to a scenario was, and didn’t mention the "best matching" wording (or the three-scenario framing). If you include "best matching" in the question and then drop it when you present the results, what was the point of saying "best matching" in the first place?
[Edited on 2025-11-14 at 6:32 PM Eastern to add: The Forecasting Research Institute also presented the results as experts' probabilities for these scenarios in a post on the EA Forum. See Edit #2 added to the post above.]
I didn’t intend for the post to come across as more than a criticism of this specific question in the survey — I said that the report contains many questions and said "I've only looked at the report briefly and there is a lot that could be examined and discussed". I meant the title literally and factually. This is a major flaw that I came across in the report.
I would be happy to change the title of the post or change the wording of the post if someone can suggest a better alternative.
If people have qualms with either the tone or the substance of the post, I’d certainly like to hear them. So, I encourage people to comment.
I’m confused by this, for a few reasons:
Aside from these methodological points, I'm also surprised that you believe that the slow scenario constitutes AI that is "either nearly AGI or AGI outright". Out of curiosity, which capability mentioned in the "slow" scenario do you think is the most implausible by 2030? To me, most of these seem pretty close to what we already have in 2025.
[disclaimer: I recommended a major grant to FRI this year, and I’ve discussed LEAP with them several times]
Thanks for your comment. My response:
There are several capabilities mentioned in the slow progress scenario that seem indicative of AGI or something close, such as the ability of AI systems to largely automate various kinds of labour (e.g. research assistant, software engineer, customer service, novelist, musician, personal assistant, travel agent) and "produce novel and feasible solutions to difficult problems", albeit "rarely". The wording is ambiguous as to whether "genuine scientific breakthroughs" will sometimes, if only rarely, be made by AI, in the same manner they would be made by a human scientist leading a research project, or whether AI will only assist in such breakthroughs by automating "basic research tasks".
The more I discuss these kinds of questions or scenarios, the more I realize how differently people interpret them. It’s a difficulty both for forecasting questions and for discussions of AI progress more broadly, since people tend to imagine quite different things based on the same description of a hypothetical future AI system, and it’s not always immediately clear when people are talking past each other.
Thanks for the replies!
I think it would also be fair to include the disclaimer in the question I quoted above.
I would read the scenario as AI being able to do some of the tasks required by these jobs, but not to fully replace humans doing them, which I would think is the defining characteristic of slow AI progress scenarios.
Thank you for your follow-up.
1. How much it matters depends on how people interpret the report and how they use it as evidence or in argumentation. I wrote this post because a Forecasting Research Institute employee told me the percentage is the probability assigned to each scenario and that I, personally, should adjust my probability of AGI by 2030 as a result. People should not do this.[1]
3. You can guess that it didn’t, I can guess that it did, but the point is these survey questions should be well-framed in the first place. We shouldn’t have to guess how much a methodological problem impacted the results.
Footnote added on 2025-11-14 at 6:32 PM Eastern:
The Forecasting Research Institute also presented the results as experts' probabilities for these scenarios in a post on the EA Forum. See Edit #2 added to the post above.
1. Are you referring to your exchange with David Mathers here?
3. I'm not sure what you're saying here. Just to clarify what my point is: you're arguing in the post that the slow scenario actually describes big improvements in AI capabilities. My counterpoint is that this scenario is not given a lot of weight by the respondents, suggesting that they mostly don't agree with you on this.
You are guessing you know how the framing affected the results, which is your right, and it is my right to guess something different, but the whole point of running surveys is not to guess but to know. If we wanted to rely on guesses, we could have saved the Forecasting Research Institute the trouble of running the whole survey in the first place!
I don't think this is an accurate summary of the disagreement, but I've tried to clarify my point twice already, so I'm going to leave it at that.
I don't mind if you don't respond — it's fair to leave a discussion whenever you like — but I want to try to make my point clear for you and for anyone else who might read this post.
How do you reckon the responses to the survey question would be different if there were a significant question wording effect biasing the results? My position is: I simply don't know and can't say how the results would be different if the question were phrased in such a way as to better avoid a question wording effect. The reason to run surveys is to learn that and be surprised. If the question were worded and framed differently, maybe the results would be very different, maybe they would be a little different, maybe they would be exactly the same. I don't know. Do you know? Do you actually know for sure? Or are you just guessing?
What if we consider the alternative? Let's say the response was something like, I don't know, 95% in favour of the slow progress scenario, 4% for the moderate scenario, and 1% for the rapid. Just to imagine something for the sake of illustration. Then you could also argue against a potential question wording effect biasing the results by appealing to the response data. You could say: well, clearly the respondents saw past the framing of the question and managed to accurately report their views anyway.
This should be troubling. If a high percentage can be used to argue against a question wording effect and a low percentage can be used to argue against a question wording effect, then no matter what the results are, you can argue that you don't need to worry about a potential methodological problem because the results show they're not a big deal. If any results can be used to argue against a methodological problem, then surely no results should be used to argue against a methodological problem. Does that make sense?
I don't think I'm saying anything novel here; these are common concerns about how surveys are designed and worded. In general, you can't know whether a response was biased or not just by looking at the data and not the methodology.
For reference, here are the results on page 141 of the report:
Yeah, the error here was mine, sorry. I didn't actually work on the survey, and I missed that it was actually estimating the percentage of the panel agreeing we are in a scenario, not the chance that that scenario will win a plurality of the panel. This is my fault, not Connacher's. I was not one of the survey designers, so please do not assume from this that the people at FRI who designed the survey didn't understand their own questions or anything like that.
For what it's worth, I think this is decent evidence that the question is too confusing to be useful, given that I mischaracterized it even though I was one of the forecasters. So I largely, although not entirely, withdraw the claim that you should update on the survey results. (That is, I think it still constitutes suggestive evidence that you are way out of line with experts and superforecasters, but no longer super-strong evidence.)
I also somewhat withdraw the claim that we should take even well-designed expert surveys as strong evidence of the actual distribution of opinions. I had forgotten the magnitude of the framing effect that titotal found for the human extinction questions. That really does somewhat call the reliability of even a decently designed survey into question. That said, I don't really see a better way to get at what experts think than surveys, and I doubt they have zero value. But people should probably test multiple framings more. Nonetheless, "there could be a big framing effect because it asks for a %", i.e. the titotal criticism, could apply to literally any survey, and I'm a bit skeptical of "surveys are a zero-value method of getting at expert opinion".
So I think I concede that you were right not to be massively moved by the survey, and I was wrong to say you should be. That said, maybe I'm wrong, but I seem to recall that you frequently imply that EA opinion on the plausibility of AGI by 2032 is way out of step with what "real experts" think. If your actual opinion is that no one has ever done a well-designed survey, then you should probably stop saying that. Or cite a survey you think is well-designed that actually shows other people are more out of step with expert opinion than you are, or say that EAs are out of step with expert opinion in your best guess, but you can't really claim with any confidence that you are any more in line with it. My personal guess is that your probabilities are in fact several orders of magnitude away from the "real" median of experts and superforecasters, if we could somehow control for framing effects, but I admit I can't prove this.
But I will say that, if taken at face value, the survey still shows a big gap between what experts think and your "under 1 in 100,000 chance of AGI by 2032". (That is, you didn't complain when I attributed that probability to you in the earlier thread, and I don't see any other way to interpret "more likely that JFK is secretly still alive" given you insisted you meant it literally.) Obviously, if someone thinks the most likely outcome is that in 2030 approx. 1 in 4 people on the panel will think we are already in the rapid scenario, they probably don't think the chance of AGI by 2032 is under 1 in 100,000, since they are basically predicting that we're going to be near the upper end of the moderate scenario, which makes it hard to give a chance of AGI by two years after 2030 that low. (I suppose they could just have a low opinion of the panel, and think some of the members will be total idiots, but I consider that unlikely.) I'd also say that if forecasters made the mistake I did in interpreting the question, then again, they are clearly out of step with the probability you give. I'm also still prepared to defend the survey against some of your other criticisms.
I really appreciate that, but the report itself made the same mistake!
Here is what the report says on page 38:
And the same mistake is repeated again on the Forecasting Research Institute's Substack in a post which is cross-posted on the EA Forum:
There are two distinct issues here: 1) the "best matching" qualifier, which contradicts these unqualified statements about probability and 2) the intersubjective resolution/metaprediction framing of the question, which I still find confusing but I'm waiting to see if I can ultimately wrap my head around. (See my comment here.)
I give huge credit to Connacher Murphy for acknowledging that the probability should not be stated without the qualifier that this is only the respondents' "best matching" scenario, and for promising to revise the report with that qualifier added. Kudos, a million times kudos. My gratitude and relief is immense. (I hope that the Forecasting Research Institute will also update the wording of the Substack post and the EA Forum post to clarify this.)
Conversely, it bothers me that Benjamin Tereick said that it's only "slightly inaccurate" and not "a big issue" to present this survey response as the experts' unqualified probabilities. Benjamin doesn't work for the Forecasting Research Institute, so his statements don't affect your organization's reputation in my books, but I find that frustrating. In case it's in doubt: making mistakes is absolutely fine and no problem. (Lord knows I make mistakes!) Acknowledging mistakes increases your credibility. (I think a lot of people have this backwards. I guess blame the culture we live in for that.)
You're right!
It would be very expensive and maybe just not feasible, but in my opinion the most interesting and valuable data could be obtained from long-form, open-ended, semi-unstructured, qualitative research interviews.
Here's why I say that. You know what the amazing part of this report is? The rationale examples! Specifically, the ones on pages 142 and 143. We only get morsels, but, for me, this is the main attraction, not an afterthought.
For example, we get this rationale example in support of the moderate progress scenario:
Therein lies the crux! Personally, I strongly believe the METR time horizon graph is not evidence of anything significant with regard to AGI, so it's good to know where the disagreement lies.
Or this other rationale example for the moderate progress scenario:
This is unfathomable to me! Toby Crisford expressed similar incredulity about someone saying the same for the slow scenario, and I agree with him.
Either this respondent is just interpreting the scenario description completely differently than I am (this does happen kind of frequently), or, if they're interpreting it the same way I am, the respondent is expressing a view that not only do I just not believe, I have a hard time fathoming how anyone could believe it.
So, it turns out it's way more interesting to find out why people disagree than to find out that they disagree.
Listen, I'd read surveys all day if I could. The issue is just — using this FRI survey and the AI Impacts survey as the two good examples we have — it turns out survey design is thornier and more complicated than anyone realized going in.
Thank you! I appreciate it!
Before this FRI survey, the only expert survey we had on AGI timelines was the AI Impacts survey. I found the 69-year framing effect in that survey strange. Here's what I said about it in a comment 3 weeks ago:
After digging into this FRI survey — and seeing titotal's brilliant comment about the 750,000x anchoring effect for the older survey — now I'm really questioning whether I should keep citing the AI Impacts survey at all. Even though I recognized the framing effect in the AI Impacts survey was very strange, I didn't realize the full extent to which the results of surveys could be artefacts of survey design. I thought of that framing effect as a very strange quirk, and a big question mark, but now it seems like the problem is bigger and more fundamental than I realized. (Independently, I've recently been realizing just how differently people imagine AGI or other hypothetical future AI systems, even when care is taken to give a precise definition or paint a picture with a scenario like in the FRI survey.)
There's also the AAAI survey where 76% of experts said it's unlikely or very unlikely that current AI approaches (including LLMs) would scale to AGI (see page 66). That doesn't ask anything about timing. But now I'm questioning even that result, since who knows if "AGI" is well-defined or what the respondents mean by that term?
I think you might be mixing up two different things. I have strong words about people in EA who aren't aware that many experts (most, if you believe the AAAI survey) think LLMs won't scale to AGI, who don't consider this to be a perspective worth seriously discussing, or who have never heard of or considered this perspective before. In that sense, I think the average/typical/median EA opinion is way out of step with what AI experts think.
When it comes to anything involving timelines (specifically, expert opinion vs. EA opinion), I don't have particularly strong feelings about that, and my words have been much less strong. This is what I said in a recent post, which you commented on:
I'm not even sure whether or not I'm an outlier compared to AI experts. I was just saying if you take the surveys at face value — which now I'm especially questioning — my view looks a lot less like an outlier than if you used the EA Forum as the reference class.
The median might obscure the deeper meaning we want to get to. The views of experts might be quite diverse. For example, there might be 10% of experts who put significantly less than a 1% probability on AGI within 10 years and 10% of experts who put more than a 90% probability on it. I'm not really worried about being off from the median. I would be worried if my view, or something close to it, wasn't shared by a significant percentage of experts — and then I would really have to think carefully.[1]
For example, Yann LeCun, one of the foremost AI researchers in the world, has said it's impossible that LLMs will scale to AGI — not improbable, not unlikely, impossible. Richard Sutton, another legendary AI researcher, has said LLMs are a dead end. This sounds like a ~0% probability, so I don't mind if I agree with them and have a ~0% probability as well.
Surely you wouldn't argue that all the experts should circularly update until all their probabilities are the same, right? That sounds like complete madness. People believe what they believe for reasons, and should be convinced on the grounds of reasons. This is the right way to do it. This is the Enlightenment’s world! We’re just living in it!
Maybe something we're clashing over here is the difference between LessWrong-style/Yudkowskyian Bayesianism and the traditional, mainstream scientific mindset. LessWrong-style/Yudkowskyian Bayesianism emphasizes personal, subjective guesses about things — a lot. It also emphasizes updating quickly, even making snap judgments.
The traditional, mainstream scientific mindset emphasizes cautious vetting of evidence before making updates. It prefers to digest information slowly, thoughtfully. There is an emphasis on not making claims or stating numbers that one cannot justify, and on not relying on subjective guesses or probabilities.
I am a hardcore proponent of the traditional, mainstream scientific mindset. I do not consider LessWrong-style/Yudkowskyian Bayesianism to be of any particular merit or value. It seems to make people join cults more often than it leads them to great scientific achievement. Indeed, Yudkowsky's/LessWrong's epistemology has been criticized as anti-scientific, a criticism I'm inclined to agree with.
This is slightly incorrect, but it's almost not important enough to be worth correcting. I said there was a less than 1 in 10,000 chance of the rapid progress scenario in the survey, without the adoption vs. capabilities caveat (which I wasn't given at the time) — which I take to imply a much higher bar than "merely" AGI. I also made this exact same correction once before, in a comment you replied to. But it really doesn't matter.
On reflection, I probably would say the probability of the rapid progress scenario (without the adoption vs. capabilities caveat) — a widely deployed, very powerful superhuman AGI or superintelligence by December 2030 — has less than a 1 in 100,000 chance of coming to pass, though. Here's my reasoning. The average chance of getting struck by lightning over 5 years is, as I understand it, about 1 in 250,000. I would be more surprised if the rapid scenario (without the adoption vs. capabilities caveat) came true than if I got struck by lightning within that same timeframe. Intuitively, me getting struck by lightning seems more likely than the rapid scenario (without the adoption vs. capabilities caveat). So, it seems like I should say the chances are less than 1 in 250,000.
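As a sanity check on that figure, here is the back-of-the-envelope arithmetic, assuming the commonly cited U.S. annual lightning-strike odds of roughly 1 in 1.2 million (the annual figure is my assumption, not something from the report):

```python
# Rough check of the "about 1 in 250,000 over 5 years" figure, assuming
# annual lightning-strike odds of roughly 1 in 1.2 million (an assumption).
annual_probability = 1 / 1_200_000
five_year_probability = 5 * annual_probability  # fine to add, since p is tiny
print(round(1 / five_year_probability))         # 240000, i.e. roughly 1 in 250,000
```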
That said, I'm not a forecaster, and I'm not sure how to calibrate subjective, intuitive probabilities significantly below 1% for unprecedented eschatological/world-historical/natural-historical events of cosmic significance that may involve the creation of new science that's currently unknown to anyone in the world, and that can't be predicted mechanically or statistically. What does the forecasting literature say about this?
AGI is a lower bar than the rapid scenario (in my interpretation), especially without the caveat, and the exact, nitpicky definition of AGI could easily make the probability go up or down a lot. For now, I'm sticking with significantly less than a 1 in 1,000 chance of AGI before the end of 2032 on my definition of AGI.[2] If I thought about it more, I might say the chance is less than 1 in 10,000, or less than 1 in 100,000, or less than 1 in 250,000, but I'd have to think about it more, and I haven't yet.
What probability of AGI before the end of 2032, on my definition of AGI (see the footnote), would you give?
Let me know if there’s anything in your comment I didn’t respond to that you’d like an answer about.
If you happen to be curious, I wrote a post trying to integrate two truths: 1) if no one ever challenged the expert consensus or social consensus, the world would suck and 2) the large majority of people who challenge the expert consensus or social consensus are not able to improve upon it.
From a post a month ago:
Reliability and data efficiency are key concepts here. So are fluid intelligence, generalization, continual learning, learning efficiently from video/visual information, and hierarchical planning. In my definition of the term, these are table stakes for AGI.
It would be an interesting exercise to survey AI researchers and ask them to assign a probability for each of these problems being solved to human-level by some date, such as 2030, 2040, 2050, 2100, etc. Then, after all that, ask them to estimate the probability of all these problems being solved to human-level. This might be an inappropriately biasing framing. Or it might be a framing that actually gets the crux of the matter much better than other framings. I don't know.
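As a toy illustration of why eliciting the all-problems probability separately could matter, here is a sketch with entirely made-up per-problem probabilities and an (unrealistic) independence assumption; it only shows how individually moderate estimates can multiply down to a small conjunction, not what researchers would actually say.

```python
from math import prod

# Made-up per-problem probabilities of human-level solutions by some date;
# the problem names come from the list above, the numbers are invented.
per_problem_p = {
    "reliability": 0.7,
    "data efficiency": 0.6,
    "continual learning": 0.6,
    "learning from video": 0.5,
    "hierarchical planning": 0.5,
}

# Joint probability if (unrealistically) the problems were independent.
joint_if_independent = prod(per_problem_p.values())
print(round(joint_if_independent, 3))  # 0.063, far below any single estimate
```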
I don't make a prediction about what the median response would be, but I suspect the distribution of responses would include a significant percentage of researchers who think the probability of solving all these problems by 2030 or 2040 is quite low. Especially if they were primed to anchor their probabilities to various 1-in-X probability events rather than only expressing them as a percentage. (Running two versions of the question, one with percentages and one with 1-in-X probabilities, could also be interesting.)