Introduction
I have been writing posts critical of mainstream EA narratives about AI capabilities and timelines for many years now. Compared to the situation when I wrote my posts in 2018 or 2020, LLMs now dominate the discussion, and timelines have also shrunk enormously. The ‘mainstream view’ within EA now appears to be that human-level AI will arrive by 2030, perhaps even as early as 2027. This view has been articulated by 80,000 Hours, on the forum (though see this excellent piece arguing against short timelines), and in the highly engaging science fiction scenario of AI 2027. While my piece is directed generally against all such short-horizon views, I will focus on responding to relevant portions of the article ‘Preparing for the Intelligence Explosion’ by Will MacAskill and Fin Moorhouse.
Rates of Growth
The authors summarise their argument as follows:
Currently, total global research effort grows slowly, increasing at less than 5% per year. But total AI cognitive labour is growing more than 500x faster than total human cognitive labour, and this seems likely to remain true up to and beyond the point where the cognitive capabilities of AI surpasses all humans. So, once total AI cognitive labour starts to rival total human cognitive labour, the growth rate of overall cognitive labour will increase massively. That will drive faster technological progress.
MacAskill and Moorhouse argue that the combined inputs to AI, namely training compute, inference compute, and algorithmic efficiency, have been increasing at a rate of roughly 25x per year, compared to the number of human researchers, which grows by only about 4% per year; hence the claimed 500-fold faster rate of growth. This is an inapt comparison, because in this calculation the capabilities of ‘AI researchers’ are credited with their growing access to compute and other performance improvements, while no such adjustment is made for human researchers, who also gain access to more compute and other productivity enhancements each year.
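To make the arithmetic concrete, here is a rough back-of-the-envelope version of the comparison (a sketch of my reading of their calculation, using the ~25x and ~4% figures above; it is not the authors' own method or code):

```python
# Back-of-the-envelope version of the growth-rate comparison criticised above.
# The AI figure bundles training compute, inference compute, and algorithmic
# efficiency; the human figure counts researcher headcount only.

ai_annual_growth = 25.0       # inputs to AI grow ~25x per year (+2400%)
human_annual_growth = 1.04    # human researchers grow ~4% per year

ai_pct = (ai_annual_growth - 1) * 100
human_pct = (human_annual_growth - 1) * 100

print(f"AI inputs:         +{ai_pct:.0f}% per year")
print(f"Human researchers: +{human_pct:.0f}% per year")
print(f"Ratio of growth rates: ~{ai_pct / human_pct:.0f}x")  # ~600x, i.e. the '500x+' claim

# The asymmetry: the AI side is credited with compute and efficiency gains,
# while the human side is not, even though human researchers also benefit
# from better compute and tools each year.
```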
It is also highly unclear whether current rates of increase can reasonably be extrapolated. In particular, the components of the rate of increase of ‘AI researchers’ are not independent: if the rate of algorithmic improvement slows, it is highly likely that investment in training and inference compute will also slow. Furthermore, most new technologies improve very rapidly at first before progress slows markedly; the cost of genome sequencing is a good recent example. Such a slowdown may already be beginning. For example, after months of anticipation prior to its release in February, OpenAI recently announced that they will remove their new GPT-4.5 model from API access in July, apparently because of the high cost of running such a large model for only modest improvements in performance. The recent release of Llama 4 was also met with a mixed reception owing to disappointing performance and controversies about its development. For all these reasons, I do not believe the 500-fold greater rate of increase in ‘AI researchers’ compared to human researchers is particularly meaningful, nor do I believe it can be confidently extrapolated over the coming decade.
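As a toy illustration of the extrapolation worry (purely illustrative, not a model of any particular AI metric): a logistic improvement curve looks exponential early on, so fitting the early trend and extrapolating it forward wildly overshoots where the curve actually ends up.

```python
import numpy as np

# Toy s-curve: rapid early improvement that later saturates at a ceiling.
t = np.arange(0, 21)                              # years
ceiling = 1000.0                                  # eventual performance ceiling (arbitrary units)
s_curve = ceiling / (1 + np.exp(-(t - 10)))

# Fit a simple exponential to the first five years only, then extrapolate it.
early_growth = (s_curve[4] / s_curve[0]) ** (1 / 4)
extrapolated = s_curve[0] * early_growth ** t

for year in (5, 10, 15, 20):
    print(f"year {year:2d}: s-curve = {s_curve[year]:8.1f}, "
          f"naive extrapolation = {extrapolated[year]:12.1f}")
```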
The authors then argue that even in the absence of continued increases in compute, deployment of AI to improve AI research could lead to a ‘software feedback loop’, in which AI capabilities continue to improve because AI systems themselves carry out the research needed to improve them. MacAskill and Moorhouse defend this claim by quoting evidence that “empirical estimates of efficiency gains in various software domains suggest that doubling cognitive inputs (research effort) generally yields more than a doubling in software performance or efficiency.” Here they cite a paper which presents estimates of returns on research effort for four software domains: computer vision, sampling efficiency in reinforcement learning, SAT solvers, and linear programming. These are all substantially more narrowly defined than the very general capabilities required for improving AI researcher capability. Furthermore, the two machine learning tasks (computer vision and sampling efficiency in RL) covered timespans of only ten and four years respectively. Moreover, the paper in question is a methodological survey, and it highlights that all the presented estimates suffer from significant methodological shortcomings that are very difficult to overcome empirically. As such, this evidence is not a very convincing reason to think that doubling the number of ‘AI researchers’ working on improving AI will result in a self-sustaining software feedback loop for any significant period of time.
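To spell out why the “more than a doubling” figure matters, here is a simplified recurrence capturing the feedback-loop logic (my own toy rendering, not the model used in the cited paper): suppose effective research input scales with current software efficiency, and each doubling of input yields r doublings of efficiency. The loop is self-sustaining only if r > 1; if r < 1, each round of improvement is smaller than the last and total gains are bounded.

```python
# Toy sketch of the 'software feedback loop' condition (illustrative only).
# Once research input is assumed to scale with current efficiency, the per-step
# efficiency gain follows g_{n+1} = r * g_n: gains compound if r > 1, and form
# a bounded geometric series if r < 1.

def efficiency_trajectory(r, steps=12, initial_gain_doublings=1.0):
    """Return log2(efficiency) at each step under the toy recurrence."""
    log2_eff = [0.0]
    gain = initial_gain_doublings
    for _ in range(steps):
        log2_eff.append(log2_eff[-1] + gain)
        gain *= r                 # next step's gain scales with the returns parameter r
    return log2_eff

for r in (1.2, 0.8):
    final = efficiency_trajectory(r)[-1]
    print(f"r = {r}: efficiency after 12 steps = 2**{final:.1f}")
# r = 1.2 keeps accelerating; r = 0.8 asymptotes below 2**5 no matter how long it runs.
```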
The Limitations of Benchmarks
MacAskill and Moorhouse also argue that individual AI systems are becoming rapidly more capable at performing research-related tasks, and will soon reach parity with human researchers. Specifically, they claim that within the next five years there will be ‘models which surpass the research ability of even the smartest human researchers, in basically every important cognitive domain’. Given the centrality of this claim to their overall case, they devote surprisingly little space to substantiating it. Indeed, their justification consists entirely of appeals to rapid increases in the performance of LLMs on various benchmark tasks. They cite GPQA (multiple choice questions covering PhD-level science topics), RE-Bench (machine learning optimisation coding tasks), and SWE-Bench (real-world software tasks). They also mention that LLMs can now ‘answer questions fluently and with more general knowledge than any living person.’
Exactly why improved performance on these tasks should warrant the conclusion that models will soon surpass research ability on ‘basically every important cognitive domain’ is not explained. As a cognitive science researcher, I find this level of analysis incredibly simplistic. The authors don’t explain what they mean by ‘cognitive domain’ or how they arrive at their conclusions about the capabilities of current LLMs compared to humans. Wikipedia has a nice list of cognitive capabilities, types of thinking, and domains of thought, and it seems to me that current LLMs have minimal ability to perform most of these reliably. Of course, my subjective look at such a list isn’t very convincing evidence of anything. But neither is the unexamined and often unarticulated claim that performance on coding problems, math tasks, and science multiple choice questions is somehow predictive of performance across the entire scope of human cognition. I am continually surprised at the willingness of EAs to make sweeping claims about the cognitive capabilities of LLMs with little to no theoretical or empirical analysis of human cognition or LLMs, other than a selection of machine learning benchmarks.
Beyond these general concerns, in my earlier paper I documented several major limitations of the use of benchmarks for assessing the performance of LLMs. Here I summarise the main issues:
- Tests should only be used to evaluate the capabilities of a person or model if they have been validated as successfully generalising to tasks beyond the test itself. Extensive validation research of this kind is conducted in cognitive psychology for intelligence tests and other psychometric instruments, but much less work has been done for LLM benchmarks. The research that has been conducted often shows limited generalisation and significant overfitting of models to benchmarks.
- Adversarial testing and interpretability techniques have repeatedly found that LLMs perform poorly on many tasks when more difficult examples are used. Furthermore, models often do not follow appropriate reasoning steps, instead confabulating explanations that seem plausible but do not actually account for the solution they provide.
- LLMs often fail to generalise to versions of a task beyond those they were trained on, relying instead on superficial heuristics and pattern-matching rather than genuine understanding or reasoning.
- The training data for many LLMs is contaminated with questions and solutions from known benchmarks, as well as with synthetic data generated from such benchmarks (a minimal sketch of the kind of contamination check used to screen for this is given below). This is worsened by strong incentives for developers to fudge the training or evaluation process to achieve better benchmark results. Most recently, OpenAI has attracted criticism for its reporting of results on both the ARC-AGI and FrontierMath benchmarks.
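On the contamination point referenced above, overlap between benchmark items and training data is typically screened for with simple n-gram matching; a minimal sketch of that kind of check (illustrative only, and not how any particular lab implements it) looks like this:

```python
# Minimal n-gram overlap check between a benchmark item and a training document
# (illustrative only; real decontamination pipelines normalise text and use
# longer n-grams, fuzzy matching, and corpus-scale indexing).

def ngrams(text, n=8):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(benchmark_item, training_doc, n=8):
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(training_doc, n)) / len(item_grams)

# A high overlap fraction suggests the benchmark question (or its solution)
# appeared in the training data, so the benchmark score may partly reflect
# memorisation rather than capability.
```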
More recent results corroborate these points. One analysis of the performance of LLMs on a new, and hence previously unseen, mathematics task found that “all tested models struggled significantly: only GEMINI-2.5-PRO achieves a non-trivial score of 25%, while all other models achieve less than 5%. Through detailed analysis of reasoning traces, we identify the most common failure modes and find several unwanted artifacts arising from the optimization strategies employed during model training.” A separate analysis of similar data found that models regularly rely on pattern recognition and heuristic shortcuts rather than engaging in genuine mathematical reasoning.
Real-World Adoption
One final issue pertains to the speed with which LLMs can be adapted to perform real-world tasks. MacAskill and Moorhouse discuss at length the possibility of ‘AI researchers’ dramatically speeding up the process of scientific research. However, so far the only example of a machine learning system performing a significant scientific research task is AlphaFold, a system designed to predict the structure of protein molecules given their amino acid sequence. In addition to being eight years old, AlphaFold does not solve the problem of protein folding: it is simply a tool for predicting protein structure, and even in that narrow task it has many limitations. LLMs are increasingly utilised in cognitive science research, both as an object of study in their own right and as a useful tool for text processing and data validation. However, I am not aware of any examples of LLMs being applied to significantly accelerate any aspect of scientific research. Perhaps this will change rapidly within the next few years, but MacAskill and Moorhouse do not give any reasons for thinking so beyond generic appeals to increased performance on coding and multiple-choice benchmarks.
Other lines of evidence also indicate that the real-world impact of LLMs is so far modest. For instance, a large survey of workers in 11 exposed occupations in Denmark found effects of LLM adoption on earnings and hours worked of less than 1%. Similarly, a series of interviews with 19 policy analysts, academic researchers, and industry professionals who have used benchmarks to inform decisions about adopting or developing LLMs found that most respondents were sceptical of the relevance of benchmark performance to real-world tasks. As with past technologies, many factors, including reliability problems, supply chain bottlenecks, organisational inertia, user training, and the difficulty of adapting systems to specific use cases, mean that the real-world impacts of LLMs are likely to unfold over decades rather than a few years.
Conclusion
The coming few years will undoubtedly see continued progress and ongoing adoption of LLMs in various economic sectors. However, I find the case for 3-5 year timelines for the development of AGI to be unconvincing. These arguments are overly dependent on simple extrapolations of existing trends in benchmark performance, while paying insufficient attention to the known limitations of such benchmarks. Similarly, I find that such arguments often rely on extensive speculation grounded primarily in science fiction scenarios and thought experiments, rather than careful modelling, historical parallels, or detailed consideration of the similarities and differences between LLMs and human cognition.
Thanks for writing this James, I think you raise good points, and for the most part I think you're accurately representing what we say in the piece. Some scattered thoughts —
I agree that you can't blithely assume scaling trends will continue, or that scaling “laws” will continue to hold. But I don't think either assumption is entirely unmotivated, because the trends have already spanned many orders of magnitude. E.g. pretraining compute has spanned ≈ 8 OOMs of FLOP (a factor of roughly 1e8) since the introduction of the transformer, and scaling “laws” for given measures of performance hold up across almost as many OOMs, experimentally and practically. Presumably the trends have to break down eventually, but if all you knew were the trends so far, it seems reasonable to spread your guesses over many future OOMs.
Of course we know more than big-picture trends. As you say, GPT-4.5 seemed pretty disappointing, on benchmarks and just qualitatively. People at labs are making noises about pretraining scaling as we know it slowing down. I do think short timelines[1] look less likely than they did 6 months or a year ago. Progress is often made by a series of overlapping s-curves, and future gains could come from elsewhere, like inference scaling. But we don't give specific reasons to expect that in the piece.
On the idea of a “software-only intelligence explosion”, this piece by Daniel Eth and Tom Davidson gives more detailed considerations which are specific to ML R&D. It's not totally obvious to me why you should expect returns to diminish faster in broad vs narrow domains. In very narrow domains (e.g. checkers-playing programs), you might intuitively think that the ‘well’ of creative ways to improve performance is shallower, so further improvements beyond the most obvious are less useful. That said, I'm not strongly sold on the “software-only intelligence explosion” idea; I think it's not at all guaranteed (maybe I'm 40% on it?) but worth having on the radar.
I also agree that benchmarks don't (necessarily) generalise well to real-world tasks. In fact I think this is one of the cruxes between people who expect AI progress to quickly transform the world in various ways and people who don't. I do think the cognitive capabilities of frontier LLMs can meaningfully be described as “broad” in the sense that: if I tried to think of a long list of tasks and questions which span an intuitively broad range of “cognitive skills” before knowing about the skill distribution of LLMs,[2] then I would expect the best LLMs to perform well across the large majority of those tasks and questions.
But I agree there is a gap between impressive benchmark scores and real-world usefulness. Partly there's a kind of o-ring dynamic, where the speed of progress is gated by the slowest-moving critical processes: speeding up tasks that normally take 20% of a researcher's time by 1,000x won't speed the researcher up by more than 25% overall if the other tasks can't be substituted. Partly there are specific competences which AI models don't have, like reliably carrying out long-range tasks. Partly (and relatedly) there's an issue of access to implicit knowledge and other bespoke training data (like examples of expert practitioners carrying out long time-horizon tasks) which are scarce in the public internet training data. And so on.
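(For anyone who wants the 20% / 1,000x figure spelled out, it's a standard bottleneck calculation:)

```python
# Bottleneck arithmetic behind the 20% / 1,000x example above.
automatable_share = 0.20   # fraction of a researcher's time that gets sped up
speedup_factor = 1000      # speedup applied to that fraction

new_total_time = (1 - automatable_share) + automatable_share / speedup_factor
print(f"Overall speedup: {1 / new_total_time:.3f}x")   # ~1.25x, i.e. about 25% faster
# Even an infinite speedup on the 20% slice caps the overall gain at 1 / 0.8 = 1.25x.
```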
Note that (unless we badly misphrased something) we are not arguing for “AGI by 2030” in the piece you are discussing. As always, it depends on what “AGI” means, but if it means a world where tech progress is advancing across the board at 10x rates, or a time where there is basically no cognitive skill where any human still outperforms some AI, then I also think that's unlikely by 2030.
One thing I should emphasise, and maybe we should have emphasised more, is the possibility of faster (say 10x) growth rates from continued AI progress. I think the best accounts of why this could happen do not depend on AI becoming superhuman in some very comprehensive way, or on a software-only intelligence explosion happening. The main thing we were trying to argue is that sustained AI progress could significantly accelerate tech progress and growth, after the point where (if it ever happens) the contribution to research from AI reaches some kind of parity with humans. I do think this is plausible, though not overwhelmingly likely.
Thanks again for writing this.
[1] That is, “AGI by 2030” or some other fixed calendar date.
[2] Things like memorisation, reading comprehension, abstract reasoning, general knowledge, etc.