This is François Chollet's keynote talk at the AGI-24 conference in Seattle, Washington in August 2024. 

Chollet is an AI researcher who is perhaps best known for creating the deep learning library Keras. He worked on deep learning research and Keras development at Google for nine years before recently leaving to found his own AI startup. In September 2024, Time named him one of the 100 most influential people in AI.

In this talk, Chollet describes what he sees as the fundamental weaknesses of large language models (LLMs) and the flaws in the benchmarks commonly used to measure LLM performance, and argues that LLMs cannot scale to artificial general intelligence (AGI). He also argues that apparent progress by LLMs in some of their weak areas is the result of superficial, brittle fixes by human annotators, an approach that is labour-intensive and does not scale.

Chollet's opinion that LLMs won't scale to AGI appears to be the view of a majority of AI experts. A March 2025 report from the Association for the Advancement of Artificial Intelligence (AAAI) found the following after surveying 475 AI experts (page 63): 

The majority of respondents (76%) assert that “scaling up current AI approaches” to yield AGI is “unlikely” or “very unlikely” to succeed, suggesting doubts about whether current machine learning paradigms are sufficient for achieving general intelligence.

The approach to AGI that Chollet favours is a combination of deep learning and program synthesis.


Related post: ARC-AGI-2 Overview With François Chollet

Comments



With Chollet acknowledging that o1/o3 (and ARC-AGI-1 being beaten) was a significant breakthrough, how much of this talk is now outdated vs. still relevant?

I think it’s still very relevant! It’s just important to also have the more recent information about o3 in addition to what’s in this talk. (That’s why I linked the other talk at the bottom of this post.)

By the way, I think it’s just o3 and not o1 that achieves the breakthrough results on ARC-AGI-1. It looks like o1 only gets 32% on ARC-AGI-1, whereas the lower-compute version of o3 gets around 76% and the higher-compute version gets around 87%.

The lower-compute version of o3 only gets 4% on ARC-AGI-2 in partial testing (full testing has not yet been done) and the higher-compute version has not yet been tested.

Chollet speculates in this blog post about how o3 works (I don’t think OpenAI has said much about this) and how that fits into his overall thinking about LLMs and AGI:

Why does o3 score so much higher than o1? And why did o1 score so much higher than GPT-4o in the first place? I think this series of results provides invaluable data points for the ongoing pursuit of AGI.

My mental model for LLMs is that they work as a repository of vector programs. When prompted, they will fetch the program that your prompt maps to and "execute" it on the input at hand. LLMs are a way to store and operationalize millions of useful mini-programs via passive exposure to human-generated content.
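
To make the "memorize, fetch, apply" picture more concrete, here is a minimal illustrative sketch (mine, not Chollet's): a prompt is embedded, the closest stored mini-program is fetched, and that program is applied to the input. The embedding function and the program store are toy stand-ins for whatever representations an actual LLM learns during training.

```python
# Illustrative sketch of the "repository of vector programs" mental model:
# embed the prompt, fetch the nearest stored mini-program, apply it.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy stand-in for a learned text embedding."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=64)

# "Training" amounts to memorising (description, program) pairs.
PROGRAM_STORE = {
    "reverse the words in a sentence": lambda s: " ".join(reversed(s.split())),
    "uppercase the text":              lambda s: s.upper(),
    "count the words":                 lambda s: str(len(s.split())),
}
KEYS = list(PROGRAM_STORE)
KEY_VECS = np.stack([embed(k) for k in KEYS])

def fetch_and_apply(prompt: str, task_input: str) -> str:
    """Fetch the stored program whose description is closest to the prompt."""
    q = embed(prompt)
    sims = KEY_VECS @ q / (np.linalg.norm(KEY_VECS, axis=1) * np.linalg.norm(q))
    best = KEYS[int(np.argmax(sims))]
    return PROGRAM_STORE[best](task_input)

# Works only if the prompt maps onto something already memorised;
# a genuinely novel task has no matching program to fetch.
print(fetch_and_apply("reverse the words in a sentence", "the cat sat"))
```

The point of the toy is the failure mode described in the next paragraph: if the prompt does not map onto anything already memorised, there is no right program to fetch.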

This "memorize, fetch, apply" paradigm can achieve arbitrary levels of skill at arbitrary tasks given appropriate training data, but it cannot adapt to novelty or pick up new skills on the fly (which is to say that there is no fluid intelligence at play here.) This has been exemplified by the low performance of LLMs on ARC-AGI, the only benchmark specifically designed to measure adaptability to novelty – GPT-3 scored 0, GPT-4 scored near 0, GPT-4o got to 5%. Scaling up these models to the limits of what's possible wasn't getting ARC-AGI numbers anywhere near what basic brute enumeration could achieve years ago (up to 50%).

To adapt to novelty, you need two things. First, you need knowledge – a set of reusable functions or programs to draw upon. LLMs have more than enough of that. Second, you need the ability to recombine these functions into a brand new program when facing a new task – a program that models the task at hand. Program synthesis. LLMs have long lacked this feature. The o series of models fixes that.
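
As a rough illustration of what "recombining reusable functions into a brand-new program" looks like in its most naive form (in the spirit of the brute enumeration baselines mentioned above, and not a description of how o3 works), here is a toy Python sketch that searches compositions of a small primitive library until one fits a handful of demonstration pairs. The primitives and the task are invented for the example.

```python
# Toy brute-force program synthesis: recombine a small library of reusable
# primitives into a new program that fits a few demonstration pairs of a
# previously unseen task.
from itertools import product

# Reusable "knowledge": a library of primitive functions over lists of ints.
PRIMITIVES = {
    "reverse":   lambda xs: xs[::-1],
    "sort":      lambda xs: sorted(xs),
    "double":    lambda xs: [2 * x for x in xs],
    "drop_last": lambda xs: xs[:-1],
}

def synthesize(examples, max_depth=3):
    """Enumerate compositions of primitives until one fits all examples."""
    for depth in range(1, max_depth + 1):
        for names in product(PRIMITIVES, repeat=depth):
            def program(xs, names=names):
                for name in names:
                    xs = PRIMITIVES[name](xs)
                return xs
            if all(program(inp) == out for inp, out in examples):
                return names  # the synthesized program, as a pipeline of names
    return None

# A "novel task" specified only by demonstrations: double every element, reverse.
demos = [([1, 2, 3], [6, 4, 2]), ([5, 1], [2, 10])]
print(synthesize(demos))  # prints a pipeline consistent with the demos, e.g. ('reverse', 'double')
```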

For now, we can only speculate about the exact specifics of how o3 works. But o3's core mechanism appears to be natural language program search and execution within token space – at test time, the model searches over the space of possible Chains of Thought (CoTs) describing the steps required to solve the task, in a fashion perhaps not too dissimilar to AlphaZero-style Monte-Carlo tree search. In the case of o3, the search is presumably guided by some kind of evaluator model. To note, Demis Hassabis hinted back in a June 2023 interview that DeepMind had been researching this very idea – this line of work has been a long time coming.

So while single-generation LLMs struggle with novelty, o3 overcomes this by generating and executing its own programs, where the program itself (the CoT) becomes the artifact of knowledge recombination. Although this is not the only viable approach to test-time knowledge recombination (you could also do test-time training, or search in latent space), it represents the current state-of-the-art as per these new ARC-AGI numbers.

Effectively, o3 represents a form of deep learning-guided program search. The model does test-time search over a space of "programs" (in this case, natural language programs – the space of CoTs that describe the steps to solve the task at hand), guided by a deep learning prior (the base LLM). The reason why solving a single ARC-AGI task can end up taking up tens of millions of tokens and cost thousands of dollars is because this search process has to explore an enormous number of paths through program space – including backtracking.
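
Since OpenAI has not published how o3 actually works, the following is only a speculative sketch of the search pattern Chollet describes: a proposer (standing in for the base LLM) suggests candidate next reasoning steps, an evaluator model scores partial chains of thought, and a best-first search keeps many partial chains alive so it can backtrack. Every function here (propose_steps, score, is_solution) is a hypothetical stub, not a real API.

```python
# Speculative sketch of deep learning-guided search over chains of thought,
# following the description above -- NOT OpenAI's actual o3 implementation.
import heapq

def propose_steps(task: str, chain: list[str], k: int = 3) -> list[str]:
    """Stand-in for sampling k candidate next reasoning steps from a base LLM."""
    return [f"step {len(chain) + 1}, variant {i}" for i in range(k)]

def score(task: str, chain: list[str]) -> float:
    """Stand-in for an evaluator model rating a partial chain of thought."""
    return -len(chain)  # placeholder heuristic

def is_solution(task: str, chain: list[str]) -> bool:
    """Stand-in for checking whether the chain yields a final answer."""
    return len(chain) >= 4

def search_cot(task: str, budget: int = 50):
    """Best-first search over the space of chains of thought.

    The frontier keeps many partial chains alive, so poor branches are
    abandoned and better earlier ones revisited -- the backtracking that
    makes this kind of test-time search expensive in tokens.
    """
    frontier = [(-score(task, []), [])]  # max-heap via negated scores
    for _ in range(budget):
        if not frontier:
            break
        _, chain = heapq.heappop(frontier)
        if is_solution(task, chain):
            return chain  # the found "program" (a chain of thought)
        for step in propose_steps(task, chain):
            new_chain = chain + [step]
            heapq.heappush(frontier, (-score(task, new_chain), new_chain))
    return None

print(search_cot("solve the ARC task"))
```

Even in this toy form you can see where the token cost comes from: the frontier multiplies at every step, and most of the chains it expands are eventually discarded.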

There are, however, two significant differences between what's happening here and what I meant when I previously described "deep learning-guided program search" as the best path to get to AGI. Crucially, the programs generated by o3 are natural language instructions (to be "executed" by an LLM) rather than executable symbolic programs. This means two things. First, that they cannot make contact with reality via execution and direct evaluation on the task – instead, they must be evaluated for fitness via another model, and the evaluation, lacking such grounding, might go wrong when operating out of distribution. Second, the system cannot autonomously acquire the ability to generate and evaluate these programs (the way a system like AlphaZero can learn to play a board game on its own.) Instead, it is reliant on expert-labeled, human-generated CoT data.

It's not yet clear what the exact limitations of the new system are and how far it might scale. We'll need further testing to find out. Regardless, the current performance represents a remarkable achievement, and a clear confirmation that intuition-guided test-time search over program space is a powerful paradigm to build AI systems that can adapt to arbitrary tasks.
