Artificial general intelligence (AGI) — an AI system that can think, plan, learn, and solve problems like a human does, at the level of competence a human does — is not just a matter of scaling up existing AI models (an approach that is, in any case, forlorn), but requires several problems in fundamental AI research to be solved. Those problems are:
- Hierarchical planning
- Continual learning
- Learning from video data
- Data efficiency
- Generalization
- Reliability
While large language models (LLMs) are considered by many people to represent major fundamental progress in AI, this is in an important sense not true at all — LLMs have made no progress on any of these problems. AGI remains at least as far away as the time it will take to solve all of them, which is hard to predict because progress in basic science is hard to predict.
I will give a description of each of these problems.
Hierarchical planning
Hierarchical planning is the ability to plan complex tasks that contain nested hierarchies of other tasks. For example, the task "make a sandwich" includes the task "get some bread". "Get some bread" includes "walk to the kitchen". "Walk to the kitchen" includes "plan a path to the kitchen and watch out for obstacles (e.g. the cat lying on the kitchen floor)", which includes "step over the cat, but stop moving if the cat suddenly gets up". And so on.
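As a minimal sketch of what such a nested hierarchy looks like as a data structure (the Task class and the example decomposition below are purely illustrative and say nothing about how an AI system would learn, plan over, or execute such a hierarchy, which is the open problem):

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """A task that may decompose into ordered subtasks (illustrative only)."""
    name: str
    subtasks: list = field(default_factory=list)

def expand(task, depth=0):
    """Recursively print a task hierarchy; leaves correspond to primitive actions."""
    print("  " * depth + task.name)
    for subtask in task.subtasks:
        expand(subtask, depth + 1)

make_sandwich = Task("make a sandwich", subtasks=[
    Task("get some bread", subtasks=[
        Task("walk to the kitchen", subtasks=[
            Task("plan a path, watching for obstacles", subtasks=[
                Task("step over the cat, but stop moving if the cat gets up"),
            ]),
        ]),
    ]),
])

expand(make_sandwich)
```

The hard part is not representing such a hierarchy but getting an AI system to construct and revise one on the fly, at every level of abstraction at once.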
Hierarchical reinforcement learning is one idea for how to solve hierarchical planning, but it remains an open research problem.
Continual learning
Currently, deep learning-based and deep reinforcement learning-based systems have a training phase and a test phase. A model is trained for some amount of time, then training is done, permanently, and the model is deployed into the world, at which point it stops learning, permanently. Continual learning would mean there is no longer a distinction between the training phase and the test phase. AI systems would always be learning, just as humans do.
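As a minimal sketch of the distinction (the toy linear model below is purely illustrative and is not meant to represent how any real system is trained):

```python
import numpy as np

class TinyModel:
    """A toy linear regressor, used only to make the contrast concrete."""
    def __init__(self, dim):
        self.w = np.zeros(dim)
    def predict(self, x):
        return float(self.w @ np.asarray(x, dtype=float))
    def update(self, x, y, lr=0.01):
        x = np.asarray(x, dtype=float)
        self.w += lr * (y - self.predict(x)) * x  # one SGD step on squared error

def train_then_deploy(model, train_data, deploy_inputs):
    # Standard practice: a finite training phase, then the weights are frozen.
    for x, y in train_data:
        model.update(x, y)
    return [model.predict(x) for x in deploy_inputs]  # no further learning ever happens

def continual_learning(model, stream):
    # Continual learning: no train/test boundary; any interaction can update the model.
    outputs = []
    for x, feedback in stream:
        outputs.append(model.predict(x))
        if feedback is not None:
            model.update(x, feedback)  # keep learning after "deployment"
    return outputs
```

The point is only that, in the second loop, prediction and learning are interleaved indefinitely, whereas in the first, learning stops the moment deployment begins.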
Learning from video data
Large language models (LLMs) benefit from the inherent structure in text data. Words or tokens are clean, crisp compositional units of text data that have no counterpart in video data. Pixels are far too granular, and a single pixel lacks the semantic meaning that a word or token carries. Text prediction has a natural form: LLMs can predict the probability of the next word or token and, say, rank the five most likely words/tokens to be next in the sequence and assign a percentage probability to each. No natural form for video prediction exists. Trying to predict video pixel-to-pixel leads to an uncontainable combinatorial explosion.
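To give a back-of-the-envelope sense of the combinatorial explosion (the frame size here is an illustrative assumption, not a claim about any particular model): even a single, modest 256 × 256 RGB frame with 8-bit colour channels has roughly

$$256^{256 \times 256 \times 3} = 2^{1{,}572{,}864} \approx 10^{473{,}000}$$

possible configurations, before considering any temporal structure across frames at all. Nothing like a clean, enumerable vocabulary of next frames exists the way a vocabulary of next tokens does.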
Effective video prediction probably requires a semantic and conceptual understanding of what's happening in the world, e.g. predicting plausible outcomes for video of a car driving up to a red stoplight requires knowing cars are things that drive on roads, that gravity keeps heavy objects stuck to the ground, that red means stop, etc. Conversely, effective video prediction techniques may help AI models gain this sort of conceptual and semantic understanding of the world.
Data efficiency
Humans frequently learn from zero examples, one example, two examples, or just a few. Deep learning models often require hundreds or thousands of training examples to get a competent grasp on a concept, e.g. a model must train on 1,000 photos of bananas to be able to classify photos of bananas with 91% accuracy. Models that learn via deep reinforcement learning require massive amounts of trial-and-error experience to learn skills.
For example, DeepMind's system AlphaStar couldn't learn how to play StarCraft II using reinforcement learning from scratch (this is related to difficulties with hierarchical planning). First, it required 971,000 examples of human-played games to learn from via imitation learning. If the average game is 10 minutes, this is equivalent to 18.5 years of continuous play — something like three to four orders of magnitude more than humans require to learn to play at a comparable level of skill. Second, to attain Grandmaster-level skill, AlphaStar did 60,000 years of training via self-play, which is at least three orders of magnitude more than professional StarCraft II players could have played during their lifetimes, even assuming all their waking hours since birth had been devoted to playing.
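To spell out the arithmetic behind these figures (the 10-minute average game length is the assumption stated above, and treating a professional's maximum possible lifetime play as roughly 20 years of continuous play is itself a generous, illustrative assumption):

$$971{,}000 \text{ games} \times 10 \text{ min} = 9{,}710{,}000 \text{ min} \approx 161{,}833 \text{ h} \approx 6{,}743 \text{ days} \approx 18.5 \text{ years}$$

$$\frac{60{,}000 \text{ years of self-play}}{\sim 20 \text{ years of continuous play}} = 3{,}000 \approx 10^{3.5}$$

which is where the "at least three orders of magnitude" figure comes from.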
Generalization
Generalization is the ability of an AI system to understand concepts or information that isn’t well-represented in its training data. For example, can a convolutional neural network that has been trained on 1,000 labelled photos of bananas recognize an image of a cartoon banana? Can AlphaStar respond competently if a Grandmaster-level human player tries a new strategy or tactic that wasn’t in the 971,000 recordings of human-played games it trained on, and that didn’t come up during self-play?
Generalization is the holy grail of AI research. Current AI systems do not generalize well, at least not compared to humans, or even other mammals. AI systems are brittle, meaning they quickly fall apart when challenged with novel problems or situations.
Generalization is not to be confused with the ability to do a lot of different things. For instance, DeepMind’s model MuZero can play 57 Atari games, but wouldn’t be able to play a 58th Atari game you presented it with. MuZero can’t generalize from what it’s learned to something novel.
Reliability
The AI researcher Ilya Sutskever, a co-founder and former chief scientist of OpenAI, as well as a co-author of the breakthrough AlexNet paper in 2012, has identified reliability as possibly the hardest challenge for deep learning to overcome. In a 2023 appearance on Dwarkesh Patel’s podcast, Sutskever made the following remarks:
…there is this effect where optimistic people who are working on the technology tend to underestimate the time it takes to get there. But the way I ground myself is by thinking about the self-driving car. In particular, there is an analogy where if you look at the — so I have a Tesla, and if you look at the self-driving behavior of it, it looks like it does everything. It does everything. But it's also clear that there is still a long way to go in terms of reliability. And we might be in a similar place with respect to our models where it also looks like we can do everything, and at the same time, we will need to do some more work until we really iron out all the issues and make it really good and really reliable and robust and well-behaved.
Patel then asked what would be the most likely cause if the economic value of LLMs turned out to be disappointing. Sutskever again flagged reliability:
I really don't think that's a likely possibility, so that's the preface to the comment. But if I were to take the premise of your question, well, why were things disappointing in terms of real-world impact? My answer would be reliability. If somehow it ends up being the case that you really want them to be reliable and they ended up not being reliable, or if reliability turned out to be harder than we expect.
I really don't think that will be the case. But if I had to pick one and you were telling me — hey, why didn't things work out? It would be reliability. That you still have to look over the answers and double-check everything. That just really puts a damper on the economic value that can be produced by those systems.
In many cases, the economically important question is not what an AI system can do correctly 50% of the time or 90% of the time, but what it can do correctly 99.999% of the time, if that's the sort of reliability humans achieve on the task. Self-driving cars are the prime example of this, but the same idea applies to LLMs. If a company wants to use LLMs to summarize financial documents, for instance, mistakes in the summaries are costly to correct, since a human must check the summaries for accuracy. If mistakes slip past human reviewers and aren't corrected, there is a risk that bad information could lead to even more costly mistakes down the line.
Getting reliability from 90% to 99% and then to 99.9% and so on is often intractable in practice. Deep learning scaling trends indicate that each further increment of accuracy requires a multiplicative increase in training data. Obtaining large-scale training data for self-driving cars is expensive and carries safety concerns. In the case of LLMs, the supply of new training data is nearly exhausted.
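To illustrate why each further "nine" of reliability is so costly (the power-law form is the shape typically reported in deep learning scaling studies, but the exponent below is purely illustrative): if error falls as a power law in dataset size, $\text{error} \approx C \cdot N^{-\alpha}$, then cutting the error by a factor of $k$ requires multiplying the data by $k^{1/\alpha}$:

$$\frac{N_2}{N_1} = \left(\frac{\text{error}_1}{\text{error}_2}\right)^{1/\alpha}, \qquad \text{e.g. } \alpha = 0.1,\ \frac{\text{error}_1}{\text{error}_2} = 10 \;\Rightarrow\; \frac{N_2}{N_1} = 10^{10}$$

On those assumptions, going from 90% to 99% reliability costs a factor of $10^{10}$ in data, and going from 99% to 99.9% costs the same factor again.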
Implications for AGI timelines
I agree with the physicist David Deutsch’s argument that inductively extrapolating a trend forward, without an explanatory theory of what is causing that trend, is equivalent to a belief in magic. In the case of AI progress, moreover, what such an extrapolation actually implies is ambiguous.
If you have a vague sense that AI has been making a lot of progress, you could simply extrapolate that AI will continue to make a lot of progress, and then use your imagination to fill in the gaps of what that means, specifically. If you look at specific problems like those described above and note that minimal progress has been made on them in recent years, you could extrapolate this forward and infer that solving any of them will take centuries. But neither of these is a sensible approach to making predictions; extrapolating a trend without understanding why that trend is happening isn’t sensible in the first place.
My conclusion is that the time remaining until AGI is highly uncertain, but I can get a vague intuition for it based on research progress on these basic science problems. Solving all of these problems in the very near future seems unlikely, so it seems that AGI is very unlikely within, say, the next five to ten years. However, beyond the very near future, the years laid out before us quickly slip into a fog of uncertainty. To try to say whether AGI is 30 years away or 60 or 90 or 120 seems impossible in a quite fundamental sense. It seems completely forlorn to even try to guess. We do not understand the basic science involved, and we do not understand how it will eventually be figured out. There is no precedent for anyone ever correctly predicting anything like this (except maybe through sheer luck).
Tracking AGI progress
To track AGI progress, one should not rely on text-based question-and-answer benchmarks for LLMs or similarly simplistic and uninformative indicators. One should track research progress on each one of these fundamental research problems. This is not something one can readily quantify, but I challenge the assumption that we should expect to be able to readily quantify science or scientific progress. Premature operationalization (e.g. defining happiness as daily smile frequency) is the bane of good conceptual understanding. Underlying the desire to quantify is the desire for rigour, but operationalizing an informal concept with a sloppy, oversimplifying proxy is the opposite of rigour. This should be sternly frowned upon — the equivalent of making unsupported assertions.
It may be possible to eventually quantify the thing you want to measure, but this requires patience. First, understand the thing you want to measure as deeply as possible. Then try to imagine ways it could possibly be quantified. Some people want to rush to the second part, but this can only end badly. If you don’t understand what you’re trying to measure, you won’t measure it well.
For now, the best way to track AGI progress remains qualitative. Look at the research and see how much progress is being made. There will be measurements involved, but that won’t be the whole story. Moreover, figuring out which measurements matter and which ones don’t will be a complex reasoning process, requiring philosophical reflection on how to formalize informal concepts.
Conclusion
The amount of progress in fundamental AI research required for AGI has been widely ignored or underestimated in discussions around forecasting AGI. Discussions often proceed on the assumption that scaling will lead to AGI, which is false, or that progress on fundamental research problems has been steady, rapid, and continuous, which is also false.
There is also a kind of supernatural thinking prevalent in some discussions, in which the idea of AGI inventing itself is seriously discussed — but this is an impossibility. For an AI system to invent anything, first humans must invent an AI system with the ability to invent things. To the extent people think AI systems can already do this or are just on the cusp of it, they misunderstand current AI capabilities and limitations.
Qualitative impressions of the amount of research progress on the problems I described above may vary from person to person. What seems clear, in any case, is that people who agree with my account of the research obstacles on the road to AGI will tend to think that near-term AGI is unlikely or at least very uncertain.

Can we bet on this? I propose: we give a video model of your choice from 2023 and one of my choice from 2025 two prompts (one of your choice, one of my choice), then ask some neutral panel of judges (I'm happy to just ask random people in a coffee shop) which model produced more realistic videos.
I didn’t say that pixel-to-pixel prediction or other low-level techniques haven’t made incremental progress. I said that this approach is ultimately forlorn — if the goal is human-level computer vision for robotics applications or AGI that can see — and that LLMs haven’t made any progress on alternative approaches.
What are examples of what you would consider to be progress on "effective video prediction"?
Possibly something like V-JEPA 2, but in that case I'm just going off of Meta touting its own results, and I would want to hear opinions from independent experts.
Sorry, I don't mean models that you consider to be better, but rather metrics/behaviors. Like what can V-JEPA 2 (or any model) do that previous models couldn't, which you would consider to be a sign of progress?
The V-JEPA 2 abstract explains this:
Again, the caveat here is that this is Meta touting their own results, so I take it with a grain of salt.
I don't think higher scores on the benchmarks mentioned automatically imply progress on the underlying technical challenge. It's more about the underlying technical ideas in V-JEPA 2 — Yann LeCun has explained the rationale for these ideas — and where they could ultimately go given further research.
I'm very skeptical of AI benchmarks in general because I tend to think they have poor construct validity, depending on how you interpret them, i.e., insofar as they attempt to measure cognitive abilities or aspects of general intelligence, they mostly don't measure those things successfully.
The clearest and crudest example to illustrate this point is LLM performance on IQ tests. The naive interpretation is that if an LLM scores above average on an IQ test, i.e., above 100, then it must have the cognitive properties a human does when they score above average on an IQ test, that is, such an LLM must be a general intelligence. But many LLMs, such as GPT-4 and Claude 3 Opus, score well above 100 on IQ tests. Are GPT-4 and Claude 3 Opus therefore AGIs? No, of course not. So, IQ tests don't have construct validity when applied to LLMs, if you interpret them as measuring general intelligence in AI systems.
I don't think anybody really believes IQ tests actually prove LLMs are AGIs, which is why it's a useful example. But people often do use benchmarks to compare LLM intelligence to human intelligence based on similar reasoning. I don't think the reasoning is any more valid with those benchmarks than it is for IQ tests.
Benchmarks are useful for measuring certain things; I'm not trying to argue with narrow interpretations. I'm specifically arguing with the use of benchmarks to put general intelligence on a number line, such that a lower score on a benchmark means an AI system is further away from general intelligence and a higher score means it is closer to general intelligence. This isn't valid with IQ tests and it isn't valid with most benchmarks.
Researchers can validly use benchmarks as a measure of performance, but I want to guard against overly broad interpretations of benchmarks, as if they were scientific tests of cognitive ability or general intelligence — which they aren't.
Just one example of what I mean: if you show AI models an image of a 3D model of an object, such as a folding chair, in a typical pose, they will correctly classify the object 99.6% of the time. You might conclude: these AI models have a good visual understanding of these objects, of what they are, of how they look. But if you just rotate the 3D models into an atypical pose, such as showing the folding chair upside-down, object recognition accuracy drops to 67.1%. The error rate increases by 82x from 0.4% to 32.9%. (Humans perform equally well regardless of whether the pose is typical or atypical — good robustness!)
Usually, when we measure AI performance on some dataset or some set of tasks, we don't do this kind of perturbation to test robustness. And this is just one way you can call the construct validity of benchmarks into question. (That is, if benchmarks are being construed more broadly than their creators probably intend, which in most cases they are.)
Economic performance is a more robust test of AI capabilities than almost anything else. However, it's also a harsh and unforgiving test, which doesn't allow us to measure early progress.
I think this probably overstates things? For example, o3 was able to achieve human-level performance on ARC-AGI-1, which I think counts as at least some kind of progress on the problems of generalization and data efficiency?
Why?
Because the ARC benchmark was specifically designed to be a test of general intelligence (do you disagree that it successfully achieves this?) and because each problem takes the form of requiring you to spot a pattern from only a couple of examples.
I was excited[1] about o3's performance on ARC-AGI-1 initially, but then I read these tweets from Toby Ord:
This is how François Chollet, the creator of ARC-AGI-1, characterized o3's results:
ARC-AGI-1 and ARC-AGI-2 are the most interesting LLM benchmarks (and I'm sure ARC-AGI-3 will be very interesting when it comes out). The results of o3 (and other LLMs since) on these benchmarks are quite interesting. However, now that I know more detail, I have a hard time getting excited about LLMs' performance.
I'm not sure Chollet's "zero to one" framing makes sense unless he's just talking about LLMs — which I guess he probably is. It seems like there's been a mustard seed of generalization or fluid intelligence in deep learning-based and deep reinforcement learning-based systems for a long time. Maybe Go aficionados are just reading too much into stochastic behaviour, but a lot of people were impressed with some of AlphaGo's moves, and called them creative and surprising. That was back in 2016.
If you think AI had literally zero generalization or zero fluid intelligence before o3 and then o3 demonstrated a tiny amount, that's potentially very exciting. Chollet framing the results in this way is why I was initially excited about o3. But if you think AI has had a tiny amount of generalization or fluid intelligence for a long time and continues to have a tiny amount, the result is much less exciting — although it's still a fascinating case study to contemplate.
[1] I say excited and not scared because I think AI is a good thing and not risky.
I don't disagree with much of this comment (to the extent that it puts o3's achievement in its proper context), but I think this is still inconsistent with your original "no progress" claim (whether the progress happened pre or post o3's ARC performance isn't really relevant). I suppose your point is that the "seed of generalization" that LLMs contain is so insignificant that it can be rounded to zero for practical purposes? That was true pre o3 and is still true now? Is that a fair summary of your position? I still think "no progress" is too bold!
But in addition, I think I also disagree with you that there is nothing exciting about o3's ARC performance.
It seems obvious that LLMs have always had some ability to generalize. Any time that they produce a coherent response that has not appeared verbatim in their training data, they are doing some kind of generalization. And I think even Chollet has always acknowledged that too. I've heard him characterize LLMs (pre-ARC success) as combining dense sampling of the problem space with an extremely weak ability to generalize, contrasting that with the ability of humans to learn from only a few examples. But there is still an acknowledgement here that some non-zero generalization is happening.
But if this is your model of how LLMs work, that their ability to generalize is extremely weak, then you don't expect them to be able to solve ARC problems. They shouldn't be able to solve ARC problems even if they had access to unlimited inference time compute. Ok, so o3 had 1,024 attempts at each task, but that doesn't mean it tried the task 1,024 times until it hit on the correct answer. That would be cheating. It means it tried the task 1,024 times and then did some statistics on all of its solutions before providing a single guess, which turned out to be right most of the time!
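(As a rough sketch of what "did some statistics" means here, assuming something like majority voting over the sampled solutions; the aggregation step is an assumption on my part, used only to illustrate the general shape of the approach:)

```python
from collections import Counter

def solve_with_sampling(attempt_task, task, n_attempts=1024):
    """Sample many independent candidate solutions, then submit the most common one.

    `attempt_task` is a hypothetical stand-in for one full model attempt at a task;
    majority voting over the candidates is assumed here purely for illustration.
    """
    candidates = [attempt_task(task) for _ in range(n_attempts)]
    # Candidates are assumed hashable, e.g. output grids serialized as strings.
    best_candidate, _count = Counter(candidates).most_common(1)[0]
    return best_candidate
```

The single returned guess is what gets scored, not the best of the 1,024 attempts.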
I think it is surprising and impressive that this worked! This wouldn't have worked with GPT-3. You could have given it chain of thought prompting, let it write as much as it wanted per attempt, and given it a trillion attempts at each problem, but I still don't think you would expect to find the correct answer dropping out at the end. In at least this sense, o3 was a genuine improvement in generalization ability.
And Chollet thought it was impressive too, describing it as a "genuine breakthrough", despite all the caveats that go with that (that you've already quoted).
When LLMs can solve a task, but only with masses of training data, then I think it is fair to contrast their data efficiency with that of humans and write off their intelligence as memorization rather than generalization. But when they can only solve a task by expending masses of inference-time compute, I think it is harder to write that off in the same way. Mainly because: we don't really know how much inference-time compute humans are using! (I don't think? Unless we understand the brain a lot better than I thought we did). I wouldn't be surprised at all if we find that AGI requires spending a lot of inference-time compute. I don't think that would make it any less AGI.
The extreme inference-time compute costs are really important context to bear in mind when forecasting how AI progress is going to go, and what kinds of things are going to be possible. But I don't think they provide a reason to describe the intelligence as not "general", in the way that extreme data inefficiency does.
All deep learning systems since 2012 have had some extremely limited generalization ability. If you show AlexNet a picture of an object from a class it was trained on, but with some novel differences (e.g. maybe it’s black-and-white, or upside-down, or the dog in the photo is wearing a party hat), it will still do much better than chance at classifying the image. In an extremely limited sense, that is generalization.
I’m not sure I can agree with Chollet’s "zero to one" characterization of o3. To be clear, he’s saying it’s zero to one for fluid intelligence, not generalization, which is a related concept that Chollet defines a bit differently from generalization. Still, I’m not sure I can agree it’s zero to one either with regard to generalization or fluid intelligence. And I’m not sure I can agree it’s zero to one even for LLMs. It depends how strict you are about the definitions, how exact you are trying to be, and what, substantively, you’re trying to say.
I think many results in AI are incredibly impressive considered from the perspective of science and technology — everything from AlexNet to AlphaGo to ChatGPT. But this is a separate question from whether they are closing the gap between AI and general intelligence in any meaningful way. My assessment is that they’re not. I think o3’s performance on ARC-AGI-1 and now ARC-AGI-2 is a really cool science project, but it doesn’t feel like fundamental research progress on AGI (except in a really limited, narrow sense in which a lot of things would count as that).
AI systems can improve data efficiency, generalization, and other performance characteristics in incremental ways over previous systems and this can still be true. The best image classifiers today get better top-1 performance on ImageNet than the best image classifiers ten years ago, and so they are more data efficient. But it’s still true that the image classifiers of 2025 are no closer than the image classifiers of 2015 to emulating the proficiency with which humans see or other mammals see.
The old analogy — I think Douglas Hofstadter said this — is that if you climb a tree, you are closer to the Moon than your friend on the ground, but no closer to actually getting to the Moon than they are.
In some very technical sense, essentially any improvement to AI in any domain could be considered as an improvement to data efficiency, generalization, and reliability. If AI is able to do a new kind of task it wasn’t able to do before, its performance along all three characteristics has increased from zero to something. If it was already able but now it’s better at it, then its performance has increased from something to something more. But this is such a technicality and misses the substantive point.