The news:
ARC Prize published a blog post on April 22, 2025 that says OpenAI's o3 (Medium) scores 2.9% on the ARC-AGI-2 benchmark.[1] As of today, the leaderboard says that o3 (Medium) scores 3.0%. The blog post says o4-mini (Medium) scores 2.3% on ARC-AGI-2 and the leaderboard says it scores 2.4%.
The high-compute versions of the models "failed to respond or timed out" for the large majority of tasks.
The average score for humans — typical humans off the street — is 60%. All of the ARC-AGI-2 tasks have been solved by at least two humans in no more than two attempts.
From the recent blog post:
ARC Prize Foundation is a nonprofit committed to serving as the North Star for AGI by building open reasoning benchmarks that highlight the gap between what’s easy for humans and hard for AI. The ARC-AGI benchmark family is our primary tool to do this. Every major model we evaluate adds new datapoints to the community’s understanding of where the frontier stands and how fast it is moving.
In this post we share the first public look at how OpenAI’s newest o-series models, o3 and o4-mini, perform on ARC-AGI.
Our testing shows:
- o3 performs well on ARC-AGI-1 - o3-low scored 41% on the ARC-AGI-1 Semi Private Eval set, and o3-medium reached 53%. Neither surpassed 3% on ARC-AGI-2.
- o4-mini shows promise - o4-mini-low scored 21% on ARC-AGI-1 Semi Private Eval, and o4-mini-medium scored 41% at state-of-the-art levels of efficiency. Again, both low/med scored under 3% on the more difficult ARC-AGI-2 set.
- Incomplete coverage with high reasoning - Both o3 and o4-mini frequently failed to return outputs when run at “high” reasoning. Partial high-reasoning results appear below. However, these runs were excluded from the leaderboard due to insufficient coverage.
My analysis:
This is clear evidence that cutting-edge AI models have far less than human-level general intelligence.
To be clear, scoring at human-level or higher on ARC-AGI-2 isn't evidence of human-level general intelligence and isn't intended to be. It's simply meant to be a challenging benchmark for AI models that attempts to measure models' ability to generalize to novel problems, rather than to rely on memorization to solve problems.
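To make that concrete, here is a toy puzzle in the spirit of ARC's grid-task format. This is an invented example, not an actual ARC-AGI-2 task, and the real benchmark's rules are far less obvious than this one — but the structure is the same: a few input/output grid pairs share a hidden rule, and the solver must infer the rule and apply it to a new test input, with no opportunity to rely on memorization.

```python
# Toy illustration of the ARC task format (an invented example,
# not an actual ARC-AGI-2 task): a few input/output grid pairs
# share a hidden rule, and the solver must infer that rule and
# apply it to an unseen test input.

def mirror(grid):
    """The hidden rule in this toy task: flip each row left-to-right."""
    return [row[::-1] for row in grid]

# Training pairs a solver would study to infer the rule.
# Grids are small matrices of integers (colors in real ARC tasks).
train = [
    ([[1, 0], [0, 2]], [[0, 1], [2, 0]]),
    ([[3, 3, 0], [0, 0, 4]], [[0, 3, 3], [4, 0, 0]]),
]

# Sanity check: the hidden rule explains every training pair.
assert all(mirror(inp) == out for inp, out in train)

# A solver that truly generalized would produce this output
# for the unseen test input.
test_input = [[5, 0, 0], [0, 6, 0]]
print(mirror(test_input))  # [[0, 0, 5], [0, 6, 0]]
```

The point of the benchmark is that the rule is never stated and never repeats across tasks, so pattern-matching against training data doesn't help — which is exactly the regime where current models collapse to single-digit scores.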
By analogy, o4-mini's inability to play hangman is a sign that it's far from artificial general intelligence (AGI), but if o5-mini or a future version of o4-mini is able to play hangman, that wouldn't be a sign that it is AGI.
This is also conclusive disconfirmation (as if we needed it!) of the economist Tyler Cowen's declaration that o3 is AGI. (He followed up a day later and said, "I don’t mind if you don’t want to call it AGI." But he didn't say he was wrong to call it AGI.)
It is inevitable that over the next 5 years, many people will realize that their belief in AGI arriving within the next 5 years was wrong. (Though not necessarily all of them, since, as Tyler Cowen showed, it is possible to declare an AI model AGI when it clearly is not. To avoid admitting error, people can simply declare whatever model is current in 2027 or 2029 or 2030, or whenever they predicted AGI would happen, to be AGI.) ARC-AGI-2 and, later on, ARC-AGI-3 can serve as a clear reminder that frontier AI models are not AGI, are not close to AGI, and continue to struggle with relatively simple problems that are easy for humans.
If you imagine fast enough progress, then no matter how far current AI systems are from AGI, you can imagine them crossing that gap in an incredibly short span of time. But there is no reason to think progress will be fast enough to cover the ground from o3 (or any other frontier AI model) to AGI within 5 years.
The models that exist today are somewhat better than the models that existed 2 years ago, but only somewhat. In 2 years, the models will probably be somewhat better than today, but only somewhat.
It's hard to quantify general intelligence in a way that allows apples-to-apples comparisons between humans and machines. If we measure general intelligence by measuring the ability to play grandmaster-level chess, well, IBM's Deep Blue could do that in 1996. If we give ChatGPT an IQ test, it will score well above 100, the average for humans. Large language models (LLMs) are good at taking written tests and exams, which is what a lot of popular benchmarks are.
So, when I say today's AI models are somewhat better than AI models from 2 years ago, that's an informal, subjective evaluation based on casual observation and intuition. I don't have a way to quantify intelligence. Unfortunately, no one does.
In lieu of quantifying intelligence, I think pointing to the kinds of problems frontier AI models can't solve — problems that are easy for humans — and to the slow (or non-existent) progress in those areas is strong enough evidence against very near-term AGI. For example, o3 scores only 3% on ARC-AGI-2, o4-mini can't play hangman, and, after the last 2 years of progress, models still hallucinate frequently and still struggle to understand time, causality, and other simple concepts. They have very little capacity for hierarchical planning. There has been a little improvement on these fronts, but not much.
Watch the ARC-AGI-2 leaderboard (and, later on, the ARC-AGI-3 leaderboard) over the coming years. It will be a better way to quantify progress toward AGI than any other benchmark or metric I'm currently aware of, nearly all of which seem unhelpful for measuring AGI progress. (Again, with the caveat that solving ARC-AGI-2 doesn't mean a system is AGI, but failing to solve it means a system isn't AGI.) I have no idea how long it will take to solve ARC-AGI-2 (or ARC-AGI-3), but I suspect we will roll past the deadline of at least one attention-grabbing prediction of very near-term AGI before it is solved.[2]
- ^
For context, read ARC Prize's blog post from March 24, 2025 announcing and explaining the ARC-AGI-2 benchmark. I also liked this video explaining ARC-AGI-2.
- ^
For example, Elon Musk has absurdly predicted that AGI will be created by the end of 2025, and I wouldn't be at all surprised if on January 1, 2026, the top score on ARC-AGI-2 is still below 60%.