On January 1, 2030, there will be no artificial general intelligence (AGI) and AGI will still not be imminent.
A few reasons why I think this:
-If you look at benchmarks like ARC-AGI and ARC-AGI-2, which are easy for humans to solve and intentionally designed to be a low bar for AI to clear, the weaknesses of frontier AI models are starkly revealed.[1]
-Casual, everyday use of large language models (LLMs) reveals major errors on simple thinking tasks, such as not understanding that an event that took place in 2025 could not have caused an event that took place in 2024.
-Progress does not seem like a fast exponential trend that is outpacing Moore's law and laying the groundwork for an intelligence explosion. Progress actually seems pretty slow and incremental, with a moderate improvement from GPT-3.5 to GPT-4, and another moderate improvement from GPT-4 to o3-mini. The decline in the cost of running the models, or the increase in the compute used to train models, is probably happening faster than Moore's law, but not the actual intelligence of the models (see the short sketch after this list for a sense of what that comparison means in numbers).[2]
-Most AI experts and most superforecasters give much more conservative predictions when surveyed about AGI, closer to 50 or 100 years than 5 or 10 years.[3]
-Most AI experts are skeptical that scaling up LLMs could lead to AGI.[4]
-It seems like there are deep, fundamental scientific discoveries and breakthroughs that would need to be made for building AGI to become possible. There is no evidence that we're on the cusp of those breakthroughs, and it seems like they could easily take many decades.
-Some of the well-known people now making aggressive predictions about the timeline of AGI have made similarly aggressive predictions in the past that turned out to be wrong.[5]
-The stock market doesn't think AGI is coming in 5 years.[6]
-There has been little if any clear, observable effect of AI on economic productivity or the productivity of individual firms.[7]
-AI can't yet replace human translators or do other jobs that it seems best-positioned to overtake.
-Progress on AI robotics problems, such as fully autonomous driving, has been dismal. (However, autonomous driving companies have good PR and marketing right up until the day they announce they're shutting down.)
-Discourse about AGI sounds way too millennialist, and that in itself is a reason for skepticism.
-The community of people most focused on keeping up the drumbeat of near-term AGI predictions seems insular, intolerant of disagreement or intellectual or social non-conformity (relative to the group's norms), and closed off to even reasonable, relatively gentle criticism (whether or not they pay lip service to listening to criticism or perform open-mindedness). It doesn't feel like a scientific community. It feels more like a niche subculture. It seems like a group of people just saying increasingly small numbers to each other (10 years, 5 years, 3 years, 2 years), hyping each other up (either with excitement or anxiety), and reinforcing each other's ideas all the time. It doesn't seem like an intellectually healthy community.
-A lot of the aforementioned points have been made before and there haven't been any good answers to them.
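To give a rough sense of what the Moore's law comparison in the list above amounts to, here is a minimal sketch of the arithmetic. The doubling times are illustrative assumptions (a classic ~2-year doubling for Moore's law, and a roughly 6-month doubling for frontier training compute, in the ballpark of published estimates such as Epoch AI's), not measurements of anything about the models' intelligence:

```python
# A minimal sketch of the "faster than Moore's law" arithmetic.
# The doubling times below are illustrative assumptions, not measured values.

def annual_growth_factor(doubling_time_years: float) -> float:
    """Convert a doubling time into the implied multiplier per year."""
    return 2 ** (1 / doubling_time_years)

moores_law = annual_growth_factor(2.0)        # ~1.41x per year (classic ~2-year doubling)
training_compute = annual_growth_factor(0.5)  # ~4x per year (assumed ~6-month doubling)

print(f"Moore's law:      ~{moores_law:.2f}x per year")
print(f"Training compute: ~{training_compute:.2f}x per year")
```

Even if the inputs (compute and cost) are compounding at rates like these, that by itself says nothing about how quickly the general intelligence of the models is improving, which is the quantity that actually matters here.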
I'd like to thank Sam Altman, Dario Amodei, Demis Hassabis, Yann LeCun, Elon Musk, and several others who declined to be named for giving me notes on each of the sixteen drafts of this post I shared with them over the past three months. Your feedback helped me polish a rough stone of thought into a diamond of incisive criticism.[8]
Note: I edited this post on 2025-04-12 at 20:30 UTC to add some footnotes.
1. ^ This video is a good introduction to these benchmarks. If you prefer to read, this blog post is another good introduction. (I edited this post on 2025-04-12 at 20:30 UTC to add this footnote.)
2. ^ I realized after thinking about it more that trying to guess whether the general intelligence of AI models has been increasing slower or faster than Moore's law from November 2022 to April 2025 is probably not a helpful exercise. I explain why in three sequential comments here, here, and here, and in that third comment, I re-write this paragraph to convey my intended meaning better. (I edited this post on 2025-04-12 at 20:30 UTC to add this footnote.)
3. ^ This article gives some examples of more conservative predictions. (I edited this post on 2025-04-12 at 20:30 UTC to add this footnote.)
4. ^ The source for this claim is this 2025 report from the Association for the Advancement of Artificial Intelligence. This comment has more details. (I edited this post on 2025-04-12 at 20:30 UTC to add this footnote.)
5. ^ I gave an example in a comment here. (I edited this post on 2025-04-12 at 20:30 UTC to add this footnote.)
6. ^
7. ^ After making this post, I found this paper that looks at the productivity impact of LLMs on people working in customer support. I pull an interesting quote from the study in this comment. (I edited this post on 2025-04-12 at 20:30 UTC to add this footnote.)
8. ^ This last paragraph with my "acknowledgements" is a joke, but the rest of the post isn't a joke. (I edited this post on 2025-04-12 at 20:30 UTC to add this footnote.)