tobycrisford 🔸

tobycrisford.github.io/

Comments (191)

I don't disagree with much of this comment (to the extent that it puts o3's achievement in its proper context), but I think this is still inconsistent with your original "no progress" claim (whether the progress happened pre or post o3's ARC performance isn't really relevant). I suppose your point is that the "seed of generalization" that LLMs contain is so insignificant that it can be rounded to zero for practical purposes? That was true pre o3 and is still true now? Is that a fair summary of your position? I still think "no progress" is too bold!

But in addition, I think I also disagree with you that there is nothing exciting about o3's ARC performance.

It seems obvious that LLMs have always had some ability to generalize. Any time they produce a coherent response that has not appeared verbatim in their training data, they are doing some kind of generalization. And I think even Chollet has always acknowledged that. I've heard him characterize LLMs (pre ARC success) as combining dense sampling of the problem space with an extremely weak ability to generalize, contrasting that with the ability of humans to learn from only a few examples. But there is still an acknowledgement there that some non-zero generalization is happening.

But if this is your model of how LLMs work, that their ability to generalize is extremely weak, then you don't expect them to be able to solve ARC problems. They shouldn't be able to solve ARC problems even if they had access to unlimited inference-time compute. OK, so o3 had 1,024 attempts at each task, but that doesn't mean it tried the task 1,024 times until it hit on the correct answer. That would be cheating. It means it tried the task 1,024 times and then did some statistics on all of its candidate solutions before submitting a single guess, which turned out to be right most of the time!
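(To make "did some statistics" concrete, here is a minimal sketch assuming the simplest possible aggregation rule, a majority vote over the sampled candidate answers. I don't know the exact selection procedure that was actually used, and the model.solve call below is purely hypothetical.)

```python
# Illustrative sketch only: sample many candidate solutions, then submit ONE guess.
# The majority-vote rule is an assumption; the real aggregation may differ.
from collections import Counter

def aggregate_guess(candidate_solutions):
    """Pick a single final answer from many sampled candidates.

    candidate_solutions: a list of answers in some hashable form
    (e.g. an ARC output grid serialized as a tuple of tuples).
    Returns the most frequently produced answer (simple majority vote).
    """
    counts = Counter(candidate_solutions)
    best_answer, _count = counts.most_common(1)[0]
    return best_answer

# Hypothetical usage: 1,024 independent samples, one submitted answer.
# samples = [model.solve(task) for _ in range(1024)]  # 'model' is hypothetical
# final_guess = aggregate_guess(samples)
```

The point is that the 1,024 samples never get checked against the ground-truth answer; the aggregation step only looks at the model's own outputs.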

I think it is surprising and impressive that this worked! This wouldn't have worked with GPT-3. You could have given it chain of thought prompting, let it write as much as it wanted per attempt, and given it a trillion attempts at each problem, but I still don't think you would expect to find the correct answer dropping out at the end. In at least this sense, o3 was a genuine improvement in generalization ability.

And Chollet thought it was impressive too, describing it as a "genuine breakthrough", despite all the caveats that go with that (that you've already quoted).

When LLMs can solve a task, but only with masses of training data, then I think it is fair to contrast their data efficiency with that of humans and write off their intelligence as memorization rather than generalization. But when they can only solve a task by expending masses of inference-time compute, I think it is harder to write that off in the same way. Mainly because we don't really know how much inference-time compute humans are using (unless we understand the brain a lot better than I thought we did). I wouldn't be surprised at all if we find that AGI requires spending a lot of inference-time compute. I don't think that would make it any less AGI.

The extreme inference-time compute costs are really important context to bear in mind when forecasting how AI progress is going to go, and what kinds of things are going to be possible. But I don't think they provide a reason to describe the intelligence as not "general", in the way that extreme data inefficiency does.

Because the ARC benchmark was specifically designed to be a test of general intelligence (do you disagree that it successfully achieves this?), and because each problem requires you to spot a pattern from only a couple of examples.

LLMs have made no progress on any of these problems.

I think this probably overstates things? For example, o3 was able to achieve human level performance on ARC-AGI-1, which I think counts as at least some kind of progress on the problems of generalization and data efficiency?

This is a really great exchange, and thank you for responding to the post.

I just wanted to leave a quick comment to say: It seems crazy to me that someone would say the "slow" scenario has "already been achieved"!

Unless I'm missing something, the "slow" scenario says that half of all freelance software engineering jobs taking <8 hours can be fully automated, that any task a competent human assistant can do in <1 hour can be fully automated with no drop in quality (what if I ask my human assistant to solve some ARC-2 problems for me?), that the majority of customer complaints in a typical business will be fully resolved by AI in those businesses that use it, and that AI will be capable of writing hit songs (at least if humans aren't made aware that it is AI-generated)?

I suppose the scenario is framed only to say that AI is capable of all of the above, rather than that it is being used like this in practice. That still seems like an incorrect summary of current capability to me, but it is slightly more understandable. But in that case, it seems the scenario should have just been framed that way: "Slow progress: No significant improvement in AI capabilities from 2025, though possibly a significant increase in adoption". There could then be a separate question about where people think current capabilities stand.

Otherwise disagreements about current capabilities and progress are getting blurred in the single question. Describing the "slow" scenario as "slow" and putting it at the extreme end of the spectrum is inevitably priming people to think about current capabilities in a certain way. Still struggling to understand the point of view that says this is an acceptable way to frame this question.

This post is getting some significant downvotes. I would be interested if someone who has downvoted could explain the reason for that.

There's plenty of room for disagreement on how serious a mistake this is, whether or not it has introduced a 'framing' bias into other results, and what it means for the report as a whole. But it just seems straightforwardly true that this particular question is phrased extremely poorly (it seems disingenuous to suggest that using the phrasing "best matching" covers you for not even attempting to include the full range of possibilities in the list).

I assume that people downvoting are objecting to the way that this post is using this mistake to call the entire report into question, with language like "major flaw". They may have a point there. But I think you should have a very high bar for downvoting someone who is politely highlighting a legitimate mistake in a piece of research.

'Disagree' react to the 'major flaw' language if you like, and certainly comment your disagreements, but silently downvoting someone for finding a legitimate methodological problem in some EA research seems like bad EA forum behaviour to me!

I think I broadly agree with this.

I am very confused about your number 1 con though! Why would promoting frugality be perceived as the rich promoting their own interests over those of the poor? Isn't it exactly the other way around?

To the extent that EA is comfortable with people spending large sums of their money on unnecessary things, I think it is open to the 'elitism' criticism (think of the discussion around SBF's place in the Bahamas). People can justifiably argue: "It is easy to say we should all be donating a lot to charity when you are so rich that you will still have enough left over to live in luxury!"

But if EA advocates frugality for everyone, including the super rich, then this seems like a powerful response to the elitism criticism. I would have put this near the top of the pros list!

I don't think longtermism is a nice solution to this problem. If you're open to letting astronomically large but unlikely scenarios dominate your expected value calculations, then I don't think this rounds out nicely to simply "reduce existential risk". The more accurate summary would be: reduce existential risk according to a worldview in which astronomical value is possible, which is likely to lead to very different recommendations than if you were to attempt to reduce existential risk unconditionally.
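(A toy calculation, with numbers I have made up purely for illustration, shows how easily this happens under naive expected value reasoning:)

```python
# Purely illustrative numbers, not taken from any source: under naive expected
# value reasoning, a tiny-probability astronomical payoff swamps a sure thing.
p_astronomical = 1e-15   # assumed probability the astronomical scenario pays off
v_astronomical = 1e40    # assumed value if it does
v_certain = 1e9          # assumed value of an intervention that works for sure

ev_astronomical = p_astronomical * v_astronomical   # 1e25
ev_certain = 1.0 * v_certain                        # 1e9

# The unlikely astronomical scenario dominates by many orders of magnitude,
# so it ends up steering the recommendation.
print(ev_astronomical > ev_certain)  # True
```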

https://forum.effectivealtruism.org/posts/RCmgGp2nmoWFcRwdn/should-strong-longtermists-really-want-to-minimize

It sounds like MichaelDickens' reply is probably right, that we don't need to consider identical experiences in order for this argument to go through.

But the question of whether identical copies of the same experience have any additional value is a really interesting one. I used to feel very confident that they have no value at all. I'm now a lot more uncertain, after realising that this view seems to be in tension with the many worlds interpretation of quantum mechanics: https://www.lesswrong.com/posts/bzSfwMmuexfyrGR6o/the-ethics-of-copying-conscious-states-and-the-many-worlds 

It seems very strange to me to treat reducing someone else's chance of X differently to reducing your own (if you're confident it would affect each of you similarly)! But thank you for engaging with these questions, it's helping me understand your position better, I think.

By 'collapsing back to expected utility theory' I only meant that if you consider a large enough reference class of similar decisions, it seems like in practice it will be the same as acting as if you had an extremely low discount threshold. But it sounds like I may just not have understood the original approach well enough.

Thanks for writing this up! I'd be interested to read a response to these points.
