I think Ryan's solution shows that the intelligence is coming from him, and not from GPT-4o.
If this is true, then substituting in a less capable model should have equally good results; would you predict that to be the case? I claim that plugging in an older/smaller model would produce much worse results, and if that's the case then we should consider a substantial part of the performance to be coming from the model.
This is what Chollet is talking about in the podcast when he says: 'I’m pretty skeptical that we’re going to see an LLM do 80% in a year. That said, if we do see it, you would also have to look at how this was achieved.'
This seems to me to be Chollet trying to have it both ways. Either a) ARC is an important measure of 'true' intelligence (or at least of the ability to reason over novel problems), and so we should consider LLMs' poor performance on it a sign that they're not generally intelligent, or b) ARC isn't a very good measure of true intelligence, in which case LLMs' performance on it isn't very important. Those can't both be true. I think that nearly everywhere but in the quote, Chollet has claimed (and continues to claim) that a) is true.
Given that all of the weights in the transformer are frozen after training and RLHF, I'm quite confused about why it's called learning at all. The model certainly isn't learning anything.
I would frame it as: the model is learning but then forgetting what it's learned (due to its inability to move anything from working/short-term memory to long-term memory). That's something that we see in learning in humans as well (one example: I've learned an enormous number of six-digit confirmation codes, each of which I remember just long enough to enter it into the website that's asking for it), although of course not so consistently.
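To make the frozen-weights point concrete, here's a minimal sketch of what I mean (using GPT-2 via Hugging Face as a stand-in, since we can't inspect GPT-4o's weights; the prompt and setup are purely illustrative): whatever the model picks up from the in-context examples lives only in the activations over the prompt, and the parameters are bit-for-bit identical afterwards.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Snapshot every parameter before the model sees any in-context examples.
before = {name: p.detach().clone() for name, p in model.named_parameters()}

# Let the model "learn" a pattern from the prompt and continue it.
prompt = "blue -> BLUE\ngreen -> GREEN\nred ->"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=5)

# The weights are unchanged: nothing was moved into long-term storage.
assert all(torch.equal(before[name], p) for name, p in model.named_parameters())
```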
I have thoughts, but a question first: you link a Kambhampati tweet where he says,
...as the context window changes (with additional prompt words), the LLM, by design, switches the CPT used to generate next token--given that all these CPTs have been pre-computed?
What does 'CPT' stand for here? It's not a common ML or computer science acronym that I've been able to find.
"humans that are just visually observing/predicting the patterns."
I don't think that's actually any simpler than doing it as JSON; it's just that our brains are tuned for (and we're more accustomed to) processing it visually. Depending on the specifics of the JSON format, there may be some advantage to having adjacency represented natively in two dimensions, but I wouldn't expect that to make a huge difference.
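For concreteness, here's roughly what I have in mind (a hypothetical 3x3 grid with colors as small integers; I don't know the exact serialization Ryan used): the JSON carries the same structure, it's just that vertical adjacency has to be recovered by index arithmetic rather than being immediately visible the way it is in a rendered image.

```python
import json

# A hypothetical ARC-style grid: a vertical bar of 1s over a row of 2s.
grid = [
    [0, 1, 0],
    [0, 1, 0],
    [2, 2, 2],
]

print(json.dumps(grid))  # [[0, 1, 0], [0, 1, 0], [2, 2, 2]]

# "The cell below (r, c)" is just grid[r + 1][c]; the two-dimensional
# structure is all there, but not in the format our visual system is tuned for.
r, c = 0, 1
print(grid[r][c], grid[r + 1][c])  # 1 1
```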
From my perspective as a researcher not involved with fieldbuilding, this post misses an important distinction. I do occasionally suggest that new people take a BlueDot course (or apply to AI Safety Camp, or SPAR, or one of the other excellent programs out there), but far more often than that I point new people to the BlueDot curriculum. I commonly see others doing the same; I think it's become the default AIS 101 reading. Maybe you're mistaking that for people pushing the BlueDot course on everyone new to the field?
As a more general and perhaps contrarian pushback: AI safety (other than governance) isn't at all a local problem, and so there's no particular reason to focus on local groups. I realize that some people find it inherently motivating to be in the same room with other people in their own community and build social bonds, so there's some value there. But in general I think it's more valuable for people to find ways to fill important vacant niches in the AIS ecosystem than to focus on replicating another organization but in <location>. That can be supplemented with informal local groups that exist to serve those social needs.
That's not obvious to me; I do think there are constraints there but my sense is that the field is currently mainly bottlenecked by funding (1, 2).
Why are they more likely to give AIS the benefit of the doubt? Won't that be most likely to happen if their exposure is to the highest-quality course they have access to?