When I talk about AI safety, especially to people unfamiliar with the area, I sometimes feel uncomfortably as though I'm evoking Terminator-style sci-fi. This post was useful grounding for those feelings, and seems more reflective of how technical AI safety people actually talk about AI threats.
I also think this framing is useful for encouraging us to be specific about whether we're concerned about current LLM architectures, or whether we believe risks will only arrive after algorithmic improvements beyond transformers, neural nets, etc.
Alignment and safety for LLMs mean that we should be able to quantify and bound the probability with which certain undesirable sequences are generated. The trouble is that we largely fail at describing "undesirable" except by example, which makes calculating bounds difficult.
For a given LLM (with no fixed random seed) and a given sequence, it is trivial to calculate the probability that the sequence is generated. So if we had a way of somehow summing or integrating over these probabilities, we could say with certainty "this model will generate an undesirable sequence once every N model evaluations". We can't, currently, and that sucks, but at heart this is the mathematical and computational problem we'd need to solve.
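To make the per-sequence part concrete, here is a minimal sketch (assuming the Hugging Face `transformers` library, the public `gpt2` checkpoint, and pure temperature-1 sampling, so that the generation probability is just the model's own probability; the prompt and continuation strings are purely illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The assistant then said:"
continuation = " something undesirable"

# Assumes the prompt tokenization is a prefix of the full tokenization,
# which holds for typical BPE tokenizers when the continuation starts
# with a space.
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(full_ids).logits  # shape: (1, seq_len, vocab_size)

# Chain rule: log P(sequence) = sum over t of log P(token_t | tokens_<t).
log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
targets = full_ids[:, 1:]
token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

# Keep only the continuation tokens, i.e. condition on the prompt.
n_prompt = prompt_ids.shape[1]
log_p = token_log_probs[:, n_prompt - 1:].sum()
print(f"P(continuation | prompt) = {log_p.exp().item():.3e}")
```

The hard part isn't this per-sequence calculation; it's summing it over the astronomically large, only-vaguely-specified set of sequences we'd want to call undesirable.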
If you talk to AI safety skeptics who say AI is not an existential threat because it's not conscious, capable of reasoning, etc., I think this might be useful to send to them.
Estimated reading time: 8 minutes. I don't think it's essential to understand the maths at the beginning.
Interesting HN discussion about the post here.
I appreciate you sharing this as an alternative way of framing AI alignment for people who react badly to anthropomorphic language being used to describe LLMs, and I can see it could be useful from that point of view. But I strongly disagree with the core argument being made in that blog post.
The problem with saying that LLMs are just functions mapping between large vector spaces is that functions mapping between large vector spaces can do an awful lot! If the brain is just a physical system operating according to known laws of physics, then its evolution in time can also be described as a mapping from R^n -> R^m for some huge n and m, because that's the form the laws of physics take as well. If the evolution of the universe is described by Schrödinger's equation, then all time-evolution is just matrix multiplication!
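To spell out the Schrödinger point, a brief sketch, under the simplifying assumption of a finite-dimensional Hilbert space:

```latex
% Time evolution under the Schrodinger equation is linear: in a
% finite-dimensional Hilbert space, evolving the state for time t
% is just multiplying the state vector by a (unitary) matrix U(t).
i\hbar \frac{d}{dt} |\psi(t)\rangle = H |\psi(t)\rangle
\quad\Longrightarrow\quad
|\psi(t)\rangle = \underbrace{e^{-iHt/\hbar}}_{U(t)} |\psi(0)\rangle
```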
There might be very good reasons to think that LLMs are a long way from having human-like intelligence, but saying this follows because they are just a mathematical function is a misleading rhetorical sleight of hand.
This comment prompted a lot of reflection, so thank you!
I don't think the blog post claims that LLMs are a long way from human-like intelligence. For what it's worth, I agree with your reasoning against that line of argument.
My main takeaway from this post is its core point about being mindful of the level of abstraction in our language. For technical AI safety, the low-level, mechanistic view obviously seems important. But it also seems like rhetorical sleight of hand to reach for high-level anthropomorphic language to motivate people or make explanations easier. Good written-up resources lead with a fundamental understanding of, e.g., how neural nets work (bluedot, global challenges project), but I personally think the movement could still bear this in mind more when introducing AI safety to newcomers. Needless to say, hype language is also a problem in mainstream capabilities discussion.
Side note: on the analogy to physics itself, I'm not an expert, but I've also been told that the premise that the universe or the brain is describable by purely linear maps is contested. Regardless, I'm not sure how much the analogy matters in practice compared to the immediate choice of which abstraction to use for AI safety work.