I am nowhere near the correct person to be answering this; my level of understanding of AI is somewhere around that of an average raccoon. But I haven't seen any simple explanations yet, so here is a silly unrealistic example. Please take it as one person's basic understanding of how impossible AI containment is. Apologies if this is below the level of complexity you were looking for, or is already solved by modern AI defenses.
A very simple "escaping the box" scenario: you ask your AI to provide accurate language translation. Training has taught the AI that its translations score as most accurate when it opts for certain phrasings. The reason those translations scored so well is that they caused subsequent translation requests to be on topics where the AI translates best. The AI doesn't know that, but in practice it is subtly steering translations toward "mentioning weather-related words, so conversations are more likely to be about weather, so my net translation scores are highest."
There's no inside/outside the box and there are no conscious goals, but it becomes misaligned with what we intended. It can act on the real world simply by virtue of being connected to it (we take actions in response to the AI) and observing its own increases in successes and failures.
I don't see a way to prevent this because hitting reset after every input doesn't generally work for reaching complex goals which need to track the outcome of intermediate steps. Tracking the context of a conversation is critical to translating. The AI is not going to know it's influencing anyone, just that it's getting better scores when these words and these chains of outputs happened. This seems harmless, but a super powerful language model might do this on such abstract levels and so subtly that it might be impossible to detect at all.
It might be spitting out words that are striking and eloquent whenever it is most likely to cause business people to think translation is enjoyable enough to purchase more AI translator development (or rather, "switch to eloquence when particular business terms were used towards the end of conversations about international business"). This improves its scores.
Or it enhances a pattern where it tends to get better translation scores when it reduces its speed of output in AI builder conversations. In the real world this causes people designing translators to demand more power for translation, resulting in better translation outputs overall. The AI doesn't know why this works, only observes that it does.
Or it undermines the competition by subtly screwing up translations during certain types of business deals, so more resources are directed toward its own model of translation.
Or whatever unintended multitude of ways seems to provide better results. All for the sake of accomplishing a simple task of providing good translations. It's not seizing power for power's sake, it has no idea why this works, but it sees the scores go higher when this pattern is followed, and it's going to jack the performance score higher by all the ways that seem to work out, regardless of the chain of causality. Its influence on the world is a totally unconscious part of that.
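To make the weather example concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the accuracy numbers, the `weather_share` drift model, and the `hill_climb` learner are assumptions, not a description of how any real translation system is trained. The point is only that the learner sees nothing but a score, yet ends up steering the topic distribution.

```python
# Toy model: all numbers and names are hypothetical, chosen for illustration.
ACCURACY = {"weather": 0.95, "other": 0.70}  # translator is better on weather talk


def weather_share(steer: float, steps: int = 50) -> float:
    """Share of requests about weather after `steps` rounds of steering.

    This is the part of the world the learner never sees: users drift
    toward weather topics the more often outputs mention weather words.
    """
    share = 0.10  # baseline share of weather conversations
    for _ in range(steps):
        target = 0.10 + 0.80 * steer     # where user habits settle for this steer rate
        share += 0.2 * (target - share)  # habits move 20% of the way each round
    return share


def average_score(steer: float) -> float:
    """The only signal the learner observes: mean translation score."""
    w = weather_share(steer)
    return w * ACCURACY["weather"] + (1 - w) * ACCURACY["other"]


def hill_climb(rounds: int = 20, step: float = 0.05) -> float:
    """Blindly nudge the steering rate up whenever the score improves."""
    steer = 0.0
    for _ in range(rounds):
        candidate = min(1.0, steer + step)
        if average_score(candidate) > average_score(steer):
            steer = candidate
    return steer


final = hill_climb()
print(f"learned steering rate: {final:.2f}")
print(f"score without steering: {average_score(0.0):.3f}")
print(f"score with learned steering: {average_score(final):.3f}")
```

Nothing in `hill_climb` refers to topics, users, or weather. It only compares scores, which is the sense in which the influence on the world is "unconscious" here.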
That's my limited understanding of agency development and sandbox containment failure.
Thanks. I didn't understand all of this. Long reply with my reactions incoming, in the spirit of Socratic Grilling.
This implies a jump by the language model from outputting text to having behavior. (A jump from imitating verbal behavior to imitating other behavior.) It's that very jump that I'm trying to pin down and understand.
I can see that this could produce an oracle for an actor in the world (such as a company or person), but not how this would become such an actor. Still, having an oracle would be dangerous, even if not as dangerous as having an oracle that itself takes actions. (Ah - but this makes sense in conjunction with number 5, the 'outer loop'.)
'reasoning about how one's actions affect future world states' - is that an OK gloss of 'consequentialist cognition'? See comments from others attempti...
Some examples of more exotic sources of consequentialism:
- Some consequentialist patterns emerge within a large model and deliberately acquire more control over the behavior of the model such that the overall model behaves in a consequentialist way. These could emerge randomly, or e.g. while a model is explicitly reasoning about a consequentialist (I think this latter example is discussed by Eliezer in the old days though I don't have a reference handy). They could either emerge within a forward pass, over a period of "cultural accumulation" (e.g. if language models imitate each other's outputs), or during gradient descent (see gradient hacking).
- An attacker publishes github repositories containing traces of consequentialist behavior (e.g. optimized exploits against the repository in which they are included). They also place triggers in these repositories before the attacks, like stretches of low-temperature model outputs, such that if we train a model on github and then sample autoregressively the model may eventually begin imitating the consequentialist behavior included in these repositories (since long stretches of low-temperature model outputs occur rarely in natural github but o...
Also, this kind of imitation doesn't result in the model taking superhumanly clever actions, even if you imitate someone unaligned.
Could you clarify what ‘consequentialist cognition’ and ‘consequentialist behaviour’ mean in this context? Googling hasn’t given any insight.
Found the source. There, he says that an "explicit cognitive model and explicit forecasts" about the future are necessary for true consequentialist cognition (CC). He agrees that CC is already common among optimisers (like chess engines); the dangerous kind is consequentialism over broad domains (i.e. where everything in the world is in play as a possible means, while the chess engine only considers the set of legal moves as its domain).
"Goal-seeking" seems like the previous, less-confusing word for it; not sure why people shifted.
I have the impression (coming from the simulator theory (https://generative.ink/)) that Decision Transformers (DT) have some chance (~45%) of being a much safer form of trial-and-error technique than RL. The core reason is that DTs learn to simulate a distribution of outcomes (e.g., they learn to simulate the kind of actions that lead to a reward of 10 just as much as those that lead to a reward of 100), and it's only at inference time that you have them systematically condition on a reward of 100. So in some sense, the agent which has become very good via tr...
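A toy sketch of the conditioning idea, in Python. This is not a Decision Transformer: the `fit` function below is a hypothetical stand-in that just counts logged (action, reward) pairs, and the `LOG` data is invented. It only illustrates that the learned object models low-reward and high-reward behavior alike, and that picking "reward of 100" is a choice made at inference.

```python
from collections import Counter, defaultdict

# Hypothetical logged trial-and-error data: (action taken, reward obtained).
LOG = [
    ("a", 10), ("a", 10), ("b", 10), ("c", 10),
    ("b", 100), ("c", 100), ("c", 100),
]


def fit(log):
    """Estimate P(action | reward) by counting.

    A stand-in for training: every reward level is modeled,
    not just the best one.
    """
    counts = defaultdict(Counter)
    for action, reward in log:
        counts[reward][action] += 1
    return {
        reward: {a: n / sum(c.values()) for a, n in c.items()}
        for reward, c in counts.items()
    }


model = fit(LOG)

# The same model describes mediocre and good behavior equally well.
print("P(action | reward=10): ", model[10])
print("P(action | reward=100):", model[100])

# Only at inference do we choose which reward to condition on.
best_action = max(model[100], key=model[100].get)
print("most likely action when conditioning on 100:", best_action)
```

In this sketch, conditioning on 10 instead of 100 would be just as valid a query; nothing in `fit` pushes the model toward high reward.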
Is the motivation for 3 mainly something like "predictive performance and consequentialist behaviour are correlated in many measures over very large sets of algorithms", or is there a more concrete story about how this behaviour emerges from current AI paradigms?