Alexander Turner

Research scientist @ Google DeepMind

Paul, I think deceptive alignment (or other spontaneous, stable-across-situations goal pursuit) after just pretraining is very unlikely. I am happy to take bets if you're interested. If so, email me (alex@turntrout.com), since I don't check this very much. 

I think that "deceptively aligned during pre-training" is closer to e.g. Eliezer's historical views.

I agree, and the actual published arguments for deceptive alignment I've seen don't depend on any difference between pretraining and finetuning, so they can't apply to only one of the two. (Unsurprisingly, people have tried to tell me that the arguments haven't historically focused on pretraining.)