This isn't a naive or outdated concern. It's a case of a simplified example being misunderstood as the actual concern.
It's worth clarifying that Yudkowsky's squiggle maximizer has nothing to do with actual paperclips you can pick up with your hands.
Many people interpreted this to be about an AI that was specifically instructed to manufacture paperclips, and took the intended lesson to be one of outer alignment failure, i.e., humans failed to give the AI the correct goal. Yudkowsky has since stated that the originally intended lesson was one of inner alignment failure, wherein the humans gave the AI some other goal, but the AI's internal processes converged on a goal that seems completely arbitrary from the human perspective.
The concern is about an AI manipulating atoms into an indefinitely repeating mass-energy efficient pattern, optimized along a (seemingly arbitrary) narrow dimension of reward.
Why might an AI do something unexpected like this? For reasons analogous to why a rational person will guess blue every time in a card-prediction experiment where most of the cards are blue, even though there are some red cards. Lawful Uncertainty demonstrates that even in environments with randomness, the optimal strategy is to follow a determinate pattern rather than matching the perceived probabilities of the environment. Similarly, an AI will optimize toward whatever actually maximizes its reward function, not what appears reasonable or balanced to humans.
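To make the card example concrete, here is a rough sketch comparing "always guess blue" with probability matching. The 70% blue share, the function names, and the trial count are my own assumptions for illustration, not numbers taken from this thread. If the card is blue with probability p and you guess blue with probability q, you are right with probability pq + (1 - p)(1 - q), so with p = 0.7 always guessing blue gets 0.7 while matching gets only 0.58.

```python
import random


def expected_accuracy(p_blue: float, p_guess_blue: float) -> float:
    """Exact chance of a correct guess when the card is blue with probability
    p_blue and the guesser independently says "blue" with probability
    p_guess_blue."""
    return p_blue * p_guess_blue + (1 - p_blue) * (1 - p_guess_blue)


def simulate(p_blue: float, p_guess_blue: float,
             trials: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity, as a sanity check."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        card_is_blue = rng.random() < p_blue
        guess_is_blue = rng.random() < p_guess_blue
        correct += card_is_blue == guess_is_blue
    return correct / trials


P_BLUE = 0.7  # assumed share of blue cards (illustrative only)

# Strategy 1: a determinate pattern, always guess the majority color.
always_blue = expected_accuracy(P_BLUE, 1.0)
# Strategy 2: probability matching, guess blue as often as blue appears.
matching = expected_accuracy(P_BLUE, P_BLUE)

print(f"always blue:          {always_blue:.2f}")
print(f"probability matching: {matching:.2f}")
print(f"simulated always blue: {simulate(P_BLUE, 1.0):.3f}")
print(f"simulated matching:    {simulate(P_BLUE, P_BLUE):.3f}")
```

The "balanced-looking" strategy loses because every red guess is a bet on the less likely outcome. Nothing here depends on the exact 70/30 split; any p above 0.5 gives the same ordering.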
This problem isn't prevented by RLHF or by an AI having a sufficiently nuanced understanding of what humans want. A model can demonstrate perfect comprehension of human values in its outputs while its internal optimization processes still converge toward something else entirely.
The apparent human-like reasoning we see in current LLMs doesn't guarantee their internal optimization targets match what we infer from their outputs.
The issue is not whether the AI understands human morality. The issue is whether it cares.
The arguments from the "alignment is hard" side that I was exposed to don't rely on the AI misinterpreting what the humans want. In fact, superhuman AI is assumed to be better than humans at understanding human morality. It could still do things that go against human morality. Overall I get the impression you misunderstand what alignment is about (or maybe you just have different associations with words like "alignment" than I do).
Whether a language model can play a nice character that would totally give back the dictatorial powers after takeover is barely any evidence about whether the actual superhuman AI system will step back from its position of world dictator after it has accomplished some tasks.
You say that there is a gap between how the model professes it will act and how it will actually act. However, a model trained to obey the RLHF objective will expect negative reward if it decided to take over the world, so why would it? Saying that a model will make harmful, unhelpful choices is akin to saying the base model will output typos. Both of these things are trained against. If you are referring to deceptive alignment, this is an engineering problem, as I stated.
If an AI takes over the world, there is no one around to give it a negative reward. So the AI will not expect a negative reward for taking over the world.
You are referring to alignment faking/deceptive alignment, where a model in training expects negative reward and gives responses accordingly, but outputs its true desires outside of training. This is a solvable problem, which is why I say alignment is not that hard.
Some other counterarguments: