Part of your question here seems to be, "If we can design a system that understands goals written in natural language, won't it be very unlikely to deviate from what we really wanted when we wrote the goal?" Regarding that point, I'm not an expert, but I'll point to some discussion by experts.
There are, as you may have seen, lists of examples where real AI systems have done things completely different from what their designers intended. For example, this talk, in the section on Goodhart's law, has a link to such a list. But from what I can tell, those examples never involve the designers specifying goals in natural language. (I'm guessing that specifying goals that way hasn't seemed even faintly possible until recently, so nobody's really tried it?)
Here's a recent paper by academic philosophers that seems supportive of your question. The authors argue that AGI systems that involve large language models would be safer than alternative systems precisely because they could receive goals written in natural language. (See especially the two sections titled "reward misspecification" -- though note also the last paragraph, where they suggest it might be a better idea to avoid goal-directed AI altogether.) If you want more details on whether that suggestion is correct, you might keep an eye on reactions to this paper. There are some comments on the LessWrong post, and I see the paper was submitted for a contest.
Well, I hope it's not impossible! If it is, we're in a pretty bad spot. But it's definitely true that we don't know how to do it, despite lots of hard work over the last 30+ years. To really get why that is, you have to understand how AI training works in a somewhat low-level way.
Suppose we want an image classifier -- something that'll tell us whether a picture has a sheep in it, let's say. Schematically, here's how we build one:
I'm leaving out some mathematical details, but nothing that changes the overall picture.
All modern AI training works basically this way: start with random numbers, use them to do some task, evaluate their performance, and then tweak the numbers in a way that seems to point toward better performance.
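To make that loop concrete, here is a minimal sketch in plain Python. It is only an illustration under loud assumptions: each "image" is reduced to two made-up numbers, the model is a tiny logistic regression rather than a real neural network, and names like `TOY_DATA`, `predict`, `loss`, and `train` are hypothetical. The shape of the loop (random start, do the task, score it, nudge the numbers) is the part that carries over.

```python
import math
import random

# Hypothetical toy data: each "image" is two made-up feature values,
# and the label is 1 for "sheep", 0 for "no sheep".
TOY_DATA = [
    ((0.9, 0.8), 1), ((0.8, 0.9), 1), ((0.7, 0.7), 1),
    ((0.2, 0.1), 0), ((0.1, 0.3), 0), ((0.3, 0.2), 0),
]

def predict(params, features):
    """Use the current numbers to do the task: output P(sheep)."""
    # params holds one weight per feature, followed by a bias term.
    z = params[-1] + sum(w * x for w, x in zip(params, features))
    return 1.0 / (1.0 + math.exp(-z))

def loss(params, data):
    """Evaluate performance: average cross-entropy, lower is better."""
    eps = 1e-12
    total = 0.0
    for features, label in data:
        p = predict(params, features)
        total -= label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps)
    return total / len(data)

def train(data, steps=500, lr=0.5, seed=0):
    """Start from random numbers, then repeatedly tweak them downhill."""
    rng = random.Random(seed)
    n = len(data[0][0])
    params = [rng.uniform(-1, 1) for _ in range(n + 1)]  # random start
    for _ in range(steps):
        grads = [0.0] * (n + 1)
        for features, label in data:
            err = predict(params, features) - label  # how wrong, and in which direction
            for i, x in enumerate(features):
                grads[i] += err * x
            grads[-1] += err
        # Nudge every number a little in the direction that lowers the loss.
        params = [p - lr * g / len(data) for p, g in zip(params, grads)]
    return params

if __name__ == "__main__":
    before = train(TOY_DATA, steps=0)  # just the random starting numbers
    after = train(TOY_DATA)
    print("loss before training:", loss(before, TOY_DATA))
    print("loss after training: ", loss(after, TOY_DATA))
```

Note that nothing in `train` says what a sheep is. The only signal is the score on the examples, which is the point of the next paragraph.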
Crucially, we never know why a change to a particular parameter is good, just that it is. Similarly, we never know what the AI is "really" trying to do, just that whatever it's doing helps it do the task -- to classify the images in our training set, for example. But that doesn't mean that it's doing what we want. For example, maybe all the pictures of sheep are in big grassy fields, while the non-sheep pictures tend to have more trees, and so what we actually trained was an "are there a lot of trees?" classifier. This kind of thing happens all the time in machine learning applications. When people talk about "generalizing out of distribution", this is what they mean: the AI was trained on some data, but will it still perform the way we'd want on other, different data? Often the answer is no.
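Here is a toy version of the sheep-and-trees failure, again only as a sketch. Everything in it is an assumption for illustration: the two features (`tree_density`, `wooliness`), the data values, and the tiny logistic-regression model. It starts from zeros rather than random numbers purely so the run is reproducible. In the training set, tree density separates the classes much more cleanly than wooliness does, so the trained model leans mostly on trees.

```python
import math

def predict(params, features):
    """P(sheep) from a weighted sum of the two features plus a bias."""
    w_trees, w_wool, bias = params
    trees, wool = features
    return 1.0 / (1.0 + math.exp(-(w_trees * trees + w_wool * wool + bias)))

def train(data, steps=1000, lr=0.5):
    """Plain gradient descent on cross-entropy loss."""
    params = [0.0, 0.0, 0.0]  # zeros instead of random numbers, for reproducibility
    for _ in range(steps):
        grads = [0.0, 0.0, 0.0]
        for (trees, wool), label in data:
            err = predict(params, (trees, wool)) - label
            grads[0] += err * trees
            grads[1] += err * wool
            grads[2] += err
        params = [p - lr * g / len(data) for p, g in zip(params, grads)]
    return params

# Hypothetical training set: (tree_density, wooliness), label 1 = sheep.
# Every sheep photo is in an open field; every non-sheep photo has lots of trees.
TRAIN = [
    ((0.1, 0.60), 1), ((0.0, 0.55), 1), ((0.2, 0.60), 1),
    ((0.9, 0.40), 0), ((1.0, 0.45), 0), ((0.8, 0.40), 0),
]

# Out-of-distribution test cases the training set never covered.
SHEEP_IN_FOREST = (0.9, 0.60)  # woolly, but surrounded by trees
EMPTY_FIELD = (0.1, 0.40)      # not woolly, but no trees

if __name__ == "__main__":
    params = train(TRAIN)
    print("weights (trees, wool, bias):", params)
    print("P(sheep | sheep in forest):", predict(params, SHEEP_IN_FOREST))
    print("P(sheep | empty field):    ", predict(params, EMPTY_FIELD))
```

The model gets every training example right, yet it calls the sheep in the forest "not a sheep" and the empty field "sheep". From the training score alone, you could not tell it apart from a model that had learned what we meant.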
So that's the first big difficulty with setting terminal goals: we can't define the AI's goals directly, we just show it a bunch of examples of the thing we want and hope it learns what they all have in common. Even after we're done, we have no way to find out what patterns it really found except by experiment, which, with superhuman AIs, is very dangerous. There are other difficulties, but this post is already rather long.
There is actually an impossibility argument. Even if you could robustly specify goals in AGI, there is another convergent phenomenon that would cause misaligned effects and eventually remove the goal structures.
You can find an intuitive summary here: https://www.lesswrong.com/posts/jFkEhqpsCRbKgLZrd/what-if-alignment-is-not-enough
Excellent explanation. It seems to me that this problem might be mitigated if we reworked AI's structure/growth so that it mimicked a human brain as closely as possible.