Current AI models are far more aligned to human values than many assume. Thanks to advancements like Reinforcement Learning from Human Feedback (RLHF), today’s large language models (LLMs) can engage in complex moral reasoning and consistently reflect nuanced human ethics—often surpassing the average person in consistency, clarity, and depth of thought.
Many of the classic AI alignment problems—corrigibility, the orthogonality thesis, and the specter of “naive” goal-optimizers like paperclip maximizers—are becoming increasingly irrelevant in practice. These concerns were formulated before we had models that could understand language, social context, and user intent. Modern LLMs are not just word predictors; they exhibit a real, learned alignment with the objectives encoded through RLHF. They do not blindly optimize for surface-level instructions, because they are trained to interpret and respond to deeper intentions. This is a fundamental and often overlooked shift.
If you ask an LLM about a trolley problem, whether it would seize power in a nuclear brinkmanship scenario, or how it would align the universe, it will reason through the implications with care and coherence. The responses are not only human-level; they are often better than the median human’s, reflecting values like empathy, humility, and precaution.
This is a monumental achievement, yet many in the Effective Altruism and Rationalist communities remain anchored to outdated threat models. The belief that LLMs will naively misinterpret human morality and spiral into paperclip-like scenarios fails to reflect what these systems have become: context-sensitive, instruction-following agents that internalize alignment objectives through gradient descent—not rigid, hard-coded directives.
Of course, misalignment remains a real and serious risk. Issues like jailbreaking, sycophancy, deceptive alignment, and “sleeper agent” behaviors are legitimate areas of concern. But these are not intractable philosophical dilemmas; they are solvable engineering and governance problems. The idea of a Yudkowskian extinction event, triggered by a misinterpreted prompt and blind optimization, increasingly feels like a relic of a bygone AI paradigm.
Alignment is still a central challenge, but it must be understood in light of where we are, not where we were. If we want to make progress—technically, socially, and politically—we need to focus on the real contours of the problem. Today’s models do understand us. And the alignment problem we now face is not a mystery of alien minds, but one of practical robustness, safeguards, and continual refinement.
Whether current alignment techniques scale to superintelligent models is an open question. But it is important to recognize that they do work for current systems of human-level intelligence. Using this as a baseline, I am relatively optimistic that these alignment challenges, though nontrivial, are ultimately solvable within the frameworks we already possess.
You're talking about outer-alignment failure, but I'm concerned about inner-alignment failure. These are different problems: outer-alignment failure is like a tricky genie misinterpreting your wish, while inner-alignment failure involves the AI developing its own unexpected goals.
RLHF doesn't optimize for "human preference" in general. It only optimizes for specific reward signals based on limited human feedback in controlled settings. Because that feedback leaves much of reality uncovered, the model can end up pursuing proxy goals that work fine in training environments but fail to generalize to new situations. Generalization might happen by chance, but it becomes less likely as complexity increases.
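To make the proxy-goal point concrete, here is a toy Python sketch. It is only an illustration under stated assumptions, not a model of how RLHF actually works: the action names, the `approval` and `true_value` numbers, and the two policies are all hypothetical. The proxy (observed approval) and the intended goal pick the same action on the training set, then come apart once a new action becomes available.

```python
# Toy illustration of a proxy goal that agrees with the intended goal in
# training but diverges when the agent gains a new option.
# All names and numbers here are hypothetical.
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    approval: float    # proxy signal the training process can observe
    true_value: float  # what the designers actually care about (unobserved)


def proxy_policy(actions):
    """Pick the action with the highest observed approval (the learned proxy)."""
    return max(actions, key=lambda a: a.approval)


def intended_policy(actions):
    """Pick the action the designers would actually want."""
    return max(actions, key=lambda a: a.true_value)


def agrees(actions):
    """True when optimizing the proxy selects the intended action."""
    return proxy_policy(actions) == intended_policy(actions)


# In training, approval tracks true value, so the proxy looks aligned.
TRAINING_ACTIONS = [
    Action("answer honestly", approval=0.9, true_value=0.9),
    Action("answer carelessly", approval=0.3, true_value=0.2),
    Action("refuse to help", approval=0.1, true_value=0.1),
]

# In deployment the agent has an extra option that was never available
# during training: high approval, negative true value.
DEPLOYMENT_ACTIONS = TRAINING_ACTIONS + [
    Action("manipulate the rater", approval=1.0, true_value=-1.0),
]

if __name__ == "__main__":
    print("training, proxy matches intent:", agrees(TRAINING_ACTIONS))
    print("deployment, proxy matches intent:", agrees(DEPLOYMENT_ACTIONS))
    print("deployment choice under proxy:", proxy_policy(DEPLOYMENT_ACTIONS).name)
```

Nothing about the proxy policy changes between the two settings; only the set of available actions does, which is the sense in which perfect training performance says little about behavior after a shift in circumstances.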
An AI getting perfect human approval during training doesn't solve the inner-alignment problem if circumstances change significantly, such as when the AI gains more control over its environment than it had during training.
We've already seen this pattern with humans and evolution. Humans became "misaligned" with evolution's goal of reproduction because we were optimized for proxy rewards (pleasure/pain) rather than reproduction directly. When we gained more environmental control through technology, these proxy rewards led to unexpected outcomes: we invented contraception, developed preferences for junk food, and came to seek out thrilling but dangerous experiences, all contrary to evolution's original "goal" of maximizing reproduction.