Summary
I believe that advanced AI systems will likely be aligned with the goals of their human operators, at least in a narrow sense. I’ll give three main reasons for this:
- The transition to AI may happen in a way that does not give rise to the alignment problem as it’s usually conceived of.
- While work on the alignment problem appears neglected at this point, it’s likely that large amounts of resources will be used to tackle it if and when it becomes apparent that alignment is a serious problem.
- Even if the previous two points do not hold, we have already come up with a couple of smart approaches that seem fairly likely to lead to successful alignment.
This argument lends some support to work on non-technical interventions like moral circle expansion or improving AI-related policy, as well as work on special aspects of AI safety like decision theory or worst-case AI safety measures.
In this comment I engage with many of the object-level arguments in the post. I upvoted this post because I think it's useful to write down these arguments, but we should also consider the counterarguments.
(Also, BTW, I would have preferred the word "narrow" or something like it in the post title, because some people use "alignment" in a broad sense and as a result may misinterpret you as being more optimistic than you actually are.)
If the emergence of AI is gradual enough, it does seem that safety issues can be handled adequately, but even many people who think a "soft takeoff" is likely don't seem to think that AI will come that slowly. And to the extent that AI does emerge that slowly, the same consideration cuts against many other AI-related problem areas, including the ones mentioned in the Summary as alternatives to narrow alignment.
Also, a distributed emergence of AI is likely not safer than centralized AI, because an "economy" of AIs would be even harder to control and harness toward human values than a single AI agent or a small number of them. An argument can be made that AI alignment work is valuable in part because it allows unified AI agents to be built safely, thereby heading off such a less controllable AI economy.
So it does not seem like "distributed" by itself buys any safety. I think our intuition that it does probably comes from a sense that "distributed" is correlated with "gradual". If you consider a fast and distributed rise of AI, does that really seem safer than a fast and centralized rise of AI?
This assumes that alignment work is highly parallelizable. If it's not, then doing more alignment work now can shift the whole alignment timeline earlier, rather than just adding marginally to the total amount of alignment work done: serial research that needs decades can't be compressed into a short crunch period simply by throwing resources at it.
This only applies to short-term "alignment" and not to long-term / scalable alignment. That is, I have an economic incentive to build an AI that I can harness for short-term profits, even if that comes at the expense of the long-term value of the universe to humanity or to human values. This could be done, for example, by creating an AI that is not at all aligned with my values and just giving it rewards/punishments so that it has a near-term instrumental reason to help me (similar to how other humans are useful to us even when they are not value-aligned with us).
I have an issue with "approaches" (plural) here because, as far as I can tell, everyone is converging on Paul Christiano's iterated amplification approach (except for MIRI, which is doing more theoretical research). ETA: To be fair, perhaps iterated amplification should be viewed as a cluster of related approaches.
I think we ourselves don't know how to reliably distinguish between "attempts to manipulate" and "attempts to help", so it would be hard for AIs to learn this distinction from us. One problem is that our own manipulate/help classifier was trained on a narrow set of inputs (i.e., other humans manipulating or helping us) and will likely fail when applied to AIs, due to distributional shift.
Same problem here: our own understanding of what it means to be a helpful assistant to somebody likely isn't robust to distributional shift either. I think this means we actually need a broad/theoretical understanding of "corrigibility" or "helping", rather than having AIs just learn it from humans.
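(To make the distributional-shift worry concrete, here's a minimal toy sketch, entirely my own construction and not a model of manipulation vs. helping: a classifier trained on a narrow input distribution picks up a spurious correlation, looks reliable on more data from that same distribution, and degrades once the correlation breaks. The Python/scikit-learn setup, feature names, and numbers are all illustrative assumptions.)

```python
# Toy illustration of distributional shift (not a model of manipulation/helping).
# A classifier trained on a narrow input distribution can look accurate
# in-distribution yet fail once inputs come from a shifted distribution.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, shifted=False):
    # The true label depends only on feature 0.
    x0 = rng.normal(0.0, 1.0, n)
    if shifted:
        # Under shift, feature 1 no longer tracks feature 0.
        x1 = rng.normal(0.0, 1.0, n)
    else:
        # In the training distribution, feature 1 is almost a copy of feature 0,
        # so the classifier can lean on this spurious correlation.
        x1 = x0 + rng.normal(0.0, 0.1, n)
    X = np.column_stack([x0, x1])
    y = (x0 > 0).astype(int)
    return X, y

X_train, y_train = make_data(2000)
X_in, y_in = make_data(2000)                  # same distribution as training
X_out, y_out = make_data(2000, shifted=True)  # shifted distribution

clf = LogisticRegression().fit(X_train, y_train)
print("accuracy in-distribution:", clf.score(X_in, y_in))    # near 1.0
print("accuracy under shift:    ", clf.score(X_out, y_out))  # much lower
```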
Thanks for the detailed comments!
Good point – changed the title.