Summary
I believe that advanced AI systems will likely be aligned with the goals of their human operators, at least in a narrow sense. I’ll give three main reasons for this:
- The transition to AI may happen in a way that does not give rise to the alignment problem as it’s usually conceived of.
- While work on the alignment problem appears neglected at this point, it's likely that substantial resources will be devoted to it if and when it becomes apparent that alignment is a serious problem.
- Even if the previous two points do not hold, we have already come up with a couple of smart approaches that seem fairly likely to lead to successful alignment.
This argument lends some support to work on non-technical interventions like moral circle expansion or improving AI-related policy, as well as work on specific aspects of AI safety like decision theory or worst-case AI safety measures.
I find it unfortunate that people aren't using a common scale for estimating AI risk, which makes it hard to integrate different people's estimates, or even figure out who is relatively more optimistic or pessimistic. For example here's you (Tobias):
Robert Wiblin, based on interviews with Nick Bostrom, an anonymous leading professor of computer science, Jaan Tallinn, Jan Leike, Miles Brundage, Nate Soares, Daniel Dewey:
Paul Christiano:
It seems to me that Robert's estimate is low relative to your inside view and Paul's, since you're both talking about failures of narrow alignment ("intent alignment" in Paul's current language), while Robert's "serious catastrophe caused by machine intelligence" seems much broader. But you update towards much higher risk based on "other thoughtful people", which makes me think that either your "other thoughtful people" or Robert's interviewees are not representative, or that I'm confused about who is actually more optimistic or pessimistic. Either way, it seems like there's some very valuable work to be done in coming up with a standard measure of AI risk and clarifying people's actual opinions.
Great point – I agree that it would be valuable to have a common scale.
I'm a bit surprised by the 1-10% estimate. This seems very low, especially given that "serious catastrophe caused by machine intelligence" is broader than narrow alignment failure. If we include possibilities like serious value drift as new technologies emerge, or difficult AI-related cooperation and security problems, or economic dynamics riding roughshod over human values, then I'd put much more than 10% (plausibly more than 50%) on something not going well.
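To make the arithmetic behind that intuition explicit, here is a minimal sketch with made-up illustrative numbers (mine, not anyone's actual estimates): if several distinct failure modes each carry only a modest probability and are treated as roughly independent, the chance that at least one of them occurs can easily exceed 50%.

```python
# Illustrative only: hypothetical per-category probabilities, assuming rough independence.
failure_modes = {
    "narrow alignment failure": 0.10,
    "value drift as new technologies emerge": 0.20,
    "AI-related cooperation and security problems": 0.20,
    "economic dynamics overriding human values": 0.20,
}

# P(everything goes well) is the product of avoiding each failure mode.
p_all_avoided = 1.0
for p in failure_modes.values():
    p_all_avoided *= 1.0 - p

p_something_goes_wrong = 1.0 - p_all_avoided
print(f"P(something not going well): {p_something_goes_wrong:.2f}")  # ~0.54
```

The independence assumption is of course questionable (these failure modes are correlated), but the basic point stands: a broad notion of "something not going well" aggregates many channels beyond narrow alignment failure.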
Regarding ...