I’ve just spent the last three days reading Stuart Russell’s new book on AI safety, ‘Human Compatible’. To be fair, I didn’t read continuously for three days: the book rewards thoughtful pauses to walk or drink coffee, and it nurtures reflection about what really matters.
You see, Russell has written a book about AI for social scientists that is also a book about social science for AI engineers, while at the same time providing the conceptual framework to bring us all ‘provably beneficial AI’.
‘Human Compatible’ is necessarily a whistle-stop tour of very diverse but interdependent thinking across computer science, philosophy and the social sciences, and I recommend that all AI practitioners, technology policymakers, and social scientists read it.
The problem
The key elements of the book are as follows:
- No matter how defensive some AI practitioners get, we all need to agree there are risks inherent in the development of systems that will outperform us
- Chief among these risks is the concern that AI systems will achieve exactly the goals that we set them, even if in some cases we’d prefer if they hadn’t
- Human preferences are complex, contextual, and change over time
- Given the foregoing, we must avoid putting goals ‘in the machine’, but rather build systems that consult us appropriately about our preferences.
Russell argues the case for all these points. The argument is informed by an impressive and important array of findings from philosophy, psychology, behavioural economics, and game theory, among other disciplines.
A key problem, as Russell sees it, is that most present-day technology optimizes a ‘fixed externally supplied objective’. This raises safety issues if the objective is not fully specified (which it can never be) and if the system is not easily reset (which is plausible for a range of AI systems).
The solution
Russell’s solution is that ‘provably beneficial AI’ will be engineered according to three guidelines:
- The machine’s only objective is to maximize the realization of human preferences
- The machine is initially uncertain about what those preferences are
- The ultimate source of information about human preferences is human behaviour
Several tools can be deployed to achieve such a design, including game theory, utilitarian ethics, and an understanding of human psychology. Machines must defer to humans regularly and ask permission, and their programming will explicitly allow for the machines to be wrong and therefore open to being switched off.
Whether you agree with Russell or not, he has provided a framework to which disparate parties can now refer: a common language and usable concepts, accessible to those from all disciplines, with which to progress the AI safety dialogue.
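A toy calculation can illustrate why a machine that is uncertain about our preferences has a reason to defer to us. The sketch below is only an illustration under stated assumptions, not Russell’s own formulation: the function name `expected_values` is hypothetical, the machine’s belief is represented as a plain list of possible utilities, and the human is assumed to know their own preference and to veto exactly those actions whose utility is negative.

```python
def expected_values(utility_samples):
    """Expected payoff of three policies, given samples from the machine's
    belief about how much the human values a proposed action."""
    n = len(utility_samples)
    return {
        # Act without asking: receive whatever the true utility turns out to be.
        "act": sum(utility_samples) / n,
        # Switch itself off: nothing happens, so the payoff is zero.
        "off": 0.0,
        # Defer: the human (assumed here to know their own preference)
        # allows the action only when its utility is positive.
        "defer": sum(max(u, 0.0) for u in utility_samples) / n,
    }


# Uncertain machine: the action might help (+1) or harm (-1).
uncertain = expected_values([1.0, -1.0])
# -> {"act": 0.0, "off": 0.0, "defer": 0.5}

# Certain machine: it is sure the action is worth +2.
certain = expected_values([2.0, 2.0])
# -> {"act": 2.0, "off": 0.0, "defer": 2.0}
```

In this toy model the uncertain machine does strictly better by asking, because the human’s veto filters out the harmful case. The certain machine gains nothing from asking, which is one way to picture the danger of a fixed goal ‘in the machine’.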
If you think that goals should be hard-coded, then you must point out why Russell’s warnings about fixed goals are mistaken. If you think that human preferences can always be predicted, then you must explain why centuries of social science research are flawed. And be aware that Russell preempts many of the inadequate slogan-like responses to these concerns.
I found an interesting passage late in the book where the argument is briefly extended from machines to political systems. We vote every few years on a government (expressing our preferences). Yet the government then acts unilaterally (according to its goals) until the next election. Russell is disparaging of this process whereby ‘one byte of information’ is contributed by each person every few years. One can infer that he may also disapprove of the algorithms of large corporate entities with perhaps 2 billion users acting autonomously on the basis of ‘one byte’ of agreement with blanket terms and conditions.
Truly ‘human compatible’ AI will ask us regularly what we want, and then provide that to us, checking to make sure it has it right. It will not dish up solutions to satisfy a ‘goal in the machine’ which may not align with current human interests.
What do we want to want?
The book makes me think that we need to be aware that machines will be capable of changing our preferences (we already experience this with advertising) and indeed machines may do so in order to more easily satisfy the ‘goals in the machine’ (think of online engagement and recommendation engines). It seems that we (thanks to machines) are now capable of shaping our environment (digital or otherwise) in such a way that we can shape the preferences of people. Ought this be allowed?
We must be aware of this risk. If you prefer A to B, and are made to prefer B, then how is this permitted? As Russell notes, would it ever make sense for someone to choose to switch from preferring A to preferring B, given that they currently prefer A?
This point actually runs very deep and a lot more philosophical thought needs to be deployed here. If we can build machines that can get us what we want, but we can also build machines that can change what we want, then we need to figure out an answer to the following deeply thought-provoking question, posed by Yuval Noah Harari at the end of his book ‘Sapiens’: ‘What do we want to want?’ There is no dismissive slogan answer to this problem.
What ought intelligence be for?
In the present context we are using ‘intelligence’ to refer to the operation of machines, but in a mid-2018 blog post I posed the question: what ought intelligence be used for? The point being that we are now debating how we ought to deploy AI, but what uses of other kinds of intelligence are permissible?
The process of developing and confronting an intelligence other than our own is cause for some self-reflexive thought. If there are certain features and uses of an artificial intelligence that we wouldn’t permit, then how are we justified in permitting similar goals and methods in humans? If Russell’s claims that we should want altruistic AI have any force, then why do we permit non-altruistic human behaviour?
Are humans ‘human compatible’?
I put down this book agreeing that we need to control AI (and indeed we can, according to Russell, with good engineering). But if intelligence is intelligence is intelligence, then must we necessarily turn to humans and constrain them in the same way, so that humans don’t pursue ‘goals inside the human’ that are significantly at odds with ‘our’ preferences?
The key here is defining ‘our’. Whose preferences matter? There is a deep and complex history of moral and political philosophy addressing this question, and AI developers would do well to familiarise themselves with key aspects of it. As would corporations, as would policymakers. Intelligence has for too long been used poorly.
Russell notes that many AI practitioners strongly resist regulation and may feel threatened when non-technical influences encroach on ‘their’ domain. But the deep questions above, coupled with the risks inherent in ‘goals in the machine’, require an informed and collaborative approach to beneficial AI development. Russell is an accomplished AI practitioner speaking on behalf of philosophers to AI scientists, but hopefully this book will speak to everyone.
The view I would advocate is that something like utilitarianism (i.e., some form of impartial, species-indifferent welfare maximization) is a core part of human values. What I mean by 'human values' here isn't on your list; it's closer to an idealized version of our preferences: what we would prefer if we were smarter, more knowledgeable, had greater self-control.
The language of "human-compatible" is very speciesist, since ethically we should want AGI to be "compatible" with all moral patients, human or not.
On the other hand, the idea of using human brains as a "starting point" for identifying what's moral makes sense. "Which ethical system is correct?" isn't written in the stars or in Plato's heaven; it seems like if the answer is encoded anywhere in the universe, it must be encoded in our brains (or in logical constructs out of brains).
The same is true for identifying the right notion of "impartial", "fair", "compassionate", "taking other species' welfare into account", etc.; to figure out the correct moral account of those important values, you would primarily need to learn facts about human brains. You'd then need to learn facts about non-humans' brains in order to implement the resultant impartiality procedure (because the relevant criterion, "impartiality", says that whether you have human DNA is utterly irrelevant to moral conduct).
The need to bootstrap from values encoded in our brains doesn't and shouldn't mean that humans are the only moral patients (or even that we're particularly important moral patients; insects could turn out to be utility monsters, for all we know today). Hence "human-compatible" is an unfortunate phrase here.
But it does mean that if, e.g., it turns out that cats' ultimate true preferences are to torture all species forever, we shouldn't give that particular preference equal decision weight. Speaking very loosely, the goal is more like 'ensuring all beings get to have a good life', not like 'ensuring all species (however benevolent or sadistic they turn out to be) get an equal say in what kind of life all beings get to live'.
If there's a more benevolent species than humans, I'd hope that sufficiently advanced science could identify that species, and pass the buck to them. (In an odd sense, we're already building an alien species to defer to if we're constructing 'an idealized version of human preferences', since I would expect sufficiently idealized preferences to turn out to be pretty alien compared to the views human beings espouse today.)
I think it's reasonable to worry that given humans' flaws, humans might not in fact build AGI that 'ensures all beings get to have a good life'. But I do think that something like the latter is the goal; and when you ask me what physical facts in the world make that 'the goal', and what we would need to investigate in order to work out all the wrinkles and implementation details, I'm forced to initially point to facts about humans (if only to identify the right notions of 'what a moral patient is' and 'how one ought to impartially take into account all moral patients' welfare').