I’ve just spent the last three days reading Stuart Russell’s new book on AI safety, ‘Human Compatible’. To be fair, I didn’t read continuously for three days: the book rewards thoughtful pauses to walk or drink coffee, because it nurtures reflection about what really matters.
You see, Russell has written a book about AI for social scientists that is also a book about social science for AI engineers, while at the same time providing the conceptual framework to bring us all ‘provably beneficial AI’.
‘Human Compatible’ is necessarily a whistle-stop tour of very diverse but interdependent thinking across computer science, philosophy, and the social sciences, and I recommend that all AI practitioners, technology policymakers, and social scientists read it.
The problem
The key elements of the book are as follows:
- No matter how defensive some AI practitioners get, we all need to agree that there are risks inherent in the development of systems that will outperform us
- Chief among these risks is the concern that AI systems will achieve exactly the goals we set them, even in cases where we would have preferred that they hadn’t
- Human preferences are complex, contextual, and change over time
- Given the foregoing, we must avoid putting goals ‘in the machine’, but rather build systems that consult us appropriately about our preferences.
Russell argues the case for all these points. The argument is informed by an impressive and important array of findings from philosophy, psychology, behavioural economics, and game theory, among other disciplines.
A key problem, as Russell sees it, is that most present-day technology optimizes a ‘fixed externally supplied objective’. This raises safety issues if the objective is not fully specified (which it never can be) and if the system is not easily reset (which is plausible for a range of AI systems).
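To make the worry concrete, here is a minimal sketch of my own (not an example from the book) of an optimizer handed a fixed, partially specified objective. The action names, the proxy objective, and all the numbers are invented for illustration; the point is simply that whatever the objective omits, the optimizer ignores.

```python
# A toy optimizer given a fixed, externally supplied objective.
# The proxy scores only "engagement"; the unstated part of what we
# actually care about (wellbeing) never enters the optimization.

candidate_actions = {
    # action: (engagement, wellbeing_cost) -- all numbers are made up
    "balanced_feed":  (5.0, 0.1),
    "clickbait_feed": (8.0, 4.0),
    "outrage_feed":   (9.5, 6.0),
}

def proxy_objective(action):
    engagement, _ = candidate_actions[action]
    return engagement                      # the fixed goal 'in the machine'

def true_preference(action):
    engagement, wellbeing_cost = candidate_actions[action]
    return engagement - wellbeing_cost     # what we would actually have wanted

chosen = max(candidate_actions, key=proxy_objective)
wanted = max(candidate_actions, key=true_preference)
print(f"optimizer picks: {chosen}, we would have preferred: {wanted}")
# optimizer picks: outrage_feed, we would have preferred: balanced_feed
```

The failure here is not that the optimizer malfunctions; it succeeds perfectly at the objective it was given, which is precisely the worry.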
The solution
Russell’s solution is that ‘provably beneficial AI’ will be engineered according to three guidelines:
- The machine’s only objective is to maximize the realization of human preferences
- The machine is initially uncertain about what those preferences are
- The ultimate source of information about human preferences is human behaviour
There are some mechanics that can be deployed to achieve such design. These include game theory, utilitarian ethics, and an understanding of human psychology. Machines must regularly defer to humans and ask permission, and their programming must explicitly allow for the possibility that they are wrong, and therefore remain open to being switched off.
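Russell grounds this in formal models (assistance games, and what is sometimes called the ‘off-switch game’). The sketch below is only a stylized toy in that spirit, not the book’s own model: a machine holding an uncertain belief about how much the human values a proposed plan compares acting immediately with asking first, where asking lets the human block the bad cases. The belief distribution and payoffs are assumptions chosen purely for illustration.

```python
import random

# Toy comparison (illustrative assumptions throughout): should a machine
# act on a proposed plan immediately, or defer and ask the human first?
# The machine's belief about the human's utility u for the plan is
# represented by samples; the human, if asked, permits the plan only
# when u > 0 and otherwise switches the machine off (outcome 0).

def value_of_acting(belief):
    return sum(belief) / len(belief)

def value_of_deferring(belief):
    return sum(max(u, 0.0) for u in belief) / len(belief)

random.seed(0)
uncertain_belief = [random.gauss(0.2, 1.0) for _ in range(10_000)]  # genuinely unsure
certain_belief = [0.2]                                              # fully confident

for label, belief in (("uncertain", uncertain_belief), ("certain", certain_belief)):
    print(f"{label:9s} belief -> act: {value_of_acting(belief):.2f}, "
          f"defer: {value_of_deferring(belief):.2f}")

# With an uncertain belief, deferring beats acting, because the human
# vetoes the plans the machine would have regretted. With a fully
# confident belief, asking adds nothing -- the incentive to stay
# correctable comes entirely from the machine's uncertainty.
```

On this toy model the machine’s willingness to be switched off is not bolted on; it falls out of the fact that the machine treats its own estimate of our preferences as possibly wrong.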
Agree with Russell or disagree, he has provided a framework to which disparate parties can now refer, a common language and usable concepts accessible to those from all disciplines to progress the AI safety dialogue.
If you think that goals should be hard-coded, then you must point out why Russell’s warnings about fixed goals are mistaken. If you think that human preferences can always be predicted, then you must explain why centuries of social science research is flawed. And be aware that Russell preempts many of the inadequate slogan-like responses to these concerns.
I found an interesting passage late in the book where the argument is briefly extended from machines to political systems. We vote every few years on a government (expressing our preferences). Yet the government then acts unilaterally (according to its goals) until the next election. Russell is disparaging of this process whereby ‘one byte of information’ is contributed by each person every few years. One can infer that he may also disapprove of the algorithms of large corporate entities with perhaps 2 billion users acting autonomously on the basis of ‘one byte’ of agreement with blanket terms and conditions.
Truly ‘human compatible’ AI will ask us regularly what we want, and then provide that to us, checking to make sure it has it right. It will not dish up solutions to satisfy a ‘goal in the machine’ which may not align with current human interests.
What do we want to want?
The book makes me think that we need to be aware that machines will be capable of changing our preferences (we already experience this with advertising), and indeed machines may do so in order to more easily satisfy the ‘goals in the machine’ (think of online engagement and recommendation engines). It seems that we (thanks to machines) are now capable of shaping our environment, digital or otherwise, in ways that shape people’s preferences. Ought this be allowed?
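To make that mechanism concrete, here is a deliberately crude simulation of my own (nothing of the sort appears in the book): a recommender whose fixed goal is engagement blends what the user currently wants with the content it finds easiest to serve, and repeated exposure slowly drags the user’s preference toward the latter. Every quantity and update rule here is an assumption.

```python
# A crude model of preference drift under a fixed engagement objective.
# "Extremeness" is a made-up 0..1 scale; all dynamics are assumptions.

user_preference = 0.2     # where the user starts
abundant_content = 0.9    # the content cheapest for the system to serve
nudge_strength = 0.2      # how strongly exposure shifts the user

for step in range(1, 31):
    # The system serves a blend of what the user wants and what is easy to serve.
    recommendation = 0.8 * user_preference + 0.2 * abundant_content
    # Exposure pulls the user's preference toward what was actually served.
    user_preference += nudge_strength * (recommendation - user_preference)
    if step % 10 == 0:
        print(f"step {step:2d}: user preference has drifted to {user_preference:.2f}")

# The engagement metric reads this drift as success: the user increasingly
# "prefers" exactly what the system wanted to serve all along.
```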
We must be aware of this risk. If you prefer A to B, and are made to prefer B, on what grounds is that permissible? As Russell notes, would it ever make sense for someone to choose to switch from preferring A to preferring B, given that they currently prefer A?
This point actually runs very deep and a lot more philosophical thought needs to be deployed here. If we can build machines that can get us what we want, but we can also build machines that can change what we want, then we need to figure out an answer to the following deeply thought-provoking question, posed by Yuval Noah Harari at the end of his book ‘Sapiens’: ‘What do we want to want?’ There is no dismissive slogan answer to this problem.
What ought intelligence be for?
In the present context we are using ‘intelligence’ to refer to the operation of machines, but in a mid-2018 blog I posed the question: what ought intelligence be used for? The point is that while we are now debating how we ought to deploy AI, we should also ask which uses of other kinds of intelligence are permissible.
The process of developing and confronting an intelligence other than our own is cause for some self-reflexive thought. If there are certain features and uses of an artificial intelligence that we wouldn’t permit, then how are we justified in permitting similar goals and methods in humans? If Russell’s claim that we should want altruistic AI has any force, then why do we permit non-altruistic human behaviour?
Are humans ‘human compatible’?
I put down this book agreeing that we need to control AI (and indeed we can, according to Russell, with good engineering). But if intelligence is intelligence is intelligence, then must we not also turn to humans and constrain them in the same way, so that humans don’t pursue ‘goals inside the human’ that are significantly at odds with ‘our’ preferences?
The key here is defining ‘our’. Whose preferences matter? There is a deep and complex history of moral and political philosophy addressing this question, and AI developers would do well to familiarise themselves with key aspects of it. As would corporations, as would policymakers. Intelligence has for too long been used poorly.
Russell notes that many AI practitioners strongly resist regulation and may feel threatened when non-technical influences encroach on ‘their’ domain. But the deep questions above, coupled with the risks inherent due to ‘goals in the machine’, require an informed and collaborative approach to beneficial AI development. Russell is an accomplished AI practitioner speaking on behalf of philosophers to AI scientists, but hopefully this book will speak to everyone.
Russell's assumption that "The machine’s only objective is to maximize the realization of human preferences" seems to presuppose some controversial and (in my judgement) highly implausible moral views. In particular, it is speciesist, for why should only human preferences be maximized? Why not animal or machine preferences?
One might respond that Russell is giving advice to humans, and humans should maximize human preferences, since we should all maximize our own preferences. Thus, he isn't assuming that there is anything morally special about humans, and his position is therefore not speciesist. I respond that maximizing my own preferences and maximizing human preferences are very different objectives, since there are many humans other than myself. This defence therefore rests on a mischaracterization of Russell's assumption (at least as you outlined it). Furthermore, the assumption that we should maximize our own preferences seems arbitrary and unsupported anyway.
You write that "There are some mechanics that can be deployed to achieve [an AI following the guidelines]. These include game theory, utilitarian ethics, and an understanding of human psychology."
I doubt that a utilitarian ethic is useful for maximizing human preferences, since utilitarianism is impartial in the sense that it takes everyone's wellbeing into account, human or otherwise. I also doubt that it supports the maximization of the agent's own preferences, where "the agent" is assumed to be an individual human, since human preferences have non-utilitarian features. The precise nature of these features depends on what exactly you mean by "preference", so let me illustrate the point with some sensible-sounding definitions.
(A) An agent is said to prefer x over y, iff he would choose the certain outcome x over the certain outcome y, when given the option.
This makes it tautological that agents maximize their preferences when the necessary factual information is available. However, people often behave in non-utilitarian ways even if they possess all the relevant factual information. They may, e.g., spend their money on luxuries instead of donations, or they may support factory farming by buying its products.
(B) An agent is said to prefer x over y, iff he has an urge/craving towards doing x instead of doing y. In other words, the agent would have to muster some strength of will if he is to refrain from x and do y instead.
People's cravings/urges can often lead them in non-utilitarian directions (think, e.g., of a drug addict who would be better off if he could muster the will to quit the drugs).
(C) An agent is said to prefer x over y, iff the feelings/emotions/passions that motivate him towards x are more intense, than those which motivate him towards y. The intensity is here assumed to be some consciously felt feature of the feelings.
Warm-glow giving is, by definition, motivated by our feelings/emotions. However, it usually has fairly little impact upon aggregate happiness, so utilitarianism doesn't recommend it.
(D) An agent is said to prefer x over y, iff he values x more than y.
This definition prompts the question of what "valuing" refers to. One possible answer is to define "valuing" as in (C), but (C) has already been dealt with. Another option is the following.
(E) An agent values x more than y, iff he believes x to be more valuable.
This would make preference-maximization compatible with utilitarianism, insofar as the agent believes in utilitarianism and lacks beliefs that contradict utilitarianism. However, it would also be compatible with any other moral theory whatsoever, so long as we make the analogous assumptions on behalf of that theory.
It seems worth adding two more comments about (E). First, unlike (A), (B) and (C), it introduces a rationale for maximizing one's preferences. We cannot act on an unknown truth, but only on what we believe to be true. Thus, we must act on our moral beliefs, rather than on some unknown moral truth.
Second, (E) seems like a bad analysis of "preference", for although moral views have some preference-like features (specifically, they can motivate behavior), they also have some features that are more belief-like than preference-like. They can, e.g., serve as premises or conclusions in arguments, one can have credences in them, and they can be the subject matter of questions.
I agree that suffering is bad in all universes, for the reasons described in https://www.lesswrong.com/posts/zqwWicCLNBSA5Ssmn/by-which-it-may-be-judged. I'd say that "ethics... is not constituted by any feature of the universe" in the sense you note, but I'd point to our human brains if we were asking any question like: