Note: This post was sparked by some replies to my Sept 15 EA Forum post about ‘The religion problem in AI alignment’. There, I argued that AI safety researchers need to explicitly consider religious values when they think about value alignment. I realized that I should have included an argument about why alignment research should consider distinctive types of human values (e.g. religious values), rather than treating all values as homogeneous elements that can be incorporated into some master utility function. This post develops a preliminary argument about the importance of considering many distinct types of human values; it should probably be read before the religion post.
Introduction
I worry that a lot of AI alignment research seems to rely on a dangerously simplistic view of human values, and that this will undermine our ability to safely align AI systems with human values.
The simplistic view seems to arise from taking expected utility theory too seriously as a model of human values and preferences. It’s true that we can often describe human decisions, post hoc, at a rather abstract and generic level, using the language of utility theory, Bayesian rationality, and statistical decision theory. This rather abstract and generic way of modeling human values has often been useful in the fields of economics, game theory, rational choice, consequentialist moral philosophy, and reinforcement learning theory.
However, within standard utility theory, there’s no fundamental difference between a consumer’s food preference for a certain flavor of jelly bean and a Muslim’s sacred taboo against eating pork. Utility theory doesn’t distinguish very well between someone who’s a vegan for health reasons, someone who’s a vegan for ethical reasons, and someone who has food allergies to animal proteins. Utility theory doesn’t distinguish very well between someone who’s polyamorous based on libertarian principles, someone who’s polyamorous so they can conform to their Burning Man peer group, and someone who just happened to inherit genes for a high degree of ‘sociosexuality’. Utility theory can’t even distinguish very well between deontologists, consequentialists, virtue ethicists, and religious fundamentalists.
With all due respect to utility theory as a normative theory, if we take it as a descriptive account of human psychology, it seems blind to the heterogeneity of human value types. It can’t model the complex architecture of human values. It can’t model the differences between values that are implemented in different psychological mechanisms such as reflexes, emotions, motivations, cognitive biases, learning biases, conscious preferences, implicit preferences, virtue signals, social norms, political attitudes, sacred values, and taboos. It can’t understand that specific human values fit into categories of value types that have different implications for learning, generalization, inference, and decision-making.
If these differences in types of values matter, at all, in any way, then AI alignment might need a richer model of human values than standard utility theory can offer. Or it might not. We can’t really tell until we think seriously about the heterogeneity of human value types – the whole range of different types of preferences, emotions, motivations, norms, and goals that we might want AI systems to become aligned with. (Note that I’m not talking about heterogeneity of values across individuals, groups, cultures, or historical eras; I’m talking about heterogeneity of different types of values within each human individual – e.g. the difference between a food preference, a religious taboo, and a sexual kink.)
To get a sense of that range of human value types, it might be helpful to start by considering the range of academic disciplines that study different kinds of human values.
How many kinds of human values are there?
Ever since Herbert Spencer published The Principles of Psychology in 1855, psychologists have studied human values, preferences, emotions, and motivations. We’ve had more than 160 years of research on these topics, with especially fruitful eras around 1870-1920 (after Darwin, before Behaviorism), and 1970 to now (after Behaviorism gave way to richer new fields like cognitive science, emotion research, and evolutionary psychology).
Currently, at least 30 subfields of psychology study different types of values, in different domains, that vary across individuals, groups, and contexts in different ways:
- Affective neuropsychology studies the neural basis of values, emotions, and motivations
- Applied behavior analysis studies how values guide learning and reinforce behavior
- Behavior genetics studies the heritability of values, emotions, and motivations
- Clinical psychology studies differences in values and emotions across mental disorders
- Comparative psychology studies differences in values and motivations across species
- Consumer psychology studies consumer preferences
- Cross-cultural psychology studies differences in values across cultures
- Developmental psychology studies changes in values across the life span
- Economic psychology studies economic values and motivations
- Educational psychology studies educational values and learning motivations
- Evolutionary psychology studies evolved values, emotions, and motivations
- Forensic psychology studies behaviors that violate legal values
- Food psychology studies food preferences
- Health psychology studies health values
- Human factors studies values that are important in human/machine interaction
- Industrial/organizational psychology studies work and career preferences
- Judgment and decision-making research studies rationality values
- Media psychology studies how mass media and social media influence values
- Moral psychology studies moral values
- Personality psychology studies differences in values across personality traits
- Psycholinguistics studies communication values and norms
- Social psychology studies social values, emotions, and motivations
- Sex research studies sexual preferences, motivations, and kinks
- Sports psychology studies athletic values and competitive motivations
- Physiological psychology studies the neural and hormonal basis of values and emotions
- Political psychology studies political values
- Positive psychology studies general well-being and flourishing
- Psychology of art studies aesthetic values and creative motivations
- Psychology of religion studies sacred religious values and spiritual motivations
- Psychopharmacology studies how drugs influence values, emotions, and motivations
Each of these fields includes thousands of researchers, tens of thousands of journal papers, hundreds of academic books, and dozens of textbooks. Each is taught in thousands of psychology courses around the world. And each discovers new things about distinctive types of values.
Beyond psychology, many other academic fields study human values, preferences, and motivations. These include social sciences such as anthropology, economics, political science, and sociology. They include humanities such as philosophy, history, literature, art history, and ethnic studies. They include professional schools such as law schools, medical schools, business schools, and religious schools. Each of these also includes thousands of researchers, classes, and books. In fact, very few academic fields have nothing to say about human values, and nothing to contribute to our understanding of human values.
Do AI alignment researchers really need to learn about the heterogeneity of human value types?
If you’re an AI safety researcher, you might be thinking, dude, do you really expect us to master 30 subfields of psychology and dozens of other academic disciplines just to make sure that our AI systems can align with this alleged variety of human value types? Is this just a stratagem for enforcing a Long Reflection, a grand detour through the social sciences and humanities, that would delay AI research by a couple of centuries?
You might also be thinking, sure, academic fields often split into more specialized subfields so people can publish stuff and teach new courses to get tenure. There are academic incentives to hype the distinctiveness of one’s field, and to make it seem relevant to students and funders by emphasizing its relevance to understanding human values and concerns.
Why should we take all this value-scholarship and value-science seriously in AI alignment concerning human values? Why do we need anything beyond the two standard theoretical foundations of Effective Altruism -- normative consequentialist moral philosophy and expected utility theory -- to descriptively understand the variety of human values?
Couldn’t AI systems just extract heterogeneous value types from human behavioral data?
Let me steel-man the case against AI alignment research needing to pay any attention to previous research on human values. Maybe a sufficiently powerful algorithm can reinvent everything that academics have learned about human value types over the last few centuries.
Imagine AI engineers develop a ‘deep value learning’ algorithm. You can feed it a firehose of data about human preferences and behaviors. You feed the system every book ever written in any language. Every EA blog. Every 80,000 Hours podcast. Every movie. Every YouTube video. Every PornHub video. Every surveillance video. Every social media post. Every Google query. Every consumer purchase. Every election vote. Every political speech and religious sermon ever recorded.
Call this the ‘digital value corpus’. It includes a few thousand exabytes of data. It will serve as the input data for value learning.
The algorithm does some kind of colossally powerful unsupervised learning that can statistically extract the complex architecture of human values given this value corpus. Maybe it doesn’t need any supervised learning or reinforcement learning. Maybe the heterogeneity of human value types is all there, latent in the data, ready to be extracted, modelled, and used for alignment.
I suspect that, if human values have causal effects on human behavior, communication, and interaction, then any sufficiently large and rich ‘digital value corpus’ of human behavior, communication, and interaction will include enough latent patterning that a superintelligent AI could – in principle – extract and model the full architecture and heterogeneity of our human values. (This would be functionally equivalent to superintelligent aliens inferring the entire value architecture of humanity just by observing everything that happens on our planet – including all visible behavior and all electronic traffic.)
Maybe the AI can reinvent every insight into human values that has ever emerged from scholars and scientists over the last few millennia. So, maybe we can ignore all of their work. After all, the scholars and scientists were just observing and abstracting about human values given the tiny slices of human behavioral data that they could access (including their fallible introspections), given their cultural biases, ideologies, and top-down models. Why would we trust their human insights more than an AI’s statistical model abstracted from a much larger and more comprehensive value corpus? When modelling the heterogeneity of human values, couldn’t an AI outperform human scientists, in the same way that AlphaGo, fed with a huge historical corpus of previously played Go games, could outperform human Go masters?
One problem is, how would we know whether the AI had modelled human values accurately? Could the AI explain our value categories and value architecture to us in a way that we would understand? Would its models of our values be intelligible and interpretable? Could it really fold those values into its own decision-making systems in a way that we could trust? Or would we need to give it a lot more feedback through supervision or reinforcement learning, and a lot more tests to make sure its value architecture was both well-modelled and well-aligned with ours?
Why would heterogeneous human value types matter to an AI?
To think more clearly about whether we need to pay attention to the heterogeneity of human value types in AI alignment, we could ask, what computational difference would value type heterogeneity really make to the AI system? Why put specific values into categories that correspond to our human value types?
Maybe the AI just needs to know how much importance or weight we attach to each value or preference, and that’s all it needs to make decisions aligned with our preferences, using standard expected utility calculations. Apart from the decision weight attached to the value, why would it matter what type of value it is – e.g. whether the value is a food preference, a religious taboo, a sexual kink, an aesthetic taste, or a career aspiration?
One simple thought experiment is to imagine a working mother trying to teach a new domestic AI system her values and preferences. She makes verbal statements about things she likes and doesn’t like, and the AI listens and learns. She likes chocolate croissants, she likes cobalt blue silk dresses, she likes for her baby to be safe, she likes promotions at work, she likes shibari ropes, she likes criminal justice reform, and she likes going to Mass on Sundays. Now, would it add any useful information if she said – or if the AI inferred – that these ‘likes’ represent different types of values – specifically, a food preference, a fashion preference, a parental safety preference, a career ambition, a sexual kink, a political value, and a religious value?
In other words, would knowing that a specific human preference falls into a particular value type help an AI do more effective learning, generalization, inference, and decision-making? Well, 60 years of cognitive psychology research suggest that doing better learning, generalization, inference, and decision-making is the entire point of putting things into mental categories. Categorization helps computation. Therefore, good categorization of value types should help AI alignment. That’s the general argument for why AI systems should pay attention to value types, and why AI alignment researchers should too.
But that’s pretty vague. Are there any more specific arguments for why AI systems would work better if they understood different value types? Here are a few specific ways that value types differ from each other in ways that AI systems might find computationally relevant.
Computationally relevant differences across value types
Tradeoffs. People tend to treat some value types (e.g. religious commandments, wedding vows) as ironclad deontological imperatives that are not open to cost/benefit reasoning or tradeoffs, whereas they treat other value types (e.g. Netflix movie choices, hotel preferences) as relatively trivial, transient, and superficial, and fully subject to tradeoffs against other values. If an AI understands the typical degrees of tradeoff flexibility within each value type, it’s likely to be more aligned with human values.
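To make the tradeoff point concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not a proposal for how value learning should actually be built. The value names, weights, satisfaction scores, and the `NON_TRADEABLE_TYPES` set are all hypothetical. It contrasts a flat model, where every value is just a weight in a sum, with a typed model, where some value types act as constraints that rule options out before any weighing happens.

```python
from dataclasses import dataclass

# Hypothetical value types that this sketch treats as non-negotiable.
NON_TRADEABLE_TYPES = {"religious", "vow"}


@dataclass(frozen=True)
class Value:
    name: str
    value_type: str  # e.g. "food", "religious"
    weight: float    # illustrative importance weight, not measured data


def weighted_score(option, values):
    """Flat utility model: every value is just a weight to be summed."""
    return sum(v.weight * option.get(v.name, 0.0) for v in values)


def violates_constraint(option, values):
    """True if the option scores negatively on any non-tradeable value."""
    return any(
        v.value_type in NON_TRADEABLE_TYPES and option.get(v.name, 0.0) < 0
        for v in values
    )


def choose_flat(options, values):
    """Pick the option with the highest weighted sum, ignoring value types."""
    return max(options, key=lambda name: weighted_score(options[name], values))


def choose_typed(options, values):
    """Drop options that violate a non-tradeable value, then weigh the rest."""
    allowed = {
        name: option
        for name, option in options.items()
        if not violates_constraint(option, values)
    }
    if not allowed:
        return None  # every option breaks a constraint
    return max(allowed, key=lambda name: weighted_score(allowed[name], values))


values = [
    Value("tasty_meal", "food", 2.0),
    Value("cheap_meal", "food", 2.0),
    Value("no_pork", "religious", 1.5),
]

# Each option maps a value name to how well it satisfies that value (-1 to 1).
options = {
    "pork_special": {"tasty_meal": 1.0, "cheap_meal": 1.0, "no_pork": -1.0},
    "lentil_soup": {"tasty_meal": 0.2, "cheap_meal": 0.1, "no_pork": 1.0},
}

flat_choice = choose_flat(options, values)    # "pork_special"
typed_choice = choose_typed(options, values)  # "lentil_soup"
```

In the flat model, two ordinary food preferences add up to outweigh the taboo, so the pork dish wins. In the typed model, the taboo removes that option entirely, whatever the other weights are. Raising the taboo's weight would fix this one case in the flat model, but any finite weight can still be outweighed by enough small benefits.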
Correlations across value types. Some value types allow stronger inference about other values in other categories. For example, many traditional religions have surprisingly strong food taboos that create bridges between religious values and food preferences – so if the AI knows that someone is an Orthodox Jew or a devout Muslim, it can make inferences about their likely food preferences, but not necessarily about their movie preferences. On the other hand, visual aesthetic preferences (e.g. for Art Nouveau architecture) may not correlate very much with musical preferences (e.g. for Nordic folk metal). AI systems that understand the architecture of correlations across value types might make more accurately calibrated inferences across specific values.
Virtue signaling. Some value types tend to involve a high degree of authenticity – a high correlation between stated and revealed preferences, and a low degree of deception, hypocrisy, or virtue signaling. We don’t tend to virtue signal very much about travel logistics, such as whether we prefer a window or an aisle seat on flights, or whether we prefer an ocean or a garden view in hotels. Other value types tend to involve a lot more public signaling of socially rewarded values, but a lot more private hypocrisy. Political and religious virtue-signaling is famously important to humans. AI systems might make better predictions about our values and preferences if they understand the typical degrees, types, and channels of virtue signaling involved in each value type. They might also understand which of our hypocrisies can be quietly noted, but should not be mentioned out loud – lest we feel embarrassed, angry, and outraged at the AI.
Heritability. All values studied so far in behavioral genetics show some degree of heritability. Genetic differences between people within a culture account for some of the phenotypic differences in their values within that culture. Siblings tend to be more similar in their values than cousins do, for partly genetic reasons. But different value types tend to have different heritabilities. If an AI understands this, it can make better generalizations and inferences across relatives about their likely values. If the AI also has access to genomic data for the people it’s interacting with, it could use polygenic scores to infer some of their value types more easily than other value types.
Cultural transmission pattern. Many values are culturally transmitted – ‘vertically’ across generations, and/or ‘horizontally’ within generations. But different value types tend to be transmitted in different ways that allow different predictions about their commonality across people, families, and subcultures, their likely longevity over time, whether people treat them as moral imperatives or whimsical preferences, etc. Compare the spread of food cuisines, art styles, clothing fashions, mating norms, political ideals, and religious rituals. It might help an AI to understand which value types tend to have which kind of cultural transmission dynamics.
Lifespan development. Some value types tend to change quickly as humans grow up, go through different life stages, and get older; others tend to be more stable over time. Aversions to some ‘disgusting’ foods might get locked in by age 10, whereas preferences for certain cuisines might continue to develop throughout middle age. Sexual orientation might become relatively stable by age 20, whereas preferences for specific traits in a mate might change year by year. Religious beliefs might go through a period of instability in adolescence, and then settle down after marriage. It might help an AI to understand which value types are likely to change over each life-stage.
OK. Those are just six kinds of differences across value types that might be computationally (and ethically) relevant to AI systems. There are probably many other differences that could be explored in the future.
When testing AI alignment, we want to make sure that an AI system is aligned across all relevant types of human values. If we don’t explicitly list the value types that matter, we might overlook some important categories. And if value types involve different kinds of tradeoffs, correlations with other values, virtue signals, heritabilities, cultural transmission patterns, and lifespan development patterns, then an AI that looks aligned on some value types might not be aligned on other value types that haven’t been tested yet. Just because we can train an AI system to learn and embody our food preferences and movie preferences does not mean that it can learn and embody our sexual, political, or religious preferences. (This was one motivation for me to write my EA Forum post on religious values.) So we have to make sure we explicitly test performance and safety across all the value types. This is an important, practical, methodological issue in AI safety.
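The testing point above can be sketched as a simple coverage check. This is a toy illustration under loud assumptions: the list of value types, the scores, and the 0.9 threshold are all made up, and real alignment evaluation would not reduce to one number per value type. The sketch only shows why an explicit list matters: a value type that was never tested shows up as a gap instead of silently counting as aligned.

```python
# Hypothetical checklist of value types; a real list would be much longer.
VALUE_TYPES = ["food", "entertainment", "career", "sexual", "political", "religious"]


def coverage_report(results, required_types=VALUE_TYPES, threshold=0.9):
    """Split required value types into passed, failed, and untested.

    `results` maps a value type to an alignment score in [0, 1].
    Both the scores and the threshold are illustrative assumptions.
    """
    report = {"passed": [], "failed": [], "untested": []}
    for value_type in required_types:
        if value_type not in results:
            report["untested"].append(value_type)
        elif results[value_type] >= threshold:
            report["passed"].append(value_type)
        else:
            report["failed"].append(value_type)
    return report


def is_aligned_across_types(report):
    """Pass only if nothing failed and nothing was left untested."""
    return not report["failed"] and not report["untested"]


# Made-up scores: only two value types were evaluated at all.
results = {"food": 0.97, "entertainment": 0.95}
report = coverage_report(results)
overall = is_aligned_across_types(report)  # False: four types are untested
```

Here the system passes on every value type that was tested, yet the overall check still fails, because four listed value types were never evaluated.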
For the moment, I hope that this essay has helped make the case that AI alignment needs to take seriously the rich heterogeneity of human value types, the challenges of modeling our value architecture, and the importance of making sure that AI systems are aligned across all the kinds of values that really matter to us.
I really appreciate the series of posts you have been making! Keep them coming!
Ekka -- thanks! I have a couple more in the pipeline for next week. Will try to include a clearer tl;dr/abstract up front.