Cross-posted to the Alignment Forum/Less Wrong
Introduction
How would you go about scientifically studying aliens? Arik Kershenbaum’s The Zoologist’s Guide to the Galaxy proposes using evolutionary thinking to uncover constraints on how alien species could evolve. One of his most interesting points is that evolution constrains function far more than form, because function depends significantly less on the details of the environment. Hence we should expect crisper answers to “How would aliens behave?” than to “What would aliens look like?”. And in the course of his book, he gives the best answer he can find to the former question.
So when confronted with the question of how to study something he couldn’t gather data on, Kershenbaum leveraged analogies to biological systems he could study (and had studied), along with the underlying constraints imposed by the mechanisms of natural selection.
On a completely unrelated note, the new summer fellowship Principles of Intelligent Behavior in Biological and Social Systems (PIBBSS) (funded by the LTFF) aims at creating valuable AI alignment research through studying analogies to many complex systems (evolution, brains, language, social structures…). Fellows will have graduate research experience in fields studying such systems, working on a concrete alignment project in collaboration with an established alignment researcher. The fellowship will run during all of Summer 2022.
The point of this post is to introduce this fellowship, explain the reasoning behind it and give more concrete details about how it will go. Note that I’m not an organizer of this fellowship, I’m just assisting with the writing of this post; credits for the ideas and arguments should go to Nora Ammann and TJ, the organizers of the fellowship.
Analogies as General Epistemic Strategies for Alignment
As I’ve written elsewhere, alignment cannot directly leverage most epistemic strategies and approaches used in Science and Engineering, because it’s about solving a problem that doesn’t exist yet on a technology we still have to invent.
One epistemic strategy that survives this major problem is the leveraging of analogies with existing biological or social systems that implement complex or intelligent behavior. Consider how a wide variety of such systems (from biology, physics, linguistics and other fields) exhibit similar properties: adaptation, robustness, goal-directed behavior, learning, embeddedness, modularity, phase transitions, and more. Since AI research focuses on mechanisms that lead to complex and intelligent behavior with many of these properties, careful analogies with these complex systems may allow us to transfer knowledge about all these behaviors and properties to the study of alignment and AGI.
Also note that the other main epistemic strategy used in alignment, figuring things out from first principles, can and often does take inspiration from other existing systems like evolution, brains and languages.
Examples of Successful Analogies in Alignment
If analogical thinking is such a valuable epistemic approach to alignment, we should find ample examples of valuable alignment research using such analogies. And that’s indeed what we see.
- Risks from Learned Optimization by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant explicitly uses biological evolution of humans as an example of/inspiration for mesa-optimizers. Evolution here is the search process, with a base-objective of increasing fitness and propagation of genes, but the learned models (humans) end up doing search for different goals.
- What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) by Andrew Critch explicitly builds on the concept of fields from sociology, and the structure and agency debate in the social sciences about the place of social structure vs personal choices in human behavior.
- Value loading in the human brain: a worked example by Steve Byrnes is one of many, many posts where Steve uses neuroscience as a source of analogies for alignment. Note that here the analogy is more literal and doing more work than in the previous examples, since Steve thinks that brain-like AGI might be the first kind to be created in the near future.
- Public Static: What is Abstraction? by John Wentworth (and really all his work on abstraction) uses many examples (transistors and logic gates, statistical mechanics and temperature, maps and city streets) to direct and explain his deconfusion of abstraction. This is a general trend, as John’s method starts with finding many different examples from distinct fields and thinking about what they have in common.
Of course, none of these works completely exploit the analogy, nor do they encompass all analogies relevant for alignment. The previous list just serves to illustrate that analogical thinking is an integral part of many examples of current alignment research.
The Problem: Difficulty of Epistemic Translation
If analogies to other systems already abound in alignment, what is the point of the fellowship?
Here, the concept of epistemic translation, as discussed by Nora Ammann, might give us a better idea of what it takes to make fruitful analogies. Fundamentally, linking system A with system B requires the creation of a translation between the two, a bridge faithful enough to let us transform insights about system B into ones about system A (for example, the analogy between magic and pizza fails this condition).
Exploiting analogies faithfully thus involves:
- Understanding the other field in detail
- Understanding alignment in detail
- Being able to see the powerful concepts and ideas of the other fields that share similarities with alignment and AGI
- Thinking through the analogy in enough detail to check that it holds and can tell us something significant about alignment
- Understanding where and how a given analogy “breaks” and what that can tell us about alignment
Currently, alignment researchers are supposed to do all of that, including building deep expertise in the field they’re drawing from. With the exception of Steve Byrnes, who did basically learn neuroscience for his research, most people don’t have the time to do that. As a consequence, they don’t find all the valuable analogies for their work, or they let them die in a drawer, or, if they do find and explore them, they might do so insufficiently or badly.
On the other hand, most people in fields that are ripe for analogies with alignment don’t know about the latter and don’t have any incentive to work on it. It’s also hard to get up to speed on alignment, especially when coming from a field outside computer science.
So alignment researchers want more varied and detailed analogies, and experts in other fields have the ability to help with these analogies and provide tools for studying the systems in question (and some could be interested in alignment as a challenging problem or a cause), but the current incentives and constraints make it hard for the two sides to interface.
The PIBBSS Fellowship exists to bridge this gap by providing an institutional context for these collaborations.
Proposed Solution: Creating Institutional Context for Collaborations around Analogies
For the fellowship, alignment researchers get to propose projects related to analogies with biological or social systems that they would like to explore. Fellows with expertise in the corresponding field receive funding for the duration of the fellowship (12 weeks in Summer 2022) to collaborate with the alignment researcher on exploring the analogy and what it can bring to alignment.
Which fields are most promising? Fields don’t seem like the right granularity for discussing promising analogies here. The complex systems being used as analogues fit better (and they can be studied from different angles by different fields). Recall that for epistemic translation and analogies to be useful, there need to be insights, concepts and epistemic strategies about the analogous complex system to transfer over. So the systems most ripe for this work are those that have been studied enough to accumulate a long tradition of results and appropriate epistemic hygiene, and that are underrepresented in alignment relative to their expected density of insight.
This leads us to a tentative list that currently includes:
- Evolution and Ecology
- Brains
- Languages and Media
- Social Structures and Institutions
- Engines
That said, the fellowship is open to other complex systems and fields that may have been overlooked so far but share the properties that we care about (i.e., insightfulness of the existing literature and associated community, and relevance to AI alignment). In practice, the promise of a given analogy is evaluated at the level of specific project proposals more than at the level of entire disciplines.
Lastly, in some cases, the fellowship is open to epistemic transfer towards topic areas that do not fall under AI safety or governance, narrowly construed. Examples include relevant topics on digital and emulated minds, advanced institutional design and collective intelligence, and industrial and scientific automation and progress.
Pre-mortem: what could go wrong?
This is a nice story, but let’s ask ourselves the important question: how could it fail?
First, even if new analogies result from this fellowship, there is a risk that they are shallow, at best useless and at worst confusing. An example of a condemnation of a class of such analogies is Yudkowsky’s criticism of biology-based timelines.
Proposed solution: In part, this will come from having experts of the other field provide enough details to reveal the shallowness of the analogy. And in cases where the core of the issue comes from understanding the mechanisms behind AGI and how it will appear, the alignment researchers involved should be able to catch it eventually. The mentor-fellow pair thus represents the first line of defense against epistemic pollution, and the wider epistemic communities in which they are embedded provide a further source of feedback and scrutiny. All in all, the focus on analogies and their non-shallowness in this fellowship should increase the scrutiny enough to catch most of the shallow proposals.
Another issue comes from the difficulty of distinguishing valuable/deep analogies from useless/shallow ones at a glance, before investing a lot of work in them and potentially wasting time.
Proposed solution: The fellowship addresses this problem by letting (epistemic) demand drive (epistemic) supply. Concretely, this means that alignment researchers (and not fellows with expertise in other fields) propose projects according to how valuable they expect them to be. Thus the current proxy of the expected value of a given analogy is whether or not a given alignment researcher is sufficiently excited about a project to want to invest time in mentoring it.
The time constraints of the fellowship also privilege exploration, which is the main way to find out about the value of each research direction. Only a small fraction of projects needs to turn into fruitful research agendas to make up for many failed attempts.
Lastly, maybe interdisciplinary research between alignment researchers and experts from other fields is just too hard and fraught with miscommunication to work in most cases.
Interdisciplinary research is hard, and so is doing good research in general. The purpose of the fellowship is to find out more about these potential issues and solve them as well as possible.
Details of the Fellowship Program
From the website of the fellowship
The fellowship is designed for individuals with graduate-level research experience, or equivalent, in their domain of expertise who are motivated by the mission of making AI systems safe and aligned.
Between June and August 2022, fellows will work on selected projects at the intersection between the fellow’s field of expertise and AI alignment and governance. Each fellow will work in close collaboration with a mentor who will help them facilitate the domain interface.
Fellows, mentors, and selected guests will meet at two multi-day, in-person retreats held in Europe to learn about AI alignment, complex systems, epistemic challenges of interdisciplinary work, and more. Throughout the summer, fellows will benefit from a diverse program consisting of regular talks by external speakers, social events, and personal support sessions with program facilitators.
Fellows will receive a stipend of 3,000 USD per month and are expected to work full-time on their projects over the course of the fellowship, though exceptions may be possible.
Fellows can work from anywhere in the world or participate in a local residency. Any travel costs, within reason, will be reimbursed by the program.
Appendix: Sample of project proposals
The below sample of project proposals is meant to give readers a taste for the types of projects PIBBSS is hoping to facilitate.
Biodiversity and Heterogeneity in Energy Flows
Source Domain: Systems Ecology
Topic Summary:
A commonly discussed puzzle in ecology is related to the latitudinal distribution of biodiversity. A number of scholars have proposed that this is related to metabolism and the amount of energy flowing through the ecosystem. (Brown, James H., Why are there so many species in the tropics?. Journal of Biogeography 2013) An additional observation we might make is that in energy-rich ecosystems, such as tropical rainforests, where we encounter higher biodiversity, we also find a large number of organisms engaging in relatively simpler forms of energy consumption. Whereas, in energy-scarce ecosystems, there are fewer species and several organisms amongst them exhibit relatively more general intelligence in terms of their ability to source food and energy.
There has been a debate in the last few years regarding whether we should anticipate artificial agents with general intelligence or ecosystems of specialized services. To inform this debate, we want to understand:
- Whether the observation about the relationship between specialized vs general energy consumption strategies and the energy-richness of ecosystems generalizes to a wide range of other biological ecosystems, e.g. deserts, alpine areas, marine ecosystems, etc.
- Whether existing formal models of the relationship between total energy flow, metabolic rates and biodiversity are helpful in modelling the degree of specialization of energy sourcing strategies at the organism level.
- Whether these models teach us something about how the presence of economic incentives and/or compute availability influences whether specialized AI services will be more efficient than integrated agent-like systems.
[h/t Jan Kulveit]
Basins of Robustness in Search Spaces
Source Domain: Evolutionary Biology
Topic Summary:
Within evolutionary theory, there are two approaches to explaining robustness observed in biological systems. The first is that random search is likely to find basins of robustness simply because such basins occupy significant probability mass. The second approach argues that robustness is selected for by evolution as a response to mutations and environmental perturbations. (Wagner, A., Robustness and Evolvability in Living Systems. Princeton University Press 2005).
Better understanding of the relative causal roles played by these phenomena can help us in building better models for the study of robustness and corrigibility in AI.
For example, we may ask:
- What are the core confusions and disagreements between these explanatory approaches, and what are conventional justifications used within the evolutionary biology literature to distinguish them?
- In the case of the ‘basins of robustness are large’ approach, are there any structural justifications used within the evolutionary biology literature to posit the large-ness of these basins? If not, can we uncover the justifications implicit in the works of biologists and philosophers of biology who have written on this subject?
- Finally, can we synthesize the key insights of the field into some formalisms that can help us better model how evolutionary search encounters basins of robust phenotypes? Could such formalisms help us distinguish between different kinds of parametric search trajectories and their distinct likelihoods of encountering such basins? And how can we design training methods of ML systems that can look for robust and corrigible models, using these insights?
[h/t TJ]
Institutional foundations of Linguistic Innovation
Source Domain: Sociolinguistics
Topic Summary:
Language, its evolution, and its current usage within society might limit which novel concepts can be acquired and become broadly recognized. Participants in a linguistic community experience agency to use language in innovative ways (‘creativity’ in Chomsky 1965), and therefore also exert influence over how linguistic affordances (concepts, vocabularies, etc.) evolve over time. Often such creativity is also built on top of existing morphological and lexical resources (‘productivity’ in Hockett 1958, Bauer 2001). (Also see: Expanding the Lexicon, eds. S Arndt-Lappe et al. 2018)
These forms of linguistic innovation and evolution, however, are also balanced by evolutionary pressures that help maintain reasonable levels of lexical and semantic stability in the language, allowing language to be useful for coordination. The generation, diffusion and autoregulation of linguistic innovation can therefore also be seen as being mediated by cultural and institutional factors. By better understanding the different factors that shape linguistic innovation and evolution, we can both a) better reflect on the role played by the differential deployment of Large Language Models (LLMs) in the near term, and b) better understand which of these dynamics can be extrapolated to understand the linguistic and cognitive competencies of future AI systems. Some specific questions of interest might include:
- What implications do the different frameworks for studying linguistic evolution have for modeling the rate of evolution and ways to measure it? Can they help model growth rates for linguistic competencies?
- Can existing theories of asymmetric agency over linguistic and conceptual affordances (Fricker 2007, Anderson 2012) provide insights, or be generalized, to perform normative analysis of the evolutionary structure?
- Are there theories that identify signs of stagnation in linguistic evolution? How do these theories see the relative role of cognitive, cultural and institutional factors? How would they change when dealing with linguistically competent artificial systems?
Social learning and the limitations of the RL framework
Source Domain: Cognitive science
Topic Summary:
Reinforcement learning is the dominant framework at the moment in the psychology and neuroscience of human and animal behavior. In thinking about digital minds, that is convenient because, if it's true that humans and animals are basically reinforcement learners, it follows that artificial RL systems are (in some sense) things of basically the same kind. It also seems to influence thinking about agency and motivation in the alignment space.
An interesting question to us is thus: what are the limitations of the RL framework for explaining human behavior? In particular, there exists preliminary evidence (Ho et al, 2017) for a limitation of RL in the area of social learning from evaluative feedback, which would seem particularly relevant to alignment.
[h/t Patrick Butlin]
Thanks for this interesting post! I am probably not well-suited to apply for the fellowship. However, I was interested in the ideas you mentioned, so I wanted to share some ideas I had regardless. They might not be useful, but it was helpful for me to get them out of my head!
Behaviour science
I work in this space, and much of the theory seems very relevant to understanding non-human agents. For instance, I wonder if there would be value in exploring whether models of human behaviour such as COM-B and the FBM could be useful in modelling the actions of AI agents. If it is useful to theorise that a human agent's behaviour only occurs if they have sufficient motivation and ability and a trigger to act (as per the FBM), it might also be useful to do so for a non-human agent.
Persuasion
I used to be interested in this (it is basically attitude and behaviour change).
I wonder if the idea of persuasion and underlying theory is useful for understanding how AI agents should respond to information and choose which information to share with other agents to achieve goals (i.e., to persuade). If so, then communications/processing models such as McGuire, Shannon-Weaver, or Lasswell may be useful.
Related to that, I wrote a (not very good) paper outlining the concept of persuasion a long time ago, which finished with:
"From a philosophical perspective, we recommend that future research should consider if non-human agents can not only persuade but can also be persuaded. Research already explores how emerging technologies, such as artificial intelligences, may be human-like to varying extents (see Bostrom, 2014; Kurzweil, 2005; Searle, 1980). If we can believe that non-biological beings might be conscious and human-like (Calverley, 2008; Hofstadter & Dennett, 1988) then maybe we should also consider whether these beings will have beliefs, attitudes and behaviours and thus be subject to persuasion?"
Systems thinking
I am still a novice in this area and what I know is probably outdated. I wonder if there could be value in drawing on concepts in systems thinking when attempting to manage AI. As an example, this model suggests 12 leverage points for systems change (based on this work). Could we model/manage an agent's behavioural outcomes in the same way?
I am interested to know what you think, if you have time. Do any of these areas seem fruitful? Are they irrelevant, or are there better approaches already in use?
I am very aware that I don't have a good understanding of how AI agents' behaviour is modelled within the AI safety/governance literature. I also don't understand exactly i) what differences there are between those approaches and the approaches used in behavioural science/social science or ii) what justifications would be needed for each of the different approaches.
Can you (or anyone else) recommend things that I should read/watch to improve my understanding?
Thanks for the thoughtful comment!
This sounds like a potentially good analogy, but one has to be careful that it doesn't rely on assumptions that only apply to humans, or to quite bounded agents.
The topic of persuasion (both by AIs and of AIs) is indeed important in alignment. There's a general risk that optimization is very easily spent on manipulating humans, whether intentionally (training an AI that actually ends up wanting to do something else, and so has reason to manipulate us) or unintentionally (training an AI such that it's incentivized to give the answer we would prefer rather than the most accurate and appropriate one).
For the persuasion of AIs by AIs, there are some initial thoughts around memetics for AIs, but they are not fully formed yet.
Don't know much about this literature, but it makes me think of more structural takes on the alignment problem, that emphasize the importance of the structure of society funneling and pushing optimization, rather than the individual power of agents to alter it.
So, as can be seen above, none of these ideas sounds bad or impossible to make work, but judging them correctly would require far more effort put into analyzing them. Maybe you should apply for the fellowship, especially for behavioral work on which you're more of an expert? ;)
It's a very good question, and shamefully I don't have any answer that's completely satisfying. But here are the next best things, some resources that will give you a more rounded perspective of alignment:
Thanks, Adam, this was very helpful! I really appreciate that you took the time to respond in such detail.
I will see what I can do for the fellowship. I might be able to convince someone else to do it and then I can collaborate with them :)