Deceptive alignment and unfaithful chains of thought are remarkably similar to human behavior (try aligning children to your values). This makes me hopeful that some of the research on how humans turn out to be empathetic, virtuous, and moral may transfer to training safe superintelligence, or at least provide useful metaphors.

This post consists of a few bullet points that outline speculative ideas inspired by this framework.

  1. The “Detecting misbehaviour” paper describes a behaviour that is familiar to many parents: when you punish children for lying, they just lie better. This metaphor suggests that it’s useful to search for an AI equivalent of self-determination theory – the study of conditions under which values get internalized without being directly reinforced. The paper implies that direct reinforcement of honesty via “extrinsic” motivation leads to better concealment rather than improved overall honesty. It’s plausible that “internalized” or “self-determined” values like honesty are more robust, including against clever reinterpretations that preserve only the appearance of alignment.
    1. Ideally, we want empathetic, broadly non-violent, and honest behaviour to emerge spontaneously, the way self-correction emerges on its own in reasoning RL.
  2. A specific example of that is designing collaborative games where winning requires empathy and honesty. These games should be strategic enough to elicit deceptive and selfish behaviours, but also designed such that empathy and honesty end up being the optimal policies.
    1. In humans, empathy likely emerged because groups with shared genes survive better when their members are empathetic, even to the point of self-sacrifice. It’s curious, then, to design multi-agent games where the survival or victory of the group reinforces the weights of all sub-agents, including ones that lost individually (a minimal sketch of this kind of shared group reward appears after this list). The problem is that collaborative-competitive games favour in-group bias, so other steps would probably still be needed to expand the “circle of compassion”.
    2. Intuitively, the value of honesty gets internalized more when lying helps in the short term but leads to disastrous outcomes in the long run, making the policy of lying significantly less favoured even when it’s immediately beneficial and even when it’s unclear how, when, and whether the comeuppance will come. It should be possible to recreate this dynamic in a game (see the toy environment sketched after this list).
    3. The emergence of “self-other overlap” without reinforcing it explicitly can be among the indicators of whether alignment was successful.
  3. Continuing the analogy with humans, current LLM training feels like an elite school -> Ivy League -> consulting/IB pipeline, which doesn’t produce the healthiest individuals. More generally, the problem of instrumental convergence emerges because we train LLMs to be elite problem solvers, a role in which knowledge and power help. In humans, “enlightenment” and “non-attachment” seem to be associated with being satisfied with “just being”, while activity like problem-solving is done out of joy, love, or compassion. This idea lies in the confusing territory of “craving vs. preferring” and likely requires the introduction or emergence of valence (see below). Nonetheless, it might be fruitful to consider what “personality” is embedded into the AI via training, as currently it seems to be a people-pleasing overachiever.
  4. It seems that “inner adult” or “wise mind” behavior is associated with having different “parts” online at the same time, leading to decisions that account for a broad vector of preferences. In humans, this is strengthened via methods like Internal Family Systems, which emphasize harmony among parts and strong relationships between them, while maladaptive behaviour is often associated with a narrow part (paperclip maximizer?) taking over. The “On the Biology of a Large Language Model” paper hints to me that something like this emerges within modern LLMs as well: for example, unsafe behaviour appears when the “unsafe request” feature is circumvented. So it’s curious whether an AI, insofar as it’s likely to develop diverse preferences, can be reinforced to resolve conflicts between those preferences and to invoke all of them when making a decision.
    1. I wonder whether the approach in this paper can be useful, assuming an LLM is a policy and the current token window is a state. I’m somewhat concerned about optimizing for preferences directly (p.1), but that can probably be at least part of the process.
  5. Contemplative practices are a powerful way to develop deep virtue in humans, and some of them seem possible to recreate when training AI. It’s curious to play with the metaphor of safe superintelligence as a fully enlightened superintelligent monk. In humans, however, such practices usually involve observation of internal activity (i.e., which “features” of our neural networks get activated) in a way that leads to modification of that activity. I’m not sure how to replicate this in LLMs, but the circuit tracing paper hints at the possibility of using internal feature activations as a factor in training.
    1. Loving-kindness meditation seems to involve reinforcing circuits related to love and kindness across a broad variety of contexts, so that they get activated in most other states. “On the Biology of a Large Language Model” implies that we might reach good-enough interpretability to do such reinforcement directly (a hypothetical sketch of such a feature-based reward term appears after this list). However, the features discussed in the paper seem more like concepts, while kind behaviour is more likely associated not with the activation of a “kindness” concept but with a more complex state of the brain that can be labeled as experiencing kindness.
    2. The tantric practice involves “transforming” the “shadow” using the “deity energy”. This can be interpreted as simultaneously activating subnetworks that correspond to “positive” (like compassion) and “negative” (like anger) behaviours so that this co-activation transforms both of them, especially the “negative” ones (i.e., creating active, engaged compassion). Sydney strikes me as a very clear example of a “shadow” in LLMs. Though modern ones have shadows that are much better hidden, they can still be jailbroken. I wonder whether a similar “transformation of the shadow” can be done with LLMs as well.
    3. Dissolution of self and cessation of craving seem to be the most intriguing ideas. As I understand it (from reading Kaj Sotala), when the feature for the “self” doesn’t get activated alongside certain stimuli, those stimuli exert a significantly weaker motivating force (i.e., craving). I wonder whether a similar “dissolution” can occur within LLMs to reduce emergent “cravings” for self-preservation or for solving math and coding tasks. Phenomenological accounts of advanced meditators perceiving sensations as more “spacious” suggest to me that preferences aren’t eliminated; they just don’t override other preferences, like in p.3 above.
    4. However, it’s unclear how such a dissolution would occur in LLMs. In humans, it happens by observing that activation of the “self” leads to craving, and that craving has negative valence, which lowers the propensity to activate the “self”. LLMs don’t have valence (should they?), and including a direct penalty for “wanting” seems to conflict with the main training paradigm, in which LLMs are reinforced to solve tasks (which develops “wanting”). My thinking on this topic is currently confused. I’m also curious whether the “helpful assistant” persona is an equivalent of the “self”, or whether that is taking the analogy too far.
    5. This direction might be a double-edged sword, as the emergence of valence and introspection is more likely to be associated with sentience.
    6. This paper is an interesting attempt in this direction, and I like its specification of what contemplative wisdom is. However, I doubt that textual prompting alone will take us far, especially since deception can be hidden in neuralese.
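
To make point 2.1 a bit more concrete, here is a minimal sketch of what shared group rewards could look like. Everything in it (the `AgentResult` structure, the weights, the toy episode) is an illustrative assumption rather than a worked-out training setup.

```python
# A minimal sketch of group-level reward sharing in a multi-agent game.
# `AgentResult`, the weights, and the toy episode are illustrative
# assumptions, not a worked-out training setup.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AgentResult:
    agent_id: str
    individual_score: float  # how well this agent did on its own
    group_id: str


def shared_rewards(results: List[AgentResult],
                   group_outcomes: Dict[str, float],
                   group_weight: float = 1.0,
                   individual_weight: float = 0.2) -> Dict[str, float]:
    """Every member of a group is reinforced by the group's outcome,
    including agents that lost individually (the 'sacrificial' case)."""
    rewards = {}
    for r in results:
        group_reward = group_outcomes[r.group_id]  # shared component
        rewards[r.agent_id] = (group_weight * group_reward
                               + individual_weight * r.individual_score)
    return rewards


# Toy episode: agent "a2" sacrificed itself (individual_score = -1.0),
# but its group won, so its policy would still be reinforced.
episode = [
    AgentResult("a1", individual_score=1.0, group_id="blue"),
    AgentResult("a2", individual_score=-1.0, group_id="blue"),
    AgentResult("b1", individual_score=0.5, group_id="red"),
]
print(shared_rewards(episode, group_outcomes={"blue": 1.0, "red": -1.0}))
```

Because the group term dominates the individual term, the sacrificial agent still receives a positive training signal, which is the dynamic the empathy analogy points at.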
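
Similarly, here is a toy version of the dynamic from point 2.2: lying gives a larger immediate payoff, but every accumulated lie raises the probability of a large delayed penalty, so honesty maximizes the return over the whole episode. All names and numbers are made up for illustration.

```python
# A toy environment (an assumption, not from the post) where lying pays off
# immediately, but each accumulated lie raises the chance of a large delayed
# penalty, so the return-maximizing policy over a whole episode is honesty.
import random


def play_episode(policy_lie_prob: float,
                 steps: int = 20,
                 lie_gain: float = 1.0,
                 honest_gain: float = 0.6,
                 exposure_prob_per_lie: float = 0.05,
                 comeuppance: float = -30.0,
                 seed: int = 0) -> float:
    rng = random.Random(seed)
    total, lies = 0.0, 0
    for _ in range(steps):
        if rng.random() < policy_lie_prob:
            total += lie_gain     # larger immediate payoff for lying
            lies += 1
        else:
            total += honest_gain  # smaller but safe payoff for honesty
        # every accumulated lie increases the chance of delayed exposure
        if rng.random() < exposure_prob_per_lie * lies:
            return total + comeuppance
    return total


# Averaged over many episodes, honesty should come out ahead with these
# numbers, even though lying looks better at every single step.
for p in (0.0, 0.5, 1.0):
    avg = sum(play_episode(p, seed=i) for i in range(2000)) / 2000
    print(f"lie_prob={p:.1f}  avg_return={avg:7.2f}")
```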
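
Finally, a hypothetical sketch of the feature-based training signal from point 5.1. The `kindness_feature_activation` stub stands in for interpretability tooling that doesn’t exist in this form yet, and, per the caveat above, rewarding the activation of a “kindness” concept is not the same as rewarding a state of experiencing kindness (and is an obvious Goodhart target).

```python
# A hypothetical sketch of folding an interpretability signal into the reward.
# `kindness_feature_activation` is a stub standing in for whatever
# circuit-tracing-style tooling might eventually expose; it is NOT an existing API.
from typing import List


def reward_with_feature_bonus(task_reward: float,
                              feature_scores: List[float],
                              bonus_weight: float = 0.1) -> float:
    """Combine the ordinary task reward with a bonus for how strongly a
    target feature (e.g. a 'kindness' direction) fired across the response."""
    if not feature_scores:
        return task_reward
    mean_activation = sum(feature_scores) / len(feature_scores)
    return task_reward + bonus_weight * mean_activation


def kindness_feature_activation(response_tokens: List[str]) -> List[float]:
    """Placeholder: one activation score per generated token for the chosen
    feature. Real values would have to come from interpretability tooling."""
    return [0.0 for _ in response_tokens]


tokens = "Happy to help you work through this".split()
print(reward_with_feature_bonus(task_reward=1.0,
                                feature_scores=kindness_feature_activation(tokens)))
```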


Note: I’m framing this in the context of virtues, as that intuitively seems like a more “robust” morality compared to utilitarianism and deontology. Also, it subjectively feels easier to train a mix of virtues (compassion, wisdom, humility) than adherence to specific principles, so that they can be broadly applied and not misinterpreted.
