
AGI will be able to model human language and psychology very accurately. Given that, isn't alignment easy if you train the AGI to interpret linguistic prompts in the way that the "average" human would? (I know language doesn't encode an exact meaning, but for any chunk of text, there does exist a distribution of ways that humans interpret it.)

Thus, on its face, inner alignment seems fairly doable. But apparently, according to Rob Bensinger, "We don't know how to get an AI system's goals to robustly 'point at' objects like 'the American people' ... [or even] simpler physical systems." Why is this so difficult? Does there exist an argument that it is impossible?

Outer alignment doesn't seem very difficult to me, either. Here's a prompt I thought of: "Do not do an action if anyone in a specified list of philosophers, intellectuals, members of the public, etc. would prefer you not do it, if they had all relevant knowledge of the action and its effects beforehand, consistent with the human legal standard of informed consent." Wouldn't this prompt (in its ideal form, not exactly as I wrote it) guard against many bad actions, including power-seeking behavior?

Thank you for the help!



7 Answers

Why is [robustly pointing at goals] so difficult? Is there an argument that it is impossible?

Well, I hope it's not impossible!  If it is, we're in a pretty bad spot.  But it's definitely true that we don't know how to do it, despite lots of hard work over the last 30+ years.  To really see why, you have to understand how AI training works at a somewhat low level.

Suppose we want an image classifier -- something that'll tell us whether a picture has a sheep in it, let's say.  Schematically, here's how we build one:

  1. Start with a list of a few million random numbers.  These are our parameters.
  2. Find a bunch of images, some with sheep in them and some without (we know which is which because humans labeled them manually).  This is our training data.
  3. Pick some images from the training data, multiply them with the parameters in various ways, and interpret the result as a confidence, between 0 and 1, of whether each image has a sheep.
  4. Probably it did terribly!  Random numbers don't know anything about sheep.
  5. So, we make some small random-ish changes to the parameters and see if that helps.
    1. For example, we might say "we changed this parameter from 0.5 to 0.6 and overall accuracy went from 51.2% to 51.22%, next time we'll go to 0.7 and see if that keeps helping or if it's too high or what."
  6. Repeat step 5, a lot.
  7. Eventually you get to where the parameters do a good job predicting whether the images have a sheep or not, so you stop.

I'm leaving out some mathematical details, but nothing that changes the overall picture.
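To make the loop above concrete, here's a minimal toy sketch in Python. Everything in it is made up for illustration: the "images" are random feature vectors, the "sheep" labels come from a hidden synthetic pattern, and the updates are literal random perturbations rather than the gradient-based updates real systems use. The step numbers in the comments refer to the list above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for labeled images: each "image" is a feature vector,
# and whether it "contains a sheep" is determined by a hidden pattern that
# the training loop never gets to see directly.
n_images, n_features = 200, 50
images = rng.normal(size=(n_images, n_features))
hidden_pattern = rng.normal(size=n_features)
labels = (images @ hidden_pattern > 0).astype(int)

# Step 1: start with random parameters.
params = rng.normal(size=n_features)

def accuracy(params):
    # Step 3: combine each image with the parameters and squash the result
    # into a 0-to-1 confidence that the image contains a sheep.
    confidence = 1.0 / (1.0 + np.exp(-(images @ params)))
    return np.mean((confidence > 0.5) == labels)

# Steps 4-6: the random parameters start out useless, so repeatedly make
# small random-ish changes and keep any change that doesn't hurt accuracy.
best = accuracy(params)
for _ in range(5000):
    candidate = params + rng.normal(scale=0.05, size=n_features)
    score = accuracy(candidate)
    if score >= best:
        params, best = candidate, score

# Step 7: stop once performance looks good enough (here, after a fixed budget).
print(f"final training accuracy: {best:.2%}")
```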

All modern AI training works basically this way: start with random numbers, use them to do some task, evaluate their performance, and then tweak the numbers in a way that seems to point toward better performance.

Crucially, we never know why a change to a particular parameter is good, just that it is.  Similarly, we never know what the AI is "really" trying to do, just that whatever it's doing helps it do the task -- to classify the images in our training set, for example.  But that doesn't mean that it's doing what we want.  For example, maybe all the pictures of sheep are in big grassy fields, while the non-sheep pictures tend to have more trees, and so what we actually trained was an "are there a lot of trees?" classifier.  This kind of thing happens all the time in machine learning applications.  When people talk about "generalizing out of distribution", this is what they mean: the AI was trained on some data, but will it still perform the way we'd want on other, different data?  Often the answer is no.
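Here's a tiny, self-contained sketch of that failure mode. The setup is entirely invented: two made-up features stand in for "a sheep is actually present" (noisy) and "the scene is a grassy field" (clean but spurious), and a plain logistic-regression classifier stands in for the trained model. It illustrates the "trees vs. sheep" shortcut, not anyone's actual training setup.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n, shortcut_correlated):
    # Feature 0: a noisy signal for "a sheep is actually present".
    # Feature 1: a clean signal for "the scene is a grassy field" -- a shortcut
    # that matches the label in training, but not in general.
    sheep = rng.integers(0, 2, size=n)
    field = sheep if shortcut_correlated else rng.integers(0, 2, size=n)
    x0 = (2 * sheep - 1) + rng.normal(scale=1.0, size=n)
    x1 = (2 * field - 1) + rng.normal(scale=0.1, size=n)
    return np.column_stack([x0, x1]), sheep

def fit_logistic(X, y, steps=2000, lr=0.1):
    # Ordinary logistic regression trained by gradient descent.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

X_train, y_train = make_data(1000, shortcut_correlated=True)   # sheep always in fields
X_test, y_test = make_data(1000, shortcut_correlated=False)    # correlation broken
w = fit_logistic(X_train, y_train)

def accuracy(X, y):
    return np.mean(((X @ w) > 0) == y)

print(f"train accuracy: {accuracy(X_train, y_train):.0%}")  # high: the shortcut works here
print(f"test accuracy:  {accuracy(X_test, y_test):.0%}")    # much lower: the shortcut breaks
```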

So that's the first big difficulty with setting terminal goals: we can't define the AI's goals directly, we just show it a bunch of examples of the thing we want and hope it learns what they all have in common.  Even after we're done, we have no way to find out what patterns it really found except by experiment, which with superhuman AIs is very dangerous.  There are other difficulties, but this post is already rather long.

Is there an argument that it is impossible?

There is actually an impossibility argument. Even if you could robustly specify goals in AGI, there is another convergent phenomenon that would cause misaligned effects and eventually remove the goal structures.

You can find an intuitive summary here: https://www.lesswrong.com/posts/jFkEhqpsCRbKgLZrd/what-if-alignment-is-not-enough

Excellent explanation. It seems to me that this problem might be mitigated if we reworked AI's structure/growth so that it mimicked a human brain as closely as possible.

Part of your question here seems to be, "If we can design a system that understands goals written in natural language, won't it be very unlikely to deviate from what we really wanted when we wrote the goal?" Regarding that point, I'm not an expert, but I'll point to some discussion by experts.

There are, as you may have seen, lists of examples where real AI systems have done things completely different from what their designers were intending. For example, this talk, in the section on Goodhart's law, has a link to such a list. But from what I can tell, those examples never involve the designers specifying goals in natural language. (I'm guessing that specifying goals that way hasn't seemed even faintly possible until recently, so nobody's really tried it?)

Here's a recent paper by academic philosophers that seems supportive of your question. The authors argue that AGI systems that involve large language models would be safer than alternative systems precisely because they could receive goals written in natural language. (See especially the two sections titled "reward misspecification" -- though note also the last paragraph, where they suggest it might be a better idea to avoid goal-directed AI altogether.) If you want more details on whether that suggestion is correct, you might keep an eye on reactions to this paper. There are some comments on the LessWrong post, and I see the paper was submitted for a contest.

I think you're on to something and some related thoughts are a significant part of my research agenda. Here are some references you might find useful (heavily biased towards my own thinking on the subject), numbered by paragraph in your post:

  1. There's a lot of accumulated evidence of significant overlap between LM and human linguistic representations; the scaling laws of this phenomenon seem favorable, and LM embeddings have also been used as a model of shared linguistic space for transmitting thoughts during communication. I interpret this as suggesting outer alignment will likely be solved by default for LMs.
  2. I think I disagree quite strongly that "We don't know how to get an AI system's goals to robustly 'point at' objects like 'the American people' ... [or even] simpler physical systems.", e.g. I suspect many alignment-relevant concepts (like 'Helpful, Harmless, Honest') are abstract and groundable in language, see e.g. Language is more abstract than you think, or, why aren't languages more iconic?. Also, the previous point (brain-LM comparisons), as well as LM performance, suggest the linguistic grounding is probably already happening to a significant degree.
  3. Robustness here seems hard, see e.g. these references on shortcuts in in-context learning (ICL) / prompting: https://arxiv.org/abs/2303.03846 https://arxiv.org/abs/2305.17256 https://arxiv.org/abs/2305.13299 https://arxiv.org/abs/2305.14950 https://arxiv.org/abs/2305.19148. An easier / more robust target might be something like 'be helpful'. Though I agree in general ICL as Bayesian inference (see e.g. http://ai.stanford.edu/blog/understanding-incontext/ and follow the citation trail, there are a lot of recent related works) suggests that the longer the prompt, the more likely it would be to 'locate the task'.

I'll also note that the role of the Constitution in Constitutional AI (https://www.anthropic.com/index/claudes-constitution) seems quite related to your 3rd paragraph.

This comment will focus on the specific approaches you set out, rather than the high level question, although I'm also interested in seeing comments from others on how difficult it is to solve alignment, and why.

The approach you've set out resembles Coherent Extrapolated Volition (CEV), which was proposed by Yudkowsky and later discussed by Bostrom. I'm not sure what the consensus is on CEV, but here are a few thoughts I've had in my head since I last thought about CEV (several years ago now).

  • How do we choose the correct philosophers and intellectuals -- e.g. would we want Nietzsche or Wagner to be on the list of intellectuals, given the (arguable) links to the Nazis?
  • How do we extrapolate? (i.e. how do we determine whether the intellectuals on the list would want the action to happen?)
    • For example, Plato was arguably in favour of dictatorships and preferred them over democracies, but recent history seems to suggest that democracies have fared better than dictatorships -- should we extrapolate that Plato would prefer democracies if he lived today? How do we know?
    • Another example, perhaps a bit closer to home: some philosophers might argue that under some forms of utilitarianism, the ends justify the means, and it is appropriate to steal resources in order to fund activities which are in the best long-term interests of humanity. Even if those philosophers say they don't believe that, they might just be pandering to societal expectations, and the AI might extrapolate that they would endorse it if unfettered.

In other words, I don't think this clearly guards against power-seeking behaviour.

"How do we choose the correct philosophers?" Choose nearly all of them; don't be selective. Because the AI must get approval from every philosopher, this will be a severe constraint, but it ensures that the AI's actions will be unambiguously good. Even if the AI has to make contentious extrapolations about some of the philosophers, I don't think it would be free to do anything awful.

Under that constraint, I wonder if the AI would be free to do anything at all. 

Ok, maybe don't include every philosopher. But I think it would be good to include people with a diverse range of views: utilitarians, deontologists, animal rights activists, human rights activists, etc. I'm uncomfortable with the thought of AI unilaterally imposing a contentious moral philosophy (like extreme utilitarianism) on the world.

Even with my constraints, I think AI would be free to solve many huge problems, e.g. climate change, pandemics, natural disasters, and extreme poverty.

Assuming it could be implemented, I definitely think your approach would help prevent the imposition of serious harms. 

I still intuitively think the AI could just get stuck though, given the range of contradictory views even in fairly mainstream moral and political philosophy. It would need a process for making decisions under moral uncertainty, which might entail putting additional weight on the views of certain philosophers. But because this is (as far as I know) a very recent area of ethics, the existing work could be quite badly flawed.

I think a superintelligent AI will be able to find solutions with no moral uncertainty. For example, I can't imagine what philosopher would object to bioengineering a cure to a disease.

[anonymous]
I don't think you need to commit yourself to including everyone. If it is true for any subset of people, then the point you gesture at in your post goes through. I have had similar thoughts to those you suggest in the post. If we gave the AI the goal of 'do what Barack Obama would do if properly informed and at his most lucid', I don't really get why we would have high confidence in a treacherous turn or in the AI misbehaving in a catastrophic way. The main response to this seems to be to point to examples, drawn from limited computer games, of AI not doing what we intend. I agree something similar might happen with advanced AI, but I don't see why it is guaranteed to, or why any of the arguments I have seen lend weight to any particular probability estimate of catastrophe.

It also seems like increased capabilities would in a sense increase alignment (with Obama), because more advanced AIs would have a better idea of what Obama would do.

The goal you specify in the prompt is not the goal that the AI is acting on when it responds.  Consider: if someone tells you, "Your goal is now [x]", does that change your (terminal) goals?  No, because those don't come from other people telling you things (or other environmental inputs)[1].

Understanding a goal that's been put into writing, and having that goal, are two very different things.

  1. ^

    This is a bit of an exaggeration, because humans don't generally have very coherent goals, and will "discover" new goals or refine existing ones as they learn new things.  But I think it's basically correct to say that there's no straightforward relationship between telling a human to have a goal, and them having it, especially for adults (i.e. a trained model).
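A tiny sketch of the structural point here (purely illustrative toy code, not any real system): the objective that shapes the parameters is fixed during training, while a goal written into a prompt only ever arrives as input data at inference time, after the parameters -- and whatever dispositions they encode -- are already set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a trained model: its parameters were fixed by an earlier
# training process, which optimized a training objective (e.g. next-token
# prediction) -- not anything we say to the model afterwards.
params = rng.normal(size=(4, 4))

def model(params, input_vector):
    # At inference time the model just maps inputs to outputs through its
    # already-frozen parameters.
    return input_vector @ params

# Writing a goal into the prompt only changes the input...
prompt_goal_x = rng.normal(size=4)   # stand-in for "Your goal is now X"
prompt_goal_y = rng.normal(size=4)   # stand-in for "Your goal is now Y"
print(model(params, prompt_goal_x))  # outputs differ across prompts...
print(model(params, prompt_goal_y))

# ...but no optimization step ever ran against the prompt's stated goal;
# the parameters (and whatever goal-like dispositions training baked into
# them) are identical in both cases.
```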

Sorry, I'm still a little confused. If we establish an AI's terminal goal from the get-go, why wouldn't we have total control over it?

RobertM
We don't know how to do that.  It's something that falls out of its training, but we currently don't know how to even predict what goal any particular training setup will result in, let alone aim for a specific one.

I'd suggest this thread (and the linked LW post) as a good overview of the arguments. You could also take a look at the relevant section of the Intro to EA handbook or this post.

In general, you'll probably get a better response on the forum if you spend some time engaging with the intro materials and come back with specific questions or arguments.

I've tried to engage with the intro materials, but I still have several questions:

a. Why doesn't my proposed prompt solve outer alignment?

b. Why would AI ever pursue a proxy goal at the expense of its assigned goal? The human evolution analogy doesn't quite make sense to me because evolution isn't an algorithm with an assigned goal. Besides, even when friendship doesn't increase the odds of reproduction, it doesn't decrease the odds either; so this doesn't seem like an example where the proxy goal is being pursued at the expense of the assigned goal.

c. I'v...

DavidW
Thanks for thoughtfully engaging with this topic! I've spent a lot of time exploring arguments that alignment is hard, and am also unconvinced. I'm particularly skeptical about deceptive alignment, which is closely related to your point b. I'm clearly not the right person to explain why people think the problem is hard, but I think it's good to share alternative perspectives.  If you're interested in more skeptical arguments, there's a forum tag and a lesswrong tag. I particularly like Quintin Pope's posts on the topic. 

Maybe frame it as if you're talking to a child: yes, you can tell the child to do something, but how can you be certain that it will do it?

Similarly, how can we trust the AI to actually follow the prompt? To trust it, we would fundamentally have to understand the AI, or safeguard against problems if we don't understand it. The question then becomes how your prompt is represented in machine language, which is very hard to answer.

To reiterate, ask yourself, how do you know that the AI will do what you say?

Comments

I found this post to be astute and lucid. Thanks for writing it.
