Nobody’s on the ball on AGI alignment

Leopold - thanks for a clear, vivid, candid, and galvanizing post. I agree with about 80% of it.

However, I don't agree with your central premise that alignment is solvable. We want it to be solvable. We believe that we need it to be solvable (or else, God forbid, we might have to actually stop AI development for a few decades or centuries).

But that doesn't mean it is solvable. And we have, in my opinion, some pretty compelling reasons to think that it is not solvable even in principle, (1) given the diversity, complexity, and ideological nature of many human values (which I've written about in other EA Forum posts, and elsewhere), (2) given the deep game-theoretic conflicts between human individuals, groups, companies, and nation-states (which cannot be waved away by invoking Coherent Extrapolated Volition, or 'dontkilleveryoneism', or any other notion that sweeps people's profoundly divergent interests under the carpet), and (3) given that humans are not the only sentient stakeholder species that AI would need to be aligned with (advanced AI will have implications for every other of the 65,000 vertebrate species on Earth, and most of the 1,000,000+ invertebrate species, one way or another).

Human individuals aren't aligned with each other. Companies aren't aligned with each other. Nation-states aren't aligned with each other. Other animal species aren't aligned with humans, or with each other. There is no reason to expect that any AI systems could be 'aligned' with the totality of other sentient life on Earth. Our Bayesian prior, based on the simple fact that different sentient beings have different interests, values, goals, and preferences, must be that AI alignment with 'humanity in general', or 'sentient life in general', is simply not possible. Sad, but true.

I worry that 'AI alignment' as a concept, or narrative, or aspiration, is just promising enough that it encourages the AI industry to charge full steam ahead (in hopes that alignment will be 'solved' before AI advances to much more dangerous capabilities), but it is not delivering nearly enough workable solutions to make their reckless accelerationism safe. We are getting the worst of both worlds -- a credible illusion of a path towards safety, without any actual increase in safety.

In other words, the assumption that 'alignment is solvable' might be a very dangerous X-risk amplifier, in its own right. It emboldens the AI industry to accelerate. It gives EAs (probably) false hope that some clever technical solution can make humans all aligned with each other, and make machine intelligences aligned with organic intelligences. It gives ordinary citizens, politicians, regulators, and journalists the impression that some very smart people are working very hard on making AI safe, in ways that will probably work. It may be leading China to assume that some clever Americans are already handling all those thorny X-risk issues, such that China doesn't really need to duplicate those ongoing AI safety efforts, and will be able to just copy our alignment solutions once we get them.

If we take seriously the possibility that alignment might not be solvable, we need to rethink our whole EA strategy for reducing AI X-risk. This might entail EAs putting a much stronger emphasis on slowing or stopping further AI development, at least for a while. We are continually told that 'AI is inevitable', 'the genie is out of the bottle', 'regulation won't work', etc. I think too many of us buy into the over-pessimistic view that there's absolutely nothing we can do to stop AI development, while also buying into the over-optimistic view that alignment is possible -- if we just recruit more talent, work a little more, get a few more grants, think really hard, etc.

I think we should reverse these optimisms and pessimisms. We need to rediscover some optimism that the 8 billion people on Earth can pause, slow, handicap, or stop AI development by the 100,000 or so AI researchers, devs, and entrepreneurs that are driving us straight into a Great Filter. But we need to rediscover some pessimism about the concept of 'AI alignment' itself.

In my view, the burden of proof should be on those who think that 'AI alignment with human values in general' is a solvable problem. I have seen no coherent argument that it is solvable. I've just seen people desperate to believe that it is solvable. But that's mostly because the alternative seems so alarming, i.e., the idea that (1) the AI industry is increasingly imposing existential risks on us all, (2) it has a lot of money, power, talent, influence, and hubris, (3) it will not slow down unless we make it slow down, and (4) slowing it down will require EAs to shift to a whole different set of strategies, tactics, priorities, and mind-sets than we had been developing within the 'alignment' paradigm.

yefreitor

I agree that the very strong sort of alignment you describe - with the Coherent Extrapolated Volition of humanity, or the collective interest of all sentient beings, or The Form of The Good - is probably impossible and perhaps ill-posed. Insofar as we need this sort of aligned AI for things to go as well as they possibly could, they won't.

But I don't see why that's the only acceptable target. Aligning a superintelligence with the will of basically any psychologically normal human being (narrower than any realistic target except perhaps a profit-maximizer - in which case yeah, we're doomed) would still be an ok outcome for humans: it certainly doesn't end in paperclips. And alignment with someone even slightly inclined towards impartial benevolence probably goes much better than the status quo, especially for the extremely poor.

(Animals are at much more risk here, but their current situation is also much worse: I'm extremely uncertain how a far richer world would treat factory farming)

Brian_Tomasik

I think humans may indeed find ways to scale up their control over successive generations of AIs for a while, and successive generations of AIs may be able to exert some control over their successors, and so on. However, I don't see how at the end of a long chain of successive generations we could be left with anything that cares much about our little primate goals. Even if individual agents within that system still cared somewhat about humans, I doubt the collective behavior of the society of AIs overall would still care, rather than being driven by its own competitive pressures into weird directions.

An analogy I often give is to consider our fish ancestors hundreds of millions of years ago. Through evolution, they produced somewhat smarter successors, who produced somewhat smarter successors, and so on. At each point along that chain, the successors weren't that different from the previous generation; each generation might have said that they successfully aligned their successors with their goals, for the most part. But over all those generations, we now care about things dramatically different from what our fish ancestors did (e.g., worshipping Jesus, inclusion of trans athletes, preventing children from hearing certain four-letter words, increasing the power and prestige of one's nation). In the case of AI successors, I expect the divergence may be even more dramatic, because AIs aren't constrained by biology in the way that both fish and humans are. (OTOH, there might be less divergence if people engineer ways to reduce goal drift and if people can act collectively well enough to implement them. Even if the former is technically possible, I'm skeptical that the latter is socially possible in the real world.)

Some transhumanists are ok with dramatic value drift over time, as long as there's a somewhat continuous chain from ourselves to the very weird agents who will inhabit our region of the cosmos in a million years. But I don't find it very plausible that in a million years, the powerful agents in control of the Milky Way will care that much about what certain humans around the beginning of the third millennium CE valued. Technical alignment work might help make the path from us to them more continuous, but I'm doubtful it will avert human extinction in the long run.

Hi Brian, thanks for this reminder about the longtermist perspective on humanity's future. I agree that in a million years, whatever sentient beings that are around may have little interest or respect for the values that humans happen to have now.

However, one lesson from evolution is that most mutations are harmful, most populations trying to spread into new habitats fail, and most new species go extinct within about a million years. There's huge survivorship bias in our understanding of natural history.

I worry that this survivorship bias leads us to radically over-estimate the likely adaptiveness and longevity of any new digital sentiences and any new transhumanist innovations. New autonomous advanced AIs are likely to be extremely fragile, just because most new complex systems that haven't been battle-tested by evolution are extremely fragile.

For this reason, I think we would be foolish to rush into any radical transhumanism, or any more advanced AI systems, until we have explored human potential further, and until we have become successfully, resiliently multi-planetary, if not multi-stellar. Once we have a foothold in the stars, and humanity has reached some kind of asymptote in what un-augmented humanity can accomplish, then it might make sense to think about the 'next phase of evolution'. Until then, any attempt to push sentient evolution faster will probably result in calamity.

Brian_Tomasik

Thanks. :) I'm personally not one of those transhumanists who welcome the transition to weird posthuman values. I would prefer for space not to be colonized at all in order to avoid astronomically increasing the amount of sentience (and therefore the amount of expected suffering) in our region of the cosmos. I think there could be some common ground, at least in the short run, between suffering-focused people who don't want space colonized in general and existential-risk people who want to radically slow down the pace of AI progress. If it were possible, the Butlerian Jihad solution could be pretty good both for the AI doomers and the negative utilitarians. Unfortunately, it's probably not politically possible (even domestically much less internationally), and I'm unsure whether half measures toward it are net good or bad. For example, maybe slowing AI progress in the US would help China catch up, making a competitive race between the two countries more likely, thereby increasing the chance of catastrophic Cold War-style conflict.

Interesting point about most mutants not being very successful. That's a main reason I tend to imagine that the first AGIs who try to overpower humans, if any, would plausibly fail.

I think there's some difference in the case of intelligence at the level of humans and above, versus other animals, in adaptability to new circumstances, because human-level intelligence can figure out problems by reason and doesn't have to wait for evolution to brute-force its way into genetically based solutions. Humans have changed their environments dramatically from the ancestral ones without killing themselves (yet), based on this ability to be flexible using reason. Even the smarter non-human animals display some amount of this ability (cf. the Baldwin effect). (A web search shows that you've written about the Baldwin effect and how being smarter leads to faster evolution, so feel free to correct/critique me.)

If you mean that posthumans are likely to be fragile at the collective level, because their aggregate dynamics might result in their own extinction, then that's plausible, and it may happen to humans themselves within a century or two if current trends continue.

Brian - that all seems reasonable. Much to think about!

Yes, I think we can go further and say that alignment of a superintelligent AGI even with a single individual human may well be impossible. Is such a thing mathematically verifiable as completely watertight, given the orthogonality thesis, basic AI drives and mesaoptimisation? And if it's not watertight, then all the doom flows through the gaps of imperfect, thought to be "good enough", alignment. We need a global moratorium on AGI development. This year.

Mo Putera

...we have, in my opinion, some pretty compelling reasons to think that it is not solvable even in principle, (1) given the diversity, complexity, and ideological nature of many human values... There is no reason to expect that any AI systems could be 'aligned' with the totality of other sentient life on Earth.

One way to decompose the alignment question is into 2 parts:

1. Can we aim ASI at all? (e.g. Nate Soares' What I mean by “alignment is in large part about making cognition aimable at all”)
2. Can we align it with human values? (the blockquote is an example of this)

Folks at e.g. MIRI think (1) is the hard problem and (2) isn't as hard; folks like you think the opposite. Then you all talk past each other. ("You" isn't aimed at literally you in particular, I'm summarizing what I've seen.) I don't have a clear stance on which is harder; I just wish folks would engage with the best arguments from each side.

Mo - you might be right about what MIRI thinks will be hard. I'm not sure; it often seems difficult to understand what they write about these issues, since it's often very abstract and seems not very grounded in specific goals and values that AIs might need to implement. I do think the MIRI-type approach radically under-estimates the difficulty of your point number 2.

On the other hand, I'm not at all confident that point number 1 will be easy. My hunch is that both 1 and 2 will prove surprisingly hard. Which is a good reason to pause AI research until we make a lot more progress on both issues. (And if we don't make dramatic progress on both issues, the 'pause' should remain in place as long as it takes. Which could be decades or centuries.)

howdoyousay?

I've been thinking about this very thing for quite some time, and have been thinking up concrete interventions to help the ML community / industry grasp this. DM me if you're interested to discuss further.

Luiza

I'm new to thinking about this (getting close to a year), but a thing I learned (a bit the hard way) is that translating thinking into words proves to be a good path to a translation into actions too.

concrete interventions to help the ML community / industry grasp this
<- this sounds useful to expand on, in case you did not yet do so since you posted the comment.

Zeusfyi

Singular intelligence isn’t alignable; super intelligence as being generally like 3x smarter than all humanity very likely can be solved well and thoroughly. The great filter is only a theory and honestly quite a weak one given our ability to accurately assess planets outside our solar system for life is basically zero. As a rule I can’t take anyone seriously when it comes to “projections” about what ASI does, from anyone without a scientifically complete and measurable definition of generalized intelligence.

Here’s our scientific definition:

We define generalization in the context of intelligence, as the ability to generate learned differentiation of subsystem components, then manipulate, and build relationships towards greater systems level understanding of the universal construct that governs the reality. This is not possible if physics weren’t universal for feedback to be derived. Zeusfyi, Inc is the only institution that has scientifically defined intelligence generalization. The purest test for generalization ability; create a construct with systemic rules that define all possible outcomes allowed; greater ability to predict more actions on first try over time; shows greater generalization; with >1 construct; ability to do same; relative to others.

Andreas Netteland

Regarding the analogy you use, where humans etc. not being aligned with each other implies that human-machine alignment is equally hard: Humans are in competition with other humans. Nation-states are in competition with other nation-states. However, AI algorithms are created by humans as a tool (at least, for now that seems to be the intention). That is not an argument that alignment is possible, but I do think this is a flawed analogy.

Sanjay

This is some of the finest writing I've seen on AI alignment which both (a) covers technical content, and (b) is accessible to a non-technical audience.

I particularly liked the fact that the content was opinionated; I think it's easier to engage with content when the author takes a stance rather than just hedges their bets throughout.

Lizka

This comes late, but I appreciate this post and am curating it. I think the core message is an important one, some sections can help people develop intuitions for what the problems are, and the post is written in an accessible way (which is often not the case for AI safety-related posts). As others noted, the post also made a bunch of specific claims that others can disagree with as opposed to saying vague things or hedging a lot, which I also appreciate (see also epistemic legibility).

I share Charlie Guthmann's question here: I get the sense that some work is in a fuzzy grey area between alignment and capabilities, so comparing the amount of work being done on safety vs. capabilities is difficult. I should also note that I don't think all capabilities work can be defended as safety-relevant (see also my own post on safety-washing).

...

Quick note: I know Leopold — I don't think this influenced my decision to curate the post, but FYI.

Jan_Kulveit

In my view this is a bad decision.

As I wrote on LW

Sorry but my rough impression from the post is you seem to be at least as confused about where the difficulties are as the average of the alignment researchers you think are not on the ball - and the style of somewhat strawmanning everyone & strong words is a bit irritating.

In particular I don't appreciate the epistemics of these moves together

1. Appeal to seeing things from close proximity: "Then I got to see things more up close. And here’s the thing: nobody’s actually on the friggin’ ball on this one!"
2. Straw-manning and weakmanning what almost everyone else thinks and is doing
3. Use of emotionally compelling words like 'real science' for vaguely defined subjects where the content may be the opposite of what people imagine. Is the empirical alchemy-style ML type of research what's advocated for as the real science?
4. What overall sounds more like the aim is to persuade, rather than explain

I think curating this signals this type of bad epistemics is fine, as long as you are strawmanning and misrepresenting others in a legible way and your writing is persuasive. Also there is no need to actually engage with existing arguments, you can just claim seeing things more up close.

Also to what extent are moderator decisions influenced by status and centrality in the community...
... if someone new and non-central to the community came up with this brilliant set of ideas how to solve AI safety:
1. everyone working on it is not on the ball. why? they are all working on wrong things!
2. promising is to do something very close to how empirical ML capabilities research works
3. this is a type of problem where you can just throw money at it and attract better ML talent
... I doubt this would have a high chance of becoming curated.

[anonymous]

Anecdata: thanks for curating, I didn’t read this when it first came through and now that I did, it really impacted me.

Edit: Coming back after approaching it on LessWrong and now I’m very confused again - seems to have been much less well received. What someone here calls a “great balance of technical and generally legible content” might be considered “strawmanning and frustrating” over there, and I really don’t know what to think.

As others noted, the post also made a bunch of specific claims that others can disagree with as opposed to saying vague things or hedging a lot, which I also appreciate (see also epistemic legibility).

Thank you for acknowledging this and emphasizing the specific claims being made. I'm guessing you didn't mean to cast aspersions through a euphemism. I'd respect you not being as explicit about it if that is part of what you meant here.

For my part, though, I think you're understating how much of a problem those other posts are, so I feel obliged to emphasize how the vagueness and hedging in some of those other posts has, wittingly or not, served to spread hazardous misinformation. To be specific, here's an excerpt from this other comment I made raising the same concern:

Others who've tried to get across the same point [Leopold is] making have, instead of explaining their disagreements, generally alleged that almost everyone else in the entire field of AI alignment is literally insane.
[...]
It counts as someone making a bold, senseless attempt to, arguably, dehumanize hundreds of their peers.

This isn't just a negligible error from somebody recognized as part of a hyperbolic fringe in the AI safety/alignment community. It's direly counterproductive when it comes from leading rationalists, like Eliezer Yudkowsky and Oliver Habryka, who wield great influence in their own right, and are taken very seriously by hundreds of other people.

Ben Stewart

This was enlightening and convincing. Plus a great read!

I'm late to this, but I'm surprised that this post doesn't acknowledge the approach of inverse reinforcement learning (IRL) which Stuart Russell discussed on the 80,000 Hours podcast and which also featured in his book Human Compatible.

I'm no AI expert, but this approach seems to me like it avoids the "as these models become superhuman, humans won’t be able to reliably supervise their outputs" problem, as a superhuman AI using IRL doesn't have to be supervised, it just observes us and through doing so better understands our values.

I'm generally surprised at the lack of discussion of IRL in the community. When one of the world leaders in AI says a particular approach in AI alignment is our best hope, shouldn't we listen to them?

How can we make IRL 100% watertight? Humans make mistakes and do bad things. We can't risk that happening even once with a superintelligent AI. You can't do trial and error if you're dead after the first wrong try. Or the SAI could execute a million requests of it safely, but then the million-and-first initiates an unstoppable chain of actions that leads to a sterile planet. The way I see it is that all the doom flows through the tiniest gap in imperfect alignment once you reach a certain power level. Can IRL ever lead to mathematically verifiable 100% perfect alignment?
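
The "million-and-first request" worry can be made concrete with a little arithmetic. The sketch below is only an illustration under two loudly stated assumptions that are not in the comment above: each action fails catastrophically with the same small probability, and failures are independent. The value `p_fail = 1e-6` is made up for the example.

```python
def prob_all_safe(p_fail, n_actions):
    """Chance that n independent actions all go safely,
    given an assumed per-action failure probability p_fail."""
    return (1.0 - p_fail) ** n_actions

p_fail = 1e-6  # hypothetical one-in-a-million failure rate, for illustration only

for n in (1_000, 1_000_000, 10_000_000):
    # The probability of never failing shrinks as the number of actions grows.
    print(f"{n:>10} actions: P(all safe) = {prob_all_safe(p_fail, n):.6f}")
```

Under these assumed numbers, a one-in-a-million failure rate still leaves only about a 37% chance of getting through a million actions without a single failure, which is the sense in which a tiny gap matters once the number of high-stakes actions is large.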

This is exactly the discussion I want! I’m mostly just surprised no one seems to be talking about IRL.

I don’t have firm answers (when it comes to technical AI alignment I’m a bit of a noob) but when I listened to the podcast with Stuart Russell I remember him saying that we need to build in a degree of uncertainty into the AI so they essentially have to ask for permission before they do things, or something like that. Maybe this means IRL starts to become problematic in much the same way as other reinforcement learning approaches as in some way we do “supervise” the AI, but it certainly seems like easier supervision compared to the other approaches.

Also as you say the AI could learn from bad people. This just seems an inherent risk of all possible alignment approaches though!

I guess a counter to the "asking for permission" as a solution thing is: how do you stop the AI from manipulating or deceiving people into giving it permission? Or acting in unsafe ways to minimise its uncertainty (or even keep its uncertainty within certain bounds). It's like the alignment problem just shifts elsewhere (also, mesaoptimization, or inner alignment, isn't really addressed by IRL).

Re learning from bad people, I think a bigger problem is instilling any human-like motivation into them at all.

You're making me want to listen to the podcast episode again. From a quick look at the transcript, Russell thinks the three principles of AI should be:

The machine’s only objective is to maximize the realization of human preferences.
The machine is initially uncertain about what those preferences are.
The ultimate source of information about human preferences is human behavior.

It certainly seems such an IRL-based AI would be more open to being told what to do than a traditional RL-based AI.

RL-based AI generally doesn't want to obey requests or have its goal be changed, because this hinders/prevents it from achieving its original goal. IRL-based AI literally has the goal of realising human preferences, so it would need to have a pretty good reason (from its point of view) not to obey someone's request.

Certainly early on, IRL-based AI would obey any request you make provided you have baked in a high enough degree of uncertainty into the AI (principle 2). After a while, the AI becomes more confident about human preferences and so may well start to manipulate or deceive people when it thinks they are not acting in their best interest. This sounds really concerning, but in theory it might be good if you have given the AI enough time to learn.

For example, after a sufficient amount of time learning about human preferences, an AI may say something like "I'm going to throw your cigarettes away because I have learnt people really value health and cigarettes are really bad for health". The person might say "no don't do that I really want a ciggie right now". If the AI ultimately knows that the person really shouldn't smoke for their own wellbeing, it may well want to manipulate or deceive the person into throwing away their cigarettes e.g. through giving an impassioned speech about the dangers of smoking.

This sounds concerning but, provided the AI has had enough time to properly learn about human preferences, the AI should, in theory, do the manipulation in a minimally-harmful way. It may for example learn that humans really don't like being tricked, so it will try to change the human's mind just by giving the person the objective facts of how bad smoking is, rather than more devious means. The most important thing seems to be that the IRL-based AI has sufficient uncertainty baked into them for a sufficient amount of time, so that they only start pushing back on human requests when they are sufficiently confident they are doing the right thing.
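
To make the "uncertainty first, confidence later" idea concrete, here is a toy sketch of Bayesian preference inference in the spirit of the three principles. It is not Russell's formulation and not a real IRL algorithm: the two candidate preference hypotheses, the noisy-choice (Boltzmann) model of human behavior, and the 0.95 confidence threshold are all assumptions invented for illustration.

```python
import math

# Hypothetical candidate preferences; the machine starts uncertain between them.
HYPOTHESES = {
    "values_health": {"smoke": -1.0, "abstain": 1.0},
    "values_smoking": {"smoke": 1.0, "abstain": -1.0},
}

def likelihood(action, rewards):
    """Assumed noisy-choice model: higher-reward actions are chosen
    more often, but not always."""
    weights = {a: math.exp(r) for a, r in rewards.items()}
    return weights[action] / sum(weights.values())

def update(posterior, action):
    """Bayes' rule: reweight each hypothesis by how well it explains
    the observed human action."""
    unnormalized = {h: p * likelihood(action, HYPOTHESES[h])
                    for h, p in posterior.items()}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

def decide(posterior, threshold=0.95):
    """Defer to the human until one hypothesis passes the assumed threshold."""
    best, prob = max(posterior.items(), key=lambda kv: kv[1])
    return f"act on {best}" if prob >= threshold else "ask the human"

posterior = {"values_health": 0.5, "values_smoking": 0.5}  # initial uncertainty
decisions = []
for observed in ["abstain", "abstain", "smoke", "abstain"]:
    posterior = update(posterior, observed)  # learn from behavior
    decisions.append(decide(posterior))
print(decisions)
```

In this toy, the agent asks for permission while its belief is below the threshold and starts acting on its own estimate once it is above. It says nothing about whether the learned estimate is correct, and it does not touch the manipulation or inner-alignment worries raised elsewhere in this thread.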

I'm far from certain that IRL-based AI is watertight (my biggest concern remains the AI learning from irrational/bad people), but on my current level of (very limited) knowledge it does seem the most sensible approach.

Interesting about the "System 2" vs "System 1" preference fulfilment (your cigarettes example). But all of this is still just focused on outer alignment. How does the inner shoggoth get prevented from mesaoptimising on an arbitrary goal?

I’m afraid I’m not well read on the problem of inner alignment and why optimizing on an arbitrary goal is a realistic worry. Can you explain why this might happen / provide a good, simple resource that I can read?

The LW wiki entry is good. Also the Rob Miles video I link to above explains it well with visuals and examples. I think there are 3 core parts to the AI x-risk argument: the orthogonality thesis (Copernican revolution applied to mind-space; why outer alignment is hard), Basic AI Drives (convergent instrumental goals leading to power seeking), and Mesaoptimizers (why inner alignment is hard).

Thanks. I watched Robert Miles' video which was very helpful. Especially the part where he explains why an AI might want to act in accordance with its base objective in a training environment only to then pursue its mesa objective in the real world.

I'm quite uncertain at this point, but I have a vague feeling that Russell's second principle (The machine is initially uncertain about what those preferences are) is very important here. It is a vague feeling though...

D_M_x

I appreciate that you are putting out numbers and explain the current research landscape, but I am missing clear actions.

The closest you are coming to proposing them is here:

We need a concerted effort that matches the gravity of the challenge. The best ML researchers in the world should be working on this! There should be billion-dollar, large-scale efforts with the scale and ambition of Operation Warp Speed or the moon landing or even OpenAI’s GPT-4 team itself working on this problem.[17] Right now, there’s too much fretting, too much idle talk, and way too little “let’s roll up our sleeves and actually solve this problem.”

But that still isn't an action plan. Say you convince me, most of the EA Forum and half of all university educated professionals in your city that this is a big deal. What, concretely, should we do now?

Sanjay

I think the suggestion of ELK work along the lines of Collin Burns et al counted as a concrete step that alignment researchers could take.

There may be other types of influence available for those who are not alignment researchers, which Leopold wasn't precise about. E.g. those working in the financial system may be able to use their influence to encourage more alignment work.

[anonymous]

80,000 Hours has a bunch of ideas on their AI problem profile.

(I'm not trying to be facetious. The main purpose of this post to me seems to be motivational: "I’m just trying to puncture the complacency I feel like many people I encounter have." Plus nudging existing alignment researchers towards more empirical work. [Edit: This post could also be concrete career advice if you're someone like Sanjay who read 80,000 Hours' post on the number of alignment researchers and was left wondering "...so...is that basically enough, or...?" After reading this post, I'm assuming that Leopold's answer at least is "HELL NO."])

Charlie_Guthmann

It seems plausible that there are ≥100,000 researchers working on ML/AI in total. That’s a ratio of ~300:1, capabilities researchers:AGI safety researchers.

Barely anyone is going for the throat of solving the core difficulties of scalable alignment. Many of the people who are working on alignment are doing blue-sky theory, pretty disconnected from actual ML models.

One question I'm always left with is: what is the boundary between being an AGI safety researcher and a capabilities researcher?

For instance, my friend is getting his PhD in machine learning, he barely knows about EA or LW, and definitely wouldn't call himself a safety researcher. However, when I talk to him, it seems like the vast majority of his work deals with figuring out how ML systems act when put in foreign situations wrt the training data.

I can't claim to really understand what he is doing but it sounds to me a lot like safety research. And it's not clear to me this is some "blue-sky theory". A lot of the work he does is high-level maths proofs, but he also does lots of interfacing with ml systems and testing stuff on them. Is it fair to call my friend a capabilities researcher?

more better

I have only dabbled in ML but this sounds like he may just be testing to see how generalizable models are / evaluating whether they are overfitting or underfitting the training data based on their performance on test data (data that hasn’t been seen by the model and was withheld from the training data). This is often done to tweak the model to improve its performance.
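
For anyone unfamiliar with the train/test idea, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made up for illustration (the synthetic data, the noise level, the choice of a 1-nearest-neighbor "memorizer"); it is not a description of the friend's actual research.

```python
import random

def make_data(n, rng):
    """Noisy samples of y = 2x on [0, 1); the noise is what a memorizer overfits."""
    return [(x, 2 * x + rng.gauss(0, 0.5))
            for x in (rng.random() for _ in range(n))]

def predict(train, x):
    """1-nearest-neighbor: return the label of the closest training point,
    i.e. pure memorization of the training data."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def mean_squared_error(train, data):
    """Average squared gap between predictions and the true labels in `data`."""
    return sum((predict(train, x) - y) ** 2 for x, y in data) / len(data)

rng = random.Random(0)     # fixed seed so the run is repeatable
train = make_data(50, rng)
test = make_data(50, rng)  # withheld data the model never saw

train_error = mean_squared_error(train, train)
test_error = mean_squared_error(train, test)
print(f"train error: {train_error:.3f}, test error: {test_error:.3f}")
```

The memorizer scores perfectly on the data it has seen and worse on the withheld data. That gap between training and test performance is the usual sign of overfitting.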

Charlie_Guthmann

I definitely have very little idea what I’m talking about but I guess part of my confusion is inner alignment seems like a capability of ai? Apologies if I’m just confused.

akash 🔸

ML systems act when put in foreign situations wrt the training data.

Could you elaborate on this more? My guess is that they could be working on the ML ethics side of things, which is great, but different than the Safety problem.

Charlie_Guthmann

I don't remember specifics, but he was looking at whether you could make certain claims about models acting a certain way on data not in the training data, based on the shape and characteristics of the training data. I know that's vague, sorry. I'll try to ask him and get a better summary.

I have so far gotten the same impression: making RLHF work as a strategy, by iteratively and gradually scaling it in a very operationally secure way, seems like maybe the most promising approach. My viewpoint right now remains the one you've expressed: as much as RLHF++ has going for it in a relative sense, it leaves a lot to be desired in an absolute sense in light of the alignment/control problem for AGI.

Overall, I really appreciate how this post condenses, in detail, what is increasingly common knowledge about just how inadequate the sum total of major approaches being taken to alignment is. I've read analyses with the same conclusion from several other AGI safety/alignment researchers during the last year or two. Yet where I hit a wall is my strong sense that any alternative approaches could just as easily succumb to most if not all of the same major pitfalls you list the RLHF++ approach as having to contend with. In that sense, I also feel most of your points are redundant. To get specific about how your criticisms of RLHF apply to all the other alignment approaches as well...

This currently feels way too much like “improvise as we go along and cross our fingers” to be Plan A; this should be Plan B or Plan E.

Whether it's an approach inspired by the paradigms established in light of Christiano, Yudkowsky, interpretability research, or elsewise, I've gotten the sense essentially all alignment researchers honestly feel the same way about whatever approach to RLHF they're taking.

It might well not work. I expect this to harvest a bunch of low-hanging fruit [...] This really shouldn't be our only plan.

I understand how it can feel like, based on how some people tend to talk about RLHF, and sometimes interpretability, they're implying or suggesting that we'll be fine with just this one approach. At the same time, as far as I'm aware, when you get behind any hype, almost everyone admits that whatever particular approach to alignment they're taking may fail to generalize and shouldn't be the only plan.

It rests on pretty unclear empirical assumptions on how crunchtime will go.

I've gotten the sense that the empirical assumption for how crunchtime will go among researchers taking the RLHF approach is, for lack of a better term, kind of a medium-term forecast for the date of the tipping point for AGI, i.e., probably at least between 2030 and 2040, as opposed to between 2025 and 2030.

Given this or that certain chain/sequence of logical assumptions about the trajectory or acceleration of capabilities research, there of course is an intuitive case to be made, on rational-theoretic grounds, for acting/operating under the presumption in practice that forecasts of short(er) AGI timelines, e.g., between 1 and 5 years out, are just correct and the most accurate.

At the same time, anyone's models for the timeline and/or trajectory towards AGI could, just as easily, be totally wrong. Those research teams most dedicated to really solving the control problem for transformative/general AI with the shortest timelines are also acting under assumptions derived from models that severely lack any empirical basis.

As far as I'm aware, there is a combined set of several for-profit startups, and non-profit research organizations, that have been trialing state-of-the-art approaches for prediction markets and forecasting methodologies, especially timelines and trajectories of capabilities research for transformative/general AI.

During the last few years, they've altogether received, at least, a few million dollars to run so many experiments to determine how to achieve more empirically based models for AI timelines or trajectories. While there may potentially be valuable insights for empirical forecasting methods overall, I'm not aware of any results at all vindicating, literally, any theoretical model for capabilities forecasting.

I’m not sure this plan puts us on track to get to a place where we can be confident that scalable alignment is solved. By default, I’d guess we’d end up in a fairly ambiguous situation.

This is yet another criticism of the RLHF approach that I understand as applying just as easily to any approach to alignment you've mentioned, and even to every remotely significant approach to alignment you didn't mention but I've also encountered.

You also mentioned, for both the relatively cohesive (set of) approach(es) inspired by Christiano's research, or for more idiosyncratic approaches, a la MIRI, you perceive to be a dead end the very abstract, and almost purely mathematical, approach being taken. That's an understandable and sympathetic take. All things being equal, I'd agree with your proposal for what should be done instead:

We need a concerted effort that matches the gravity of the challenge. The best ML researchers in the world should be working on this! There should be billion-dollar, large-scale efforts with the scale and ambition of Operation Warp Speed or the moon landing or even OpenAI’s GPT-4 team itself working on this problem.

Unfortunately, all things are not equal. The societies we live in will, for the foreseeable future, keep operating on a set of very unfortunate incentive structures.

To so strongly invest into ML-based approaches to alignment research, in practice, often entails working in some capacity of advancing capabilities research even more, especially in industry and the private sector. That's a major reason why, regardless of whatever ways they might be superior, ML-based approaches to alignment are often eschewed.

I.e., most conscientious alignment researchers don't feel like their field is ready to pivot so fully to ML-based approaches to alignment without, in the process, increasing whatever existential risk super-human AGI might pose to humanity, as opposed to decreasing such risk. As harsh as I'm maybe being, I also think the most novel and valuable propositions in this post are your own, which you've downplayed:

For example, I’m really excited about work like this recent paper (paper, blog post on broader vision), which prototypes a method to detect “whether a model is being honest” via unsupervised methods. More than just this specific result, I’m excited about the style:
Use conceptual thinking to identify methods that might plausibly scale to superhuman methods (here: unsupervised methods, which don’t rely on human supervision)
Empirically test this with current models.
I think there’s a lot more to do in this vein—carefully thinking about empirical setups that are analogous to the core difficulties of scalable alignment, and then empirically testing and iterating on relevant ML methods.

My one recommendation is that you don't dwell any longer on so many things in AI alignment as a field that most alignment researchers already acknowledge, and get down to your proposals for taking an evidence-based approach to expanding the robustness of alignment of unsupervised systems. That's as exciting a new research direction as I've heard of in the last year too!

(Importantly, from my understanding, this isn’t OpenAI being evil or anything like that—OpenAI would love to hire more alignment researchers, but there just aren’t many great researchers out there focusing on this problem.)

Thank you for emphasizing that you're not implying OpenAI is evil just because some practices at OpenAI may be inadequate. I feel like I shouldn't have to thank you for that, though I do, just to emphasize how backwards the thinking and discourse in the AI safety/alignment community often is when a pall of fear and paranoia is cast on all AI capabilities researchers.

During a recent conversation about AI alignment with a few others, when I expressed a casual opinion about how AGI labs have some particularly mistaken practice, I too felt a need to clarify I didn't mean to imply that Sundar Pichai or Sam Altman are evil because of it.

I don't even remember right now what that point of criticism I made was. I don't remember if that conversation was last week or the week before. It hasn't stuck in my mind because it didn't feel that important. It was an offhand comment about a relatively minor mistake AGI labs are making, tangential to the main argument I was trying to make.

Yet it's telling that I felt a need to clarify I wasn't implying OpenAI or DeepMind is evil, during even just a private conversation. It's telling that you've felt a need to do that in this post. It's a sign of a serious problem in the mindset of at least a minority of the AI safety/alignment community.

Another outstanding feature of this post is how you've mustered the effort to explain at all why you consider different approaches to alignment to be inadequate. This distinguishes your post from others like it from the last year. Others who've tried to get across the same point you're making have, instead of explaining their disagreements, generally alleged that almost everyone else in the entire field of AI alignment is literally insane.

That's not helpful for a few reasons. Such a claim is probably not true. It'd be harder to make a more intellectually lazy or unconvincing argument. It counts as someone making a bold, senseless attempt to, arguably, dehumanize hundreds of their peers.

This isn't just a negligible error from somebody recognized as part of a hyperbolic fringe in the AI safety/alignment community. It's direly counterproductive when it comes from leading rationalists, like Eliezer Yudkowsky and Oliver Habryka, who wield great influence in their own right, and are taken very seriously by hundreds of other people. Thank you for writing this post as a corrective to the kind of mistake a lot of your close allies have been making too.

Jan_Kulveit

Copy-pasting here from LW.

Sorry, but my rough impression from the post is that you seem to be at least as confused about where the difficulties are as the average alignment researcher you think is not on the ball - and the style of somewhat strawmanning everyone & strong words is a bit irritating.

Maybe I'm getting it wrong, but it seems the model you have for why everyone is not on the ball is something like "people are approaching it too much from a theory perspective, and the promising approach is very close to how empirical ML capabilities research works" & "this is a type of problem where you can just throw money at it and attract better ML talent".

I don't think these two insights are promising.

Also, again, maybe I'm getting it wrong, but I'm confused how similar you are imagining the current systems to be to the dangerous systems. It seems either the superhuman-level problems (eg not lying in a way no human can recognize) are somewhat continuous with current problems (eg not lying), and in that case it is possible to study them empirically. Or they are not. But different parts of the post seem to point in different directions. (Personally I think the problem is somewhat continuous, but many of the human-in-the-loop solutions are not, and just break down.)

Also, regarding what you find promising, I'm confused about what you think the 'real science' to aim for is - on one hand, it seems you think the closer the thing is to how ML is done in practice, the more real science it is. On the other hand, in your view all deep learning progress has been empirical, often via dumb hacks and intuitions (this isn't true imo).

https://www.alignmentforum.org/s/a6ne2ve5uturEEQK7

I can't get past the feeling that all the doom will flow through an asymptote of imperfect alignment. How can scalable alignment ever be watertight enough for x-risk to drop to insignificant levels? Especially given the (ML-based) engineering approach suggested. It sounds like formal, verifiable, proofs of existential safety won't ever be an end product of all this. How long do we last in a world like that, where AI capabilities continue improving up to physical limits? Can the acute risk period really be brought to a close this way?

Jonas Hallgren 🔸

TL;DR: I totally agree with the general spirit of this post, we need people to solve alignment, and we're not on track. Go and work on alignment but before you do, try to engage with the existing research, there are reasons why it exists. There are a lot of things not getting worked on within AI alignment research, and I can almost guarantee you that within six months to a year, you can find things that people haven't worked on.

So go and find these underexplored areas in a way where you engage with what people have done before you!

There’s no secret elite SEAL team coming to save the day. This is it. We’re not on track.
If timelines are short and we don’t get our act together, we’re in a lot of trouble. Scalable alignment—aligning superhuman AGI systems—is a real, unsolved problem. It’s quite simple: current alignment techniques rely on human supervision, but as models become superhuman, humans won’t be able to reliably supervise them.
But my pessimism on the current state of alignment research very much doesn’t mean I’m an Eliezer-style doomer. Quite the opposite, I’m optimistic. I think scalable alignment is a solvable problem—and it’s an ML problem, one we can do real science on as our models get more advanced. But we gotta stop fucking around. We need an effort that matches the gravity of the challenge.^[1]

I also agree that Eliezer's style of doom seems uncalled for and that this is a solvable but difficult problem. My personal p(doom) is something around 20%, and I think this seems quite reasonable.

Barely anyone is going for the throat of solving the core difficulties of scalable alignment. Many of the people who are working on alignment are doing blue-sky theory, pretty disconnected from actual ML models. Most of the rest are doing work that’s vaguely related, hoping it will somehow be useful, or working on techniques that might work now but predictably fail to work for superhuman systems.

Now I do want to give pushback on this claim, as I see a lot of people who haven't fully engaged with the more theoretical alignment landscape making it. There are only 300 people working on alignment, but those people are actually doing things, and most of them aren't doing blue-sky theory.

A note on the ARC claim:

But his research now (“heuristic arguments”) is roughly “trying to solve alignment via galaxy-brained math proofs.” As much as I respect and appreciate Paul, I’m really skeptical of this: basically all deep learning progress has been empirical, often via dumb hacks^[3] and intuitions, rather than sophisticated theory. My baseline expectation is that aligning deep learning systems will be achieved similarly.^[4]

This is essentially a claim about the methodology of science in that working on existing systems gives more information and breakthroughs compared to working on a blue-sky theory. The current hypothesis for this is that it is just a lot more information-rich to do real-world research. This is, however, not the only way to get real-world feedback loops. Christiano is not working on blue sky theory; he's using real-world feedback loops in a different way; he looks at the real world and looks for information that's already there.

A discovery of this type is, for example, the tragedy of the commons; whilst we could have created computer simulations to see the process in action, it's 10x easier to look at the world and see the real-time failures. He tells stories and sees where they fail in the future as his research methodology. This gives bits of information on where to do future experiments, like how we would be able to tell that humans would fail to stop overfishing without actually running an experiment on it.

This is also what John Wentworth does with his research; he looks at the real world as a reference frame which is quite rich in information. Now a good question is why we haven't seen that many empirical predictions from Agent Foundations. I believe it is because alignment is quite hard, and specifically, it is hard to define agency in a satisfactory way due to some really fuzzy problems (boundaries, among others) and, therefore, hard to make predictions.

We don't want to mathematize things too early either, as doing so would put us into a predefined reference frame that it might be hard to escape from. We want to find the right ballpark for agents since if we fail we might base evaluations on something that turns out to be false.

In general, there's a difference in the types of problems in alignment and empirical ML; the reference class of a "sharp-left turn" is different from something empirically verifiable as it is unclearly defined, so a good question is how we should turn one into the other. This question of how we take recursive self-improvement, inner misalignment and agent foundations into empirically verifiable ML experiments is actually something that most of the people I know in AI Alignment are currently actively working on.

This post from Alexander Turner is a great example of doing this as they try "just retargeting the search"

Other people are trying other things, such as bounding the maximisation in RL into quantilisers. This would, in turn, make AI more "content" with not maximising. (fun parallel to how utilitarianism shouldn't be unbounded)

I could go on with examples, but what I really want to say here is that alignment researchers are doing things; it's just hard to realise why they're doing things when you're not doing alignment research yourself. (If you want to start, book my calendly and I might be able to help you.)

So what does this mean for an average person? You can make a huge difference by going in and engaging with arguments and coming up with counter-examples, experiments and theories of what is actually going on.

I just want to say that it's most likely paramount to engage with the existing alignment research landscape before you start, as it's free information and it's easy to fall into traps if you don't. (A good resource for avoiding some traps is John's Why Not Just sequence.)

There's a couple of years' worth of research there; it is not worth rediscovering from the ground up. Still, this shouldn't stop you. Go and do it; you don't need a hero licence.

[anonymous]

I'm very supportive of this post. Also I will shamelessly share here a sequence I posted in February called "The Engineer's Interpretability Sequence". One of the main messages of the sequence could be described as how existing mechanistic interpretability research is not on the ball.

jacquesthibs

I agree with this post. I've been reading many more papers since first entering this field because I've been increasingly convinced of the value of treating alignment as an engineering problem and pulling insights from the literature. I've also been trying to do more thinking about how to update on the current paradigm from the classic Yud and Bostrom alignment arguments. In this respect, I applaud Quintin Pope for his work.

This week, I will send a grant proposal to continue my work in alignment. I'd be grateful if you could look at my proposal and provide some critique. It would be great to have an outside view (like yours) to give feedback on it.

Current short summary: "This project comprises two main interrelated components: accelerating AI alignment research by integrating large language models (LLMs) into a research system, and conducting direct work on alignment with a focus on interpretability and steering the training process towards aligned AI. The "accelerating alignment" agenda aims to impact both conceptual and empirical aspects of alignment research, with the ambitious long-term goal of providing a massive speed-up and unlocking breakthroughs in the field. The project also includes work in interpretability (using LLMs for interpreting models; auto-interpretability), understanding agency in the current deep learning paradigm, and designing a robustly aligned training process. The tools built will integrate seamlessly into the larger alignment ecosystem. The project serves as a testing ground for potentially building an organization focused on using language models to accelerate alignment work."

Please send me a DM if you'd like to give feedback!

[epistemic status: half-joking]

There’s no secret elite SEAL team coming to save the day.

Are there any organized groups of alignment researchers who serve as a not-so-secret, normal civilian equivalent of a SEAL team trying their best to save the day, while also trying to make no promises of being some kind of elite, hyper-competent super-team?

Jobst Heitzig (EMPO project)

At this point, I'll hear out the gameplan to align AGI from any kind of normie SEAL team. We're really scraping the bottom of the barrel right now.

The challenge isn’t figuring out some complicated, nuanced utility function that “represents human values”; the challenge is getting AIs to do what it says on the tin—to reliably do whatever a human operator tells them to do.

IMO, this implies we need to design AI systems so that they satisfice rather than maximize: perform a requested task at a requested performance level but no better than that and with a requested probability but no more likely than that.

Eevee🔹

Like an SLA (service level agreement)!

Jobst Heitzig (EMPO project)

Not exactly. A typical SLA only contains a lower bound, which would still allow for maximization. The program for a satisficer in the sense I meant would state that the AI system really aims to do no better than requested. So, for example, quantilizers would not qualify, since they might still (by chance) choose the action that maximizes return.
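A toy sketch of the distinction, under loudly stated assumptions: the five actions and their expected returns are made up, the quantilizer assumes a uniform base distribution over actions, and `aspiration_chooser` is only one hypothetical way to read "do no better than requested" (pick the action whose expected return is closest to the requested level). It is not meant as the actual design being described here.

```python
import random

# Hypothetical expected returns for five actions (illustrative numbers only).
RETURNS = {"a": 1.0, "b": 3.0, "c": 5.0, "d": 8.0, "e": 10.0}

def maximizer(returns):
    """Always pick the action with the highest expected return."""
    return max(returns, key=returns.get)

def quantilizer_support(returns, q):
    """Actions in the top-q fraction, assuming a uniform base distribution."""
    ranked = sorted(returns, key=returns.get, reverse=True)
    k = max(1, round(q * len(ranked)))
    return ranked[:k]

def quantilizer(returns, q, rng):
    """Sample uniformly from the top-q fraction of actions."""
    return rng.choice(quantilizer_support(returns, q))

def aspiration_chooser(returns, target):
    """Hypothetical satisficer: pick the action closest to the requested level."""
    return min(returns, key=lambda action: abs(returns[action] - target))

rng = random.Random(0)
print("maximizer:", maximizer(RETURNS))
print("quantilizer support (q=0.4):", quantilizer_support(RETURNS, 0.4))
print("quantilizer sample:", quantilizer(RETURNS, 0.4, rng))
print("aspiration chooser (target=5):", aspiration_chooser(RETURNS, 5.0))
```

The maximizing action is always inside the quantilizer's support, so it can be drawn by chance, whereas the aspiration-based chooser never selects it unless the requested level is itself near the maximum.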

Stephen McAleese

Great post. What I find most surprising is how small the scalable alignment team at OpenAI is. Though similar teams in DeepMind and Anthropic are probably bigger.

Pato

The challenge isn’t figuring out some complicated, nuanced utility function that “represents human values”; the challenge is getting AIs to do what it says on the tin—to reliably do whatever a human operator tells them to do.

Why do you think this? I infer from what I've seen written in other posts and comments that this is a common belief, but I can't find the reasons why.

The fact that there are specific, really difficult problems with aligning ML systems doesn't mean that the original, really difficult problem of finding and specifying the objectives that we want for a superintelligence has been solved.

I hate it because it makes it seem like alignment is a technical problem that can be solved by a single team and, as you put it in your other post, we should just race and win against the bad guys.

I could try to envision what type of AI you are thinking of and how you would use it, but I would prefer it if you told me. So, what would you ask your aligned AGI to do, and how would it interpret that? And how are you so sure that most alignment researchers would ask it the same things as you?

Peter Slattery 🔸

(On phone so rushed reply)

Thanks for this, it's well written and compelling.

Who needs to do what differently do you think?

Do you have object level recommendations for specific audiences?

For instance, a call for more of a specific type of project to increase the number of people working on AI Safety, their quality of work, or coordination, etc.?

Merallis

I'm no expert on this at all but this is very interesting. I wonder if the key to alignment is some kind of inductive alignment where rather than designing a superintelligent system T_N in a vacuum you design a series of increasingly intelligent systems T_0..T_N where the alignment is inbuilt at each level and a human aligns the basic T_0. i.e. you build T_1 as a small advancement of T_0 such that it can be aligned by T_0 and so on.
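One way to put rough numbers on the chain idea is a back-of-the-envelope sketch. The assumption here is mine and is a strong one: each hand-off from T_k to T_{k+1} preserves alignment independently with the same probability `p_step`. Under that assumption the whole chain T_0..T_N stays aligned with probability `p_step ** N`, so small per-step failure rates compound. The function name and the example values are hypothetical.

```python
def chain_reliability(p_step: float, n_steps: int) -> float:
    """Probability that all n_steps hand-offs preserve alignment,
    assuming each succeeds independently with probability p_step."""
    return p_step ** n_steps

# Compare a few per-step success rates over chains of different lengths.
for p_step in (0.9, 0.99, 0.999):
    row = [round(chain_reliability(p_step, n), 3) for n in (10, 50, 100)]
    print(f"p_step={p_step}: N=10, 50, 100 -> {row}")
```

For instance, a 99% per-step success rate over 50 steps leaves only roughly a 60% chance that the final system is still aligned, so under this simple model the per-step verification would need to be very reliable, or the chain short.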

Riccardo

Loved the language in the post! To the point without having to use unnecessary jargon.

There are two things I'd like you to elaborate on if possible:

> "the challenge is getting AIs to do what it says on the tin—to reliably do whatever a human operator tells them to do."

If I understand correctly, you imply that there is still a human operator to a superhuman AGI. Do you think this is the way that alignment will work out? What I see is that humans have flaws. Do we really want to give a "genie" / extremely powerful tool to humans who already struggle with the powerful tools that they have? At least right now these powerful tools are in the hands of the more responsible few, but if they become more widely accessible, that's very different.

What do you think of going the direction of developing a "Guardian AI", which would still solve the alignment problem using the tools of ML, but involving humans giving up control of the alignment?

The second one is more practical: which action do you think one should take? I've of course read the recommendations that other people have put out there so far, but would be curious to hear your take on this.

Richard Baxter

AI alignment is a myth; it assumes that humans are a single homogeneous organism and that AI will be one also. Humans have competing desires and interests and so will the AGIs created by them, none of which will have independent autonomous motivations without being programmed to develop them.

Even on an individual human level alignment is a relative concept. Whether a human engages in utilitarian or deontological reasoning depends on their conception of the self/other (whether they would expect to be treated likewise).

Regarding LLM specific risk, they are not currently an intelligence threat. Like any tech however they can be deployed by malicious actors in the advancement of arbitrary goals. One reason OpenAI are publicly trialling the models early is to help everyone including researchers learn to navigate their use cases, and develop safeguards to hinder exploitation.

Addressing external claims, limiting research is a bad strategy given population dynamics; the only secure option for liberal nations is in the direction of knowledge.

Zeusfyi

-1

If you were a researcher at Zeusfyi; here’s what our chief scientist would advise:

1. That’s a ratio of ~300:1, capabilities researchers:AGI safety researchers. The scalable alignment team at OpenAI has all of ~7 people.

Even a team of 7 is more than sufficient to solve scalable alignment; the problem stems from a lack of belief in yourself, your own cause, and your ability to solve it, due to a false belief that it is a resource issue. In general, when you solve unknowns you need a wider perspective + creative IQ, which is not taught in any school (likely the opposite, honestly); aka systems-level thinkers who can relate X field to Y field and create solutions to unknowns from subknowns. Most people are afraid to “step on toes” in whatever subdivision they live in; if you wanna do great research you need to be more selfish in that way.

2. You can’t solve alignment if you can’t even define and measure intelligence generality; you can’t solve what you don’t know.

3. There’s only one reason intelligence exists; if we lived in a universe that had physics that could “lie” to you, and make up energy/rules, then nothing is predictable nor periodic, nothing can be generalized

You now have the tools to solve it

Emerson Spartz

-3

Going to say something seemingly-unpopular in a tone that usually gets downvoted but I think needs to be said anyway:

This stat is why I still have hope: 100,000 capabilities researchers vs 300 alignment researchers.

Humanity has not tried to solve alignment yet.

There's no cavalry coming - we are the cavalry.

I am sympathetic to fears of new alignment researchers being net negative, and I think plausibly the entire field has, so far, been net negative, but guys, there are 100,000 capabilities researchers now! One more is a drop in the bucket.

If you're still on the sidelines, go post that idea that's been gathering dust in your Google Docs for the last six months. Go fill out that fundraising application.

We've had enough fire alarms. It's time to act.

Sanjay

My gut reaction when reading this comment:

This comment looks like it's written in an attempt to "be inspirational", not an attempt to share a useful insight, or ask a question.

I hope this doesn't sound unkind. I recognise that there can be value in being inspirational, but it's not what I'm looking for when I'm reading these comments.

Emerson Spartz

Thanks for the feedback. I tried to do both. I think the doomerism levels are so intense right now and need to be balanced out with a bit of inspiration.

I worry that the doomer levels are so high EAs will be frozen into inaction and non-EAs will take over from here. This is the default outcome, I think.

I worry that the doomer levels are so high EAs will be frozen into inaction and non-EAs will take over from here. This is the default outcome, I think.

On one hand, as I got at in this comment, I'm more ambivalent than you about whether it'd be worse for non-EAs to take more control over the trajectory on AI alignment.

On the other hand, one reason why I'm ambivalent about effective altruists (or rationalists) retaining that level of control is that I'm afraid the doomer-ism may become an endemic or terminal disease for the EA community. AI alignment might be refreshed by many of those effective altruists currently staffing the field being replaced. So, thank you for pointing that out too. I expressed a similar sentiment in this comment, though I was more specific because I felt it was important to explain just how bad the doomer-ism has been getting.

Others who've tried to get across the same point [Leopold is] making have, instead of explaining their disagreements, generally alleged that almost everyone else in the entire field of AI alignment is literally insane.
That's not helpful for a few reasons. Such a claim is probably not true. It'd be harder to make a more intellectually lazy or unconvincing argument. It counts as someone making a bold, senseless attempt to, arguably, dehumanize hundreds of their peers.
This isn't just a negligible error from somebody recognized as part of a hyperbolic fringe in the AI safety/alignment community. It's direly counterproductive when it comes from leading rationalists, like Eliezer Yudkowsky and Oliver Habryka, who wield great influence in their own right, and are taken very seriously by hundreds of other people.

Jeffrey Kursonis

Your first comment at the top was better; it seems you were inspired. What in the entire universe of possibilities could be wrong with being inspirational?...the entire EA movement is hoping to inspire people to give and act toward the betterment of humankind...before any good idea can be implemented, there must be something to inspire a person to stand up and act. Wow, your mindset is so off of human reality. Is this an issue of post vs. comments?...who cares if someone adds original material in comments, it's a conversation. Humans are not data in a test tube...the human spirit is another way of saying, "inspired human"...when inspired humans think, good things can happen. It is the evil of banality that is so frightening. Uninspired intellect is probably what will kill us all if it's digital.

Emerson Spartz

Sanjay, I just realized you were the top comment, and now I notice that I feel confused, because your comment directly inspired me to express my views in a tone that was more opinionated and less-hedgy.

I appreciate - no, I *love* - EA's truth seeking culture but I wish it were more OK to add a bit of Gryffindor to balance out the Ravenclaw.