Take the 2025 EA Forum Survey to help inform our strategy and prioritiesTake the survey

Quick note: I occasionally run into arguments of the form "my research advances capabilities, but it advances alignment more than it advances capabilities, so it's good on net". I do not buy this argument, and think that in most such cases, this sort of research does more harm than good. (Cf. differential technological development.)

For a simplified version of my model as to why:

  • Suppose that aligning an AGI requires 1000 person-years of research.
    • 900 of these person-years can be done in parallelizable 5-year chunks (e.g., by 180 people over 5 years — or, more realistically, by 1800 people over 10 years, with 10% of the people doing the job correctly half the time).
    • The remaining 100 of these person-years factor into four chunks that take 25 serial years apiece (so that you can't get any of those four parts done in less than 25 years).

In this toy model, a critical resource is serial time: if AGI is only 26 years off, then shortening overall timelines by 2 years is a death sentence, even if you're getting all 900 years of the "parallelizable" research done in exchange.

My real model of the research landscape is more complex than this toy picture, but I do in fact expect that serial time is a key resource when it comes to AGI alignment.

The most blatant case of alignment work that seems parallelizable to me is that of "AI psychologizing": we can imagine having enough success building comprehensible minds, and enough success with transparency tools, that with a sufficiently large army of people studying the alien mind, we can develop a pretty good understanding of what and how it's thinking. (I currently doubt we'll get there in practice, but if we did, I could imagine most of the human-years spent on alignment-work being sunk into understanding the first artificial mind we get.)

The most blatant case of alignment work that seems serial to me is work that requires having a theoretical understanding of minds/optimization/whatever, or work that requires having just the right concepts for thinking about minds. Relative to our current state of knowledge, it seems to me that a lot of serial work is plausibly needed in order for us to understand how to safely and reliably aim AGI systems at a goal/task of our choosing.

A bunch of modern alignment work seems to me to sit in some middle-ground. As a rule of thumb, alignment work that is closer to behavioral observations of modern systems is more parallelizable (because you can have lots of people making those observations in parallel), and alignment work that requires having a good conceptual or theoretical framework is more serial (because, in the worst case, you might need a whole new generation of researchers raised with a half-baked version of the technical framework, in order to get people who both have enough technical clarity to grapple with the remaining confusions, and enough youth to invent a whole new way of seeing the problem—a pattern which seems common to me in my read of the development of things like analysis, meta-mathematics, quantum physics, etc.).

As an egregious and fictitious (but "based on a true story") example of the arguments I disagree with, consider the following dialog:


Uncharacteristically conscientious capabilities researcher: Alignment is made significantly trickier by the fact that we do not have an artificial mind in front of us to study. By doing capabilities research now (and being personally willing to pause when we get to the brink), I am making it more possible to do alignment research.

Me: Once humanity gets to the brink, I doubt we have much time left. (For a host of reasons, including: simultaneous discovery; the way the field seems to be on a trajectory to publicly share most of the critical AGI insights, once it has them, before wisening up and instituting closure policies after it's too late; Earth's generally terrible track-record in cybersecurity; and a sense that excited people will convince themselves it's fine to plow ahead directly over the cliff-edge.)

Uncharacteristically conscientious capabilities researcher: Well, we might not have many sidereal years left after we get to the brink, but we'll have many, many more researcher years left. The top minds of the day will predictably be much more interested in alignment work when there's an actual misaligned artificial mind in front of them to study. And people will take these problems much more seriously once they're near-term. And the monetary incentives for solving alignment will be much more visibly present. And so on and so forth.

Me: Setting aside how I believe that the world is derpier than that: even if you were right, I still think we'd be screwed in that scenario. In particular, that scenario seems to me to assume that there is not much serial research labor needed to do alignment research.

Like, I think it's quite hard to get something akin to Einstein's theory of general relativity, or Grothendieck's simplification of algebraic geometry, without having some researcher retreat to a mountain lair for a handful of years to build/refactor/distill/reimagine a bunch of the relevant concepts.

And looking at various parts of the history of math and science, it looks to me like technical fields often move forwards by building up around subtly-bad framings and concepts, so that a next generation can be raised with enough technical machinery to grasp the problem and enough youth to find a whole new angle of attack, at which point new and better framings and concepts are invented to replace the old. "A new scientific truth does not triumph by convincing its opponents and making them see the light, but rather because its opponents eventually die and a new generation grows up that is familiar with it" (Max Planck) and all that.

If you need the field to iterate in that sort of way three times before you can see clearly enough to solve alignment, you're going to be hard-pressed to do that in five years no matter how big and important your field seems once you get to the brink.

(Even the 25 years in the toy model above feels pretty fast, to me, for that kind of iteration, and signifies my great optimism in what humanity is capable of doing in a rush when the whole universe is on the line.)


It looks to me like alignment requires both a bunch of parallelizable labor and a bunch of serial labor. I expect us to have very little serial time (a handful of years if we're lucky) after we have fledgling AGI.

When I've heard the “two units of alignment progress for one unit of capabilities progress” argument, my impression is that it's been made by people who are burning serial time in order to get a bit more of the parallelizable alignment labor done.

But the parallelizable alignment labor is not the bottleneck. The serial alignment labor is the bottleneck, and it looks to me like burning time to complete that is nowhere near worth the benefits in practice.


Some nuance I'll add:

I feel relatively confident that a large percentage of people who do capabilities work at OpenAI, FAIR, DeepMind, Anthropic, etc. with justifications like "well, I'm helping with alignment some too" or "well, alignment will be easier when we get to the brink" (more often EA-adjacent than centrally "EA", I think) are currently producing costs that outweigh the benefits.

Some relatively niche and theoretical agent-foundations-ish research directions might yield capabilities advances too, and I feel much more positive about those cases. I’m guessing it won’t work, but it’s the kind of research that seems positive-EV to me and that I’d like to see a larger network of researchers tackling, provided that they avoid publishing large advances that are especially likely to shorten AGI timelines.

The main reasons I feel more positive about the agent-foundations-ish cases I know about are:

  • The alignment progress in these cases appears to me to be much more serial, compared to the vast majority of alignment work the field outputs today.
  • I’m more optimistic about the total amount of alignment progress we’d see in the worlds where agent-foundations-ish research so wildly exceeded my expectations that it ended up boosting capabilities. Better understanding optimization in this way really would seem to me to take a significant bite out of the capabilities generalization problem, unlike most alignment work I’m aware of.
  • The kind of people working on agent-foundations-y work aren’t publishing new ML results that break SotA. Thus I consider it more likely that they’d avoid publicly breaking SotA on a bunch of AGI-relevant benchmarks given the opportunity, and more likely that they’d only direct their attention to this kind of intervention if it seemed helpful for humanity’s future prospects.[1]
  • Relatedly, the energy and attention of ML is elsewhere, so if they do achieve a surprising AGI-relevant breakthrough and accidentally leak bits about it publicly, I put less probability on safety-unconscious ML researchers rushing to incorporate it.

I’m giving this example not to say “everyone should go do agent-foundations-y work exclusively now!”. I think it’s a neglected set of research directions that deserves far more effort, but I’m far too pessimistic about it to want humanity to put all its eggs in that basket.

Rather, my hope is that this example clarifies that I’m not saying “doing alignment research is bad” or even “all alignment research that poses a risk of advancing capabilities is bad”. I think that in a large majority of scenarios where humanity’s long-term future goes well, it mainly goes well because we made major alignment progress over the coming years and decades.[2] I don’t want this post to be taken as an argument against what I see as humanity’s biggest hope: figuring out AGI alignment.


 

  1. ^

    On the other hand, weirder research is more likely to shorten timelines a lot, if it shortens them at all. More mainstream research progress is less likely to have a large counterfactual impact, because it’s more likely that someone else has the same idea a few months or years later.

    “Low probability of shortening timelines a lot” and “higher probability of shortening timelines a smaller amount” both matter here, so I advocate that both niche and mainstream researchers be cautious and deliberate about publishing potentially timelines-shortening work.

  2. ^

    "Decades" would require timelines to be longer than my median. But when I condition on success, I do expect we have more time.

Comments8


Sorted by Click to highlight new comments since:

Another problem with the differential development argument is that even if you buy that “alignment can be solved”, it’s not like it’s a vaccine you can apply to all AI so it all suddenly turns beneficial. Other people, companies, nations will surely continue to train and deploy AI models, and why would they all apply your alignment principles or tools?

I heard two arguments in response to this concern: that (1) the first aligned AGI will then kill off all other forms of AGI and make all AI related problems go away and that (2) there are more good people than bad people in the world so once techniques for alignment become available, everyone will naturally adopt them. Both of these seem like fairy tales to me.

In other words the premise that any amount of AI capabilities research is OK so long as we “solve alignment” has serious issues, and you don’t even have to believe in AGI for this to bother you.

Re 1) this relates to the strategy stealing assumption: your aligned AI can use whatever strategy unaligned AIs use to maintain and grow their power. Killing the competition is one strategy but there are many others including defensive actions and earning money / resources.

Edit: I implicitly said that it's okay to have unaligned AIs as long as you have enough aligned ones around. For example we may not need aligned companies if we have (minimally) aligned government+law enforcement.

I don't think the strategy-stealing assumption holds here: it's pretty unlikely that we'll build a fully aligned 'sovereign' AGI even if we solve alignment; it seems easier to make something corrigible / limited instead, ie something that is by design less powerful than would be possible if we were just pushing capabilities.

I don't mean to imply that we'll build a sovereign AI (I doubt it too).

Corrigible is more what I meant. Corrigible but not necessarily limited. Ie minimally intent aligned AIs which won't kill you but by the strategy stealing assumption can still compete with unaligned AIs.

I'm curious to dig into this a bit more, and hear why you think these seem like fairy tales to you (I'm not saying that I disagree...).
  
I wonder if this comes down to different ideas of what "solve alignment" means (I see you put it in quotes...) 

1) Are you perhaps thinking that realistic "solutions to alignment" will carry a significant alignment tax?  Else why wouldn't ~everyone adopt alignment techniques (that align AI systems with their preferences/values)?

2) Another source of ambiguity: there are a lot of different things people mean by "alignment", including:
* AI is aligned with objectively correct values
* AI is aligned with a stakeholder and consistently pursues their interests
* AI does a particular task as intended/expected
Is one of these in particular (or something else) that you have in mind here?

I agree that it's not trivial to assume everyone will use aligned AI.

Let's suppose the goal of alignment research is to make aligned AI equally easy/cheap to build as unaligned AI. I. e. no addition cost. If we then suppose aligned AI also has a nonzero benefit, people are incentivized to use it.

The above seems to be the perspective in this alignment research overview https://www.effectivealtruism.org/articles/paul-christiano-current-work-in-ai-alignment.

More ink could be spilled on whether aligning AI has a nonzero commercial benefit. I feel that efforts like prompting and Instruct GPT are suggestive. But this may not apply to all alignment efforts.

How does one translate mathematical/high-level agenty-foundations guidelines into code/instructions that an RL agent (or any AI agent, including a scaling laws one) can follow?

Bandwagoning onto this sensible post, another problem with this argument is that differential technological development is very fuzzy to reason about, since most of the mechanisms by which it could advance alignment are things that haven't happened yet. This means it's possible to reach any conclusion ("this work is good on net", "this work is bad on net") and motivated reasoning will make people want to reach the conclusion that the work they are doing is good on net. It's a classic case of suspicious and surprising convergence.

Curated and popular this week
 ·  · 1m read
 · 
This morning I was looking into Switzerland's new animal welfare labelling law. I was going through the list of abuses that are now required to be documented on labels, and one of them made me do a double-take: "Frogs: Leg removal without anaesthesia."  This confused me. Why are we talking about anaesthesia? Shouldn't the frogs be dead before having their legs removed? It turns out the answer is no; standard industry practice is to cut their legs off while they are fully conscious. They remain alive and responsive for up to 15 minutes afterward. As far as I can tell, there are zero welfare regulations in any major producing country. The scientific evidence for frog sentience is robust - they have nociceptors, opioid receptors, demonstrate pain avoidance learning, and show cognitive abilities including spatial mapping and rule-based learning.  It's hard to find data on the scale of this issue, but estimates put the order of magnitude at billions of frogs annually. I could not find any organisations working directly on frog welfare interventions.  Here are the organizations I found that come closest: * Animal Welfare Institute has documented the issue and published reports, but their focus appears more on the ecological impact and population decline rather than welfare reforms * PETA has conducted investigations and released footage, but their approach is typically to advocate for complete elimination of the practice rather than welfare improvements * Pro Wildlife, Defenders of Wildlife focus on conservation and sustainability rather than welfare standards This issue seems tractable. There is scientific research on humane euthanasia methods for amphibians, but this research is primarily for laboratory settings rather than commercial operations. The EU imports the majority of traded frog legs through just a few countries such as Indonesia and Vietnam, creating clear policy leverage points. A major retailer (Carrefour) just stopped selling frog legs after welfar
 ·  · 10m read
 · 
This is a cross post written by Andy Masley, not me. I found it really interesting and wanted to see what EAs/rationalists thought of his arguments.  This post was inspired by similar posts by Tyler Cowen and Fergus McCullough. My argument is that while most drinkers are unlikely to be harmed by alcohol, alcohol is drastically harming so many people that we should denormalize alcohol and avoid funding the alcohol industry, and the best way to do that is to stop drinking. This post is not meant to be an objective cost-benefit analysis of alcohol. I may be missing hard-to-measure benefits of alcohol for individuals and societies. My goal here is to highlight specific blindspots a lot of people have to the negative impacts of alcohol, which personally convinced me to stop drinking, but I do not want to imply that this is a fully objective analysis. It seems very hard to create a true cost-benefit analysis, so we each have to make decisions about alcohol given limited information. I’ve never had problems with alcohol. It’s been a fun part of my life and my friends’ lives. I never expected to stop drinking or to write this post. Before I read more about it, I thought of alcohol like junk food: something fun that does not harm most people, but that a few people are moderately harmed by. I thought of alcoholism, like overeating junk food, as a problem of personal responsibility: it’s the addict’s job (along with their friends, family, and doctors) to fix it, rather than the job of everyday consumers. Now I think of alcohol more like tobacco: many people use it without harming themselves, but so many people are being drastically harmed by it (especially and disproportionately the most vulnerable people in society) that everyone has a responsibility to denormalize it. You are not likely to be harmed by alcohol. The average drinker probably suffers few if any negative effects. My argument is about how our collective decision to drink affects other people. This post is not
 ·  · 5m read
 · 
Today, Forethought and I are releasing an essay series called Better Futures, here.[1] It’s been something like eight years in the making, so I’m pretty happy it’s finally out! It asks: when looking to the future, should we focus on surviving, or on flourishing? In practice at least, future-oriented altruists tend to focus on ensuring we survive (or are not permanently disempowered by some valueless AIs). But maybe we should focus on future flourishing, instead.  Why?  Well, even if we survive, we probably just get a future that’s a small fraction as good as it could have been. We could, instead, try to help guide society to be on track to a truly wonderful future.    That is, I think there’s more at stake when it comes to flourishing than when it comes to survival. So maybe that should be our main focus. The whole essay series is out today. But I’ll post summaries of each essay over the course of the next couple of weeks. And the first episode of Forethought’s video podcast is on the topic, and out now, too. The first essay is Introducing Better Futures: along with the supplement, it gives the basic case for focusing on trying to make the future wonderful, rather than just ensuring we get any ok future at all. It’s based on a simple two-factor model: that the value of the future is the product of our chance of “Surviving” and of the value of the future, if we do Survive, i.e. our “Flourishing”.  (“not-Surviving”, here, means anything that locks us into a near-0 value future in the near-term: extinction from a bio-catastrophe counts but if valueless superintelligence disempowers us without causing human extinction, that counts, too. I think this is how “existential catastrophe” is often used in practice.) The key thought is: maybe we’re closer to the “ceiling” on Survival than we are to the “ceiling” of Flourishing.  Most people (though not everyone) thinks we’re much more likely than not to Survive this century.  Metaculus puts *extinction* risk at about 4