3067 karma · Joined Nov 2017


Is there a particular part of my post that you disagree with? Or do you think the post is misleading? If so, how?

I think there are a lot of ways AI could go wrong, and "AIs dominating humans like how humans dominate animals" does not exhaust the scope of potential issues.

I really don’t get the “simplicity” arguments for fanatical maximising behaviour. When you consider subgoals, it seems that secretly plotting to take over the world will obviously be much more complicated? Do you have any idea how much computing power and subgoals it takes to try and conquer the entire planet? 

I think this is underspecified because:

  1. The hard part of taking over the whole planet is being able to execute a strategy that actually works in a world with other agents (who are themselves vying for power), rather than the compute or complexity cost of having the subgoal of taking over the world.
  2. The difficulty of taking over the world depends on the level of technology, among other factors. For example, taking over the world in the year 1000 AD was arguably impossible because you just couldn't manage an empire that large. Taking over the world in 2024 is perhaps more feasible, since we're already globalized, but it's still essentially an ~impossible task.

My best guess is that if some agent "takes over the world" in the future, it will look more like "being elected president of Earth" than "secretly plotting to release a nanoweapon at a precise time, killing everyone else simultaneously". That's because in the latter scenario, by the time some agent has access to super-destructive nanoweapons, the rest of the world likely has access to similarly powerful technology, including potential defenses against these nanoweapons (or their own nanoweapons that they can threaten you with).

This seems like an isolated demand for rigor to me. I think it's fine to say something is "no evidence" when, speaking pedantically, it's only a negligible amount of evidence.

I think that's fair, but I'm still admittedly annoyed at this usage of language. I don't think it's an isolated demand for rigor because I have personally criticized many other similar uses of "no evidence" in the past.

I think future AIs will be much more aligned than humans, because we will have dramatically more control over them than over humans.

That's plausible to me, but I'm perhaps not as optimistic as you are. I think AIs might easily end up becoming roughly as misaligned with humans as humans are with each other, at least eventually.

We did not intend to deny that some AIs will be well-described as having goals.

If you agree that AIs will intuitively have goals that they robustly pursue, I guess I'm just not sure why you thought it was important to rebut goal realism? You wrote,

The goal realist perspective relies on a trick of language. By pointing to a thing inside an AI system and calling it an “objective”, it invites the reader to project a generalized notion of “wanting” onto the system’s imagined internal ponderings, thereby making notions such as scheming seem more plausible.

But I think even on a reductionist view, it can make sense to talk about AIs "wanting" things, just like it makes sense to talk about humans wanting things. I'm not sure why you think this distinction makes much of a difference.

(I might write a longer response later, but I thought it would be worth writing a quick response now.)

I have a few points of agreement and a few points of disagreement:


  • The strict counting argument seems very weak as an argument for scheming, essentially for the reason you identified: it relies on a uniform prior over AI goals, which seems like a really bad model of the situation.
  • The hazy counting argument—while stronger than the strict counting argument—still seems like weak evidence for scheming. One way of seeing this is, as you pointed out, to show that essentially identical arguments can be applied to deep learning in different contexts that nonetheless contradict empirical evidence.

Some points of disagreement:

  • I think the title overstates the strength of the conclusion. The hazy counting argument seems weak to me but I don't think it's literally "no evidence" for the claim here: that future AIs will scheme.
  • I disagree with the bottom-line conclusion: "we should assign very low credence to the spontaneous emergence of scheming in future AI systems—perhaps 0.1% or less"
    • I think it's too early to be very confident in sweeping claims about the behavior or inner workings of future AI systems, especially in the long-run. I don't think the evidence we have about these things is very strong right now.
    • One caveat: I think the claim here is vague. I don't know what counts as "spontaneous emergence", for example. And I don't know how to operationalize AI scheming. I personally think scheming comes in degrees: some forms of scheming might be relatively benign and mild, and others could be more extreme and pervasive.
    • Ultimately I think you've only rebutted one argument for scheming—the counting argument. A more plausible argument for scheming, in my opinion, is simply that the way we train AIs—including the data we train them on—could reward AIs that scheme over AIs that are honest and don't scheme. Actors such as AI labs have strong incentives to be vigilant against these types of mistakes when training AIs, but I don't expect people to come up with perfect solutions. So I'm not convinced that AIs won't scheme at all.
    • If by "scheming" all you mean is that an agent deceives someone in order to get power, I'd argue that many humans scheme all the time. Politicians routinely scheme, for example, by pretending to have values that are more palatable to the general public, in order to receive votes. Society bears some costs from scheming, and pays costs to mitigate the effects of scheming. Combined, these costs are not crazy-high fractions of GDP; but nonetheless, scheming is a constant fact of life.
    • If future AIs are "as aligned as humans", then AIs will probably scheme frequently. I think an important question is how intensely and how pervasively AIs will scheme; and thus, how much society will have to pay as a result of scheming. If AIs scheme way more than humans, then this could be catastrophic, but I haven't yet seen any decent argument for that theory.
    • So ultimately I am skeptical that AI scheming will cause human extinction or disempowerment, but probably for different reasons than the ones in your essay: I think the negative effects of scheming can probably be adequately mitigated by paying some costs even if it arises.
  • I don't think you need to believe in any strong version of goal realism in order to accept the claim that AIs will intuitively have "goals" that they robustly attempt to pursue. It seems pretty natural to me that people will purposely design AIs that have goals in an ordinary sense, and some of these goals will be "misaligned" in the sense that the designer did not intend for them. My relative optimism about AI scheming doesn't come from thinking that AIs won't robustly pursue goals, but instead comes largely from my beliefs that:
    • AIs, like all real-world agents, will be subject to constraints when pursuing their goals. These constraints include things like the fact that it's extremely hard and risky to take over the whole world and then optimize the universe exactly according to what you want. As a result, AIs with goals that differ from what humans (and other AIs) want will probably end up compromising and trading with other agents instead of pursuing world takeover. This is a benign failure and doesn't seem very bad.
    • The amount of investment we put into mitigating scheming is not an exogenous variable, but instead will respond to evidence about how pervasive scheming is in AI systems, and how big of a deal AI scheming is. And I think we'll accumulate lots of evidence about the pervasiveness of AI scheming in deep learning over time (e.g. via experiments with model organisms of alignment), allowing us to set the level of investment in AI safety at a reasonable level as AI gets incrementally more advanced.

      If we experimentally determine that scheming is very important and very difficult to mitigate in AI systems, we'll probably respond by spending a lot more money on mitigating scheming, and vice versa. In effect, I don't think we have good reasons to think that society will spend a suboptimal amount on mitigating scheming.

Superhuman agents ruthlessly optimize for a reward at the expense of anything else we might care about. The more capable the agent and the more ruthless the optimizer, the more extreme the results.

To the extent this is an empirical claim about superhuman agents we are likely to build and not merely a definition, it needs to be argued for, not merely assumed. "Ruthless" optimization could indeed be bad for us, but current AIs don't seem well-described as ruthless optimizers.

Instead, LLMs appear corrigible more-or-less by default, and there don't appear to be strong incentives to purposely make AIs that are ruthless agents if doing so predictably harmed us.

(There's a more plausible argument that we have strong incentives to build non-ruthless agents, but these agents, by virtue of not being ruthless, seem much less risky.)

To the extent superhuman agents are simply ruthless by definition, I'd argue that this statement is largely irrelevant, since we don't seem likely to want to build ruthless agents that would predictably harm us. It's possible such agents could come about by accident, but again, this premise needs to be argued for, not merely assumed.

Some people seem to think the risk from AI comes from AIs gaining dangerous capabilities, like situational awareness. I don't really agree. I view the main risk as simply arising from the fact that AIs will be increasingly integrated into our world, diminishing human control.

Under my view, the most important thing is whether AIs will be capable of automating economically valuable tasks, since this will prompt people to adopt AIs widely to automate labor. If AIs have situational awareness, but aren't economically important, that's not as concerning.

The risk is not so much that AIs will suddenly and unexpectedly take control of the world. It's that we will voluntarily hand over control to them anyway, and we want to make sure this handoff is handled responsibly. 

An untimely coup, while possible, is not necessary.

Barnett argues that future technology will be primarily used to satisfy economic consumption (aka selfish desires). That even seems plausible to me; however, I’m not that concerned about this causing huge amounts of future suffering (at least compared to other s-risks). It seems to me that most humans place non-trivial value on the welfare of (neutral) others such as animals. Right now, this preference (for most people) isn’t strong enough to outweigh the selfish benefits of eating meat. However, I’m relatively hopeful that future technology would make such types of tradeoffs much less costly.

At the same time as technological progress makes it less selfishly costly to be kind to animals, it could become more selfishly enticing to commit other moral tragedies. For example, it could hypothetically turn out, just as a brute empirical fact, that the most effective way of aligning AIs is to treat them terribly in some way, e.g. by brainwashing them or subjecting them to painful stimuli.

More generally, technological progress doesn't seem to asymmetrically make people more moral. Factory farming, as a chief example, allowed people to satisfy their desire for meat more cost-effectively, but at a larger moral cost compared to what existed previously. Even if factory farming is eventually replaced with something humane, there doesn't seem to be an obvious general trend here.

The argument you allude to that I find most plausible here is the idea that incidental s-risks as a byproduct of economic activity might not be as bad as some other forms of s-risks. But at the very least, incidental s-risks seem plausibly quite bad in expectation regardless.

In some circles that I frequent, I've gotten the impression that a decent fraction of existing rhetoric around AI has gotten pretty emotionally charged. And I'm worried about the presence of what I perceive as demagoguery regarding the merits of AI capabilities and AI safety. Out of a desire to avoid calling out specific people or statements, I'll just discuss a hypothetical example for now.

Suppose an EA says, "I'm against OpenAI's strategy for straightforward reasons: OpenAI is selfishly gambling everyone's life in a dark gamble to make themselves immortal." Would this be a true, non-misleading statement? Would this statement likely convey the speaker's genuine beliefs about why they think OpenAI's strategy is bad for the world?

To begin to answer these questions, we can consider the following observations:

  1. AI powerful enough to end the world would presumably also be powerful enough to do lots of incredibly positive things, such as reducing global mortality and curing diseases. By delaying AI, we are therefore equally "gambling everyone's life" by forcing people to face ordinary mortality.
  2. Selfish motives can be, and frequently are, aligned with the public interest. For example, Jeff Bezos was very likely motivated by selfish desires in his accumulation of wealth, but building Amazon nonetheless benefitted millions of people in the process. Such win-win situations are common in business, especially when developing technologies.

Because AI has the potential to both pose great risks and deliver great benefits, it seems to me that there are plenty of plausible pro-social arguments one can give for favoring OpenAI's strategy of pushing forward with AI. Therefore, it seems pretty misleading to me to frame their mission as a dark and selfish gamble, at least on a first impression.

Here's my point: Depending on the speaker, I frequently think their actual reason for being against OpenAI's strategy is not that they think OpenAI is undertaking a dark, selfish gamble. Instead, it's often just standard strong longtermism. A less misleading statement of their view would go something like this:

"I'm against OpenAI's strategy because I think potential future generations matter more than the current generation of people, and OpenAI is endangering future generations in their gamble to improve the lives of people who currently exist."

I claim this statement would—at least in many cases—be less misleading than the other statement because it captures a major genuine crux of the disagreement: whether you think potential future generations matter more than currently-existing people.

This statement also omits the "selfish" accusation, which I think is often just a red herring designed to mislead people: we don't normally accuse someone of being selfish when they do a good thing, even if the accusation is literally true.

(There can, of course, be further cruxes, such as your p(doom), your timelines, your beliefs about the normative value of unaligned AIs, and so on. But at the very least, a longtermist preference for future generations over currently existing people seems like a huge, actual crux that many people have in this debate, when they work through these things carefully together.)

Here's why I care about discussing this. I admit that I care a substantial amount—not overwhelming, but it's hardly insignificant—about currently existing people. I want to see people around me live long, healthy and prosperous lives, and I don't want to see them die. And indeed, I think advancing AI could greatly help currently existing people. As a result, I find it pretty frustrating to see people use what I perceive to be essentially demagogic tactics designed to sway people against AI, rather than plainly stating their cruxes about why they actually favor the policies they do. 

These allegedly demagogic tactics include:

  1. Highlighting the risks of AI to argue against development while systematically omitting the potential benefits, hiding a more comprehensive assessment of your preferred policies.
  2. Highlighting random, extraneous drawbacks of AI development that you wouldn't ordinarily care much about in other contexts when discussing innovation, such as potential for job losses from automation. This type of rhetoric looks a lot like "deceptively searching for random arguments designed to persuade, rather than honestly explain one's perspective" to me, a lot of the time.
  3. Conflating, or at least strongly associating, the selfish motives of people who work at AI firms with their allegedly harmful effects. This rhetoric plays on public prejudices by appealing to a widespread but false belief that selfish motives are usually suspicious, or can't translate into pro-social results. In fact, there is no contradiction in the idea that most people at OpenAI are in it for the money, status, and fame, while what they're doing is also good for the world, and they genuinely believe that.

I'm against these tactics for a variety of reasons, but one of the biggest reasons is that they can, in some cases, indicate a degree of dishonesty, depending on the context. And I'd really prefer EAs to focus on trying to be almost-maximally truth-seeking in both their beliefs and their words.

Speaking more generally—to drive one of my points home a little more—I think there are roughly three possible views you could have about pushing for AI capabilities relative to pushing for pausing or more caution:

  1. Full-steam ahead view: We should accelerate AI at any and all costs. We should oppose any regulations that might impede AI capabilities, and embark on a massive spending spree to accelerate AI capabilities.
  2. Full-safety view: We should try as hard as possible to shut down AI right now, and thwart any attempt to develop AI capabilities further, while simultaneously embarking on a massive spending spree to accelerate AI safety.
  3. Balanced view: We should support a substantial mix of both safety and acceleration efforts, attempting to carefully balance the risks and rewards of AI development to ensure that we can seize the benefits of AI without bearing intolerably high costs.

I tend to think most informed people, when pushed, advocate the third view, albeit with wide disagreement about the right mix of support for safety and acceleration. Yet, on a superficial level (the level of rhetoric), I find that the first and second views are surprisingly common. On this level, I tend to find e/accs in the first camp, and a large fraction of EAs in the second camp.

But if your actual beliefs are something like the third view, I think that's an important fact to emphasize in honest discussions about what we should do with AI. If your rhetoric is consistently aligned with (1) or (2) but your actual beliefs are aligned with (3), I think that can often be misleading. And it can be especially misleading if you're trying to publicly paint other people in the same camp—the third one—as somehow having bad motives merely because they advocate a moderately higher mix of acceleration over safety efforts than you do, or vice versa.

I think OpenAI doesn't actually advocate a "full-speed ahead approach" in a strong sense. A hypothetical version of OpenAI that advocated a full-speed ahead approach would immediately gut its safety and preparedness teams, advocate subsidies for AI, and argue against any and all regulations that might impede their mission.

Now, of course, there might be political reasons why OpenAI doesn't come out and do this. They care about their image, and I'm not claiming we should take all their statements at face value. But another plausible theory is simply that OpenAI leaders care about both acceleration and safety. In fact, caring about both safety and acceleration seems quite rational from a purely selfish perspective.

I claim that such a stance wouldn't actually be much different from the allegedly "ordinary" view that I described previously: that acceleration, rather than pausing or shutting down AI, can be favored in many circumstances.

OpenAI might be less risk-averse than the general public, but in that case we're talking about a difference in degree here, not a qualitative difference in motives.

I think "if you believe the probability that a technology will make humanity go extinct with a probability of 1% or more, be very very cautious" would be endorsed by a large majority of the general population & intellectual 'elite'.

I'm not sure we disagree. A lot seems to depend on what is meant by "very very cautious". If it means shutting down AI as a field, I'm pretty skeptical. If it means regulating AI, then I agree, but I also think Sam Altman advocates regulation too.

I agree the general population would probably endorse the statement "if a technology will make humanity go extinct with a probability of 1% or more, be very very cautious" if given to them in a survey of some kind, but I think this statement is vague, and somewhat misleading as a frame for how people would think about AI if they were given more facts about the situation.

Firstly, we're not merely talking about any technology here; we're talking about a technology that has the potential both to disempower humans and to make their lives dramatically better. Almost every technology has risks as well as benefits. Probably the most common method people use when deciding whether to adopt a technology themselves is to check whether the risks outweigh the benefits. Just looking at the risks alone gives a misleading picture.

The relevant statistic is the risk-to-benefit ratio, and here it's really not obvious that most people would endorse shutting down AI if they were aware of all the facts. Yes, the risks are high, but so are the benefits.

If elites were made aware of both the risks and the benefits from AI development, most of them seem likely to want to proceed cautiously, rather than not proceed at all, or pause AI for many years, as many EAs have suggested. To test this claim empirically, we can just look at what governments are already doing with regard to AI risk policy, after having been advised by experts; and as far as I can tell, all of the relevant governments are substantially interested in both innovation and safety regulation.

Secondly, there's a persistent and often large gap between what people say through their words (e.g. when answering surveys) and what they actually want as measured by their behavior. For example, plenty of polling has indicated that a large fraction of people are very cautious regarding GMOs, but in practice most people are willing to eat GM foods happily without much concern. People are often largely thoughtless when answering many types of abstract questions posed to them, especially about topics they have little knowledge about. And this makes sense, because their responses typically have almost no impact on anything that might immediately or directly impact them. Bryan Caplan has discussed these issues in surveys and voting systems before.
