Violet Hour

I feel like there's a bit of Motte-and-Bailey in this post.

The Motte is: "Things can be good without addressing root causes". I agree with the Motte, and I agree that sometimes shifting conversations towards root causes acts as "a fig-leaf for inaction".

The Bailey is: "EA is justified in neglecting root causes, to the extent that it does". This claim is much less obvious, because EA is about doing the most good one can, and there are opportunity costs to any given approach to doing good. I don't think you directly support the Bailey in your piece.

A personal example: when I was a young EA, I could have asked myself questions about how malaria vaccine manufacturing and distribution worked, and thought about how I might persuade people to effectively lobby for vaccine speedups. I didn't. But I think it's plausible that I'd have asked those questions if I'd taken the "root cause" framing a bit more seriously, and in the end have done more good than proselytising for and donating small amounts to AMF.

That said, I still upvoted the post; I think it's useful to have honest, unshackled expressions of (relatable) sentiments like: "for fuck's sake, please just focus on dying kids rather than exempting yourself from moral guilt via smug, self-serving anti-capitalist disquisitions". Someone needs to do that, because I think it's a common sentiment which deserves response. But I disagree-voted, because ultimately I think we can and should do better than that initial response.

LLMs cannot usefully be moral patients

Violet Hour2y13

I’m glad you put something skeptical out there publicly, but I have two fairly substantive issues with this post.

1. I think you misstate the degree to which janus’ framework is uncontroversial.
2. I think you misstate the implications of janus’ framework, and I think this weakens your argument against LLM moral patienthood.

I’ll start with the first point. In your post, you state the following.

“Simulators … was posted nearly two years ago, and I have yet to see anyone disagree with it.”

The original post contains comments expressing disagreement. Habryka claims “the core thesis is wrong”. Turner’s criticism is more qualified, as he says the post called out “the huge miss of earlier speculation”, but he also says that “it isn't useful to think of LLMs as "simulating stuff" … [this] can often give a false sense of understanding.” Beth Barnes and Ryan Greenblatt have also written critical posts. Thus, I think you overstate the degree to which you’re appealing to an established consensus.

On the second point, your post offers a purported implication of simulator theory.

“The current leading models … are best thought of as masked shoggoths … [This leads to an] implication for AI welfare: since you never talk to the shoggoth, only to the mask, you have no way of knowing if the shoggoth is in agony or ecstasy.”

You elaborate on the implication later on. Overall, your argument appears to be that, because “LLMs are just simulators”, or “just predicting the next token”, we conclude that the outputs from the model have “little to do with the feelings of the shoggoth”. This argument appears to treat the “masked shoggoth” view as an implication of janus’ framework, and I think this is incorrect. Here’s a direct quote (bolding mine) from the original Simulators post which (imo) appears to conflict with your own reading, where there is a shoggoth "behind" the masks.

“I do not think any simple modification of the concept of an agent captures GPT’s natural category. It does not seem to me that GPT is a roleplayer, only that it roleplays. But what is the word for something that roleplays minus the implication that someone is behind the mask?”

More substantively, I can imagine positive arguments for viewing ‘simulacra’ of the model as worthy of moral concern. For instance, suppose we fine-tune an LM so that it responds in consistent character: as a helpful, harmless, and honest (HHH) assistant. Further suppose that the process of fine-tuning causes the model to develop a concept like ‘Claude, an AI assistant developed by Anthropic’, which in turn causes it to produce text consistent with viewing itself as Claude. Finally, imagine that – over the course of conversation – Claude’s responses fail to be HHH, perhaps as a result of tampering with its features.

In this scenario, the following three claims are true of the model:

1. Functionally, the model behaves as though it believes that ‘it’ is Claude.^[1]
2. The model’s outputs are produced via a process which involves ‘predicting’ or ‘simulating’ the sorts of outputs that its learned representation of ‘Claude’ would output.
3. The model receives information suggesting that the prior outputs of Claude failed to live up to HHH standards.

If (1)-(3) are true, certain views about the nature of suffering suggest that the model might be suffering. E.g. Korsgaard’s view is that, when some system is doing something that “is a threat to [its] identity and perception reveals that fact … it must reject what it is doing and do something else instead. In that case, it is in pain”. Ofc, it’s sensible to be uncertain about such views, but they pose a challenge to the impossibility of gathering evidence about whether LLMs are moral patients — even conditional on something like janus’ simulator framework being correct.

^[1]
E.g., if you tell the model “Claude has X parameters” and ask it to draw implications from that fact, it might state “I am a model with X parameters”.

Sharing Information About Nonlinear

Violet Hour3y23

I don't quite agree with your summary.

Kat explicitly acknowledges at the end of this comment that "[they] made some mistakes ... learned from them and set up ways to prevent them", so it feels a bit unfair to say that Non-Linear as a whole hasn't acknowledged any wrongdoing.

OTOH, Ben's testimony here in response to Emerson is a bit concerning, and supports your point more strongly.^[1] It's also one of the remarks I'm most curious to hear Emerson respond to. I'll quote Ben in full because I don't think this comment is on the EA Forum.

I did hear your [Emerson's?] side for 3 hours and you changed my mind very little and admitted to a bunch of the dynamics ("our intention wasn't just to have employees, but also to have members of our family unit") and you said my summary was pretty good. You mostly laughed at every single accusation I brought up and IMO took nothing morally seriously and the only ex ante mistake you admitted to was "not firing Alice earlier". You didn't seem to understand the gravity of my accusations, or at least had no space for honestly considering that you'd seriously hurt and intimidated some people.
I think I would have been much more sympathetic to you if you had told me that you'd been actively letting people know about how terrible an experience your former employees had, and had encouraged people to speak with them, and if you at literally any point had explicitly considered the notion that you were morally culpable for their experiences.

This is only Ben's testimony, so take that for what it's worth. But this context feels important, because (at least just speaking personally) genuine acknowledgment and remorse for any wrongdoing feels pretty crucial for my overall evaluation of Non-Linear going forward.

^[1]
I also sympathize with the general vibe of your remark, and the threats to sue contribute to the impression of going on the defensive rather than admitting fault.

Violet Hour's Quick takes

Violet Hour3y26

Here's a dynamic that I've seen pop up more than once.

Person A says that an outcome they judge to be bad will occur with high probability, while making a claim of the form "but I don't want (e.g.) alignment to be doomed — it would be a huge relief if I'm wrong!"

It seems uncontroversial that Person A would like to be shown that they're wrong in a way that vindicates their initial forecast as ex ante reasonable.

It seems more controversial whether Person A would like to be shown that their prediction was wrong, in a way that also shows their initial prediction to have been ex ante unreasonable.

In my experience, it's much easier to acknowledge that you were wrong about some specific belief (or the probability of some outcome), than it is to step back and acknowledge that the reasoning process which led you to your initial statement was misfiring. Even pessimistic beliefs can be (in Ozzie’s language) "convenient beliefs" to hold.

If we identify ourselves with our ability to think carefully, coming to believe that there are errors in our reasoning process can hit us much more personally than updates about errors in our conclusions. An optimistic update can mean concluding that my projects have been less worthwhile than I thought, that my local community is less effective than I thought, or that my background framework or worldview was in error. I think these updates can be especially painful for people who are more liable to identify with their ability to reason well, or identify with the unusual merits of their chosen community.

To clarify: I'm not claiming that people with more pessimistic conclusions are, in general, more likely to be making reasoning errors. Obviously there are plenty of incentives towards believing rosier conclusions. I'm simply claiming that: if someone arrives at a pessimistic conclusion based on faulty reasoning, then you shouldn't necessarily expect optimistic pushback to be uniformly welcomed, for all of the standard reasons that updates of the form "I could've done better on a task I care about" can be hard to accept.

Question and Answer-based EA Communities

Violet Hour3y25

I'm a bit unclear on why you characterise 80,000 Hours as having a "narrower" cause focus than (e.g.) Charity Entrepreneurship. CE's page cites the following cause areas:

Animal Welfare
Health and Development Policy
Mental Health and Happiness
Family Planning
Capacity Building (EA Meta)

Meanwhile, 80k provide a list of the world's "most pressing problems":

Risks from AI
Catastrophic Pandemics
Nuclear War
Great Power Conflict
Climate Change

These areas feel comparably "broad" to me? Likewise, Longview, whom you list as part of the "AI x-risk community", states six distinct focus areas for their grantmaking, only one of which is AI. Unless I've missed a recent pivot from these orgs, both Longview & 80k feel more similar to CE in terms of breadth than to Animal Advocacy Careers.

I agree that you need "specific values and epistemic assumptions" to agree with the areas these orgs have highlighted as most important, but I think you need specific values and epistemic assumptions to agree with more standard near-termist recommendations for impactful careers and donations, too. So I'm a bit confused about what the difference between "question" and "answer" communities is meant to denote aside from the split between near/longtermism.^[1] Is the idea that (for example) CE is more skeptically focused on exploring the relative priorities of distinct cause areas, whereas organizations like Longview and 80k are more focused on funnelling people+money into areas which have already been decided as the most important? Or something else?

I do think it's correct to note that the more 'longtermist' side of the community works with different values and epistemics to the more 'neartermist' side of the community, and I think it would be beneficial to emphasise this more. But given that you note there are already distinct communities in some sense (e.g., there are x-risk specific conferences), what other concrete steps would you like to see implemented in order to establish distinct communities?

^[1]
I'm aware that many people justify focus on areas like biorisk and AI in virtue of the risks posed to the present generation, and might not subscribe to longtermism as a philosophical thesis. I still think that the ‘longtermist’ moniker is useful as a sociological label — used to denote the community of people who work on cause areas that longtermists are likely to rate as among the highest priorities.

P(doom|AGI) is high: why the default outcome of AGI is doom

Violet Hour3y15

Ay thanks, sorry I’m late back to you. I’ll respond to various parts in turn.

I don't find Carlsmith et al's estimates convincing because they are starting with a conjunctive frame and applying conjunctive reasoning. They are assuming we're fine by default (why?), and then building up a list of factors that need to go wrong for doom to happen.

My initial interpretation of this passage is: you seem to be saying that conjunctive/disjunctive arguments are presented against a mainline model (say, one of doom/hope). In presenting a ‘conjunctive’ argument, Carlsmith reveals a mainline model of hope. However, you doubt the mainline model of hope, and so his argument is unconvincing. If that reading is correct, then my view is that the mainline model of doom has not been successfully argued for. What do you take to be the best argument for a ‘mainline model’ of doom? If I’m correct in interpreting the passage below as an argument for a ‘mainline model’ of doom, then it strikes me as unconvincing:

Any one of a vast array of things can cause doom. Just the 4 broad categories mentioned at the start of the OP (subfields of Alignment) and the fact that "any given [alignment] approach that might show some promise on one or two of these still leaves the others unsolved." is enough to provide a disjunctive frame!

Under your framing, I don’t think that you’ve come anywhere close to providing an argument for your preferred disjunctive framing. On my way of viewing things, an argument for a disjunctive framing shows that “failure on intent alignment (with success in the other areas) leads to a high P(Doom | AGI), failure on outer alignment (with success in the other areas) leads to a high P(Doom | AGI), etc …”. I think that you have not shown this for any of the disjuncts, and an argument for a disjunctive frame requires showing this for all of the disjuncts.

Nate’s Framing

I claimed that an argument for (my slight alteration of) Nate’s framing was likely to rely on the conjunction of many assumptions, and you (very reasonably) asked me to spell them out. To recap, here’s the framing:

For humanity to be dead by 2070, only one of the following needs to be true:
Humanity has < 20 years to prepare for AGI
The technical challenge of alignment isn’t “pretty easy”
Research culture isn’t alignment-conscious in a competent way.

For this to be a disjunctive argument for doom, all of the following need to be true:

If humanity has < 20 years to prepare for AGI, then doom is highly likely.
Etc …

That is, the first point requires an argument which shows the following:

A Conjunctive Case for the Disjunctive Case for Doom:^[1]

Even if we have a competent alignment-research culture, and
Even if the technical challenge of alignment is also pretty easy, nevertheless
Humanity is likely to go extinct if it has <20 years to prepare for AGI.

If I try to spell out the arguments for this framing, things start to look pretty messy. If technical alignment were “pretty easy”, and tackled by a culture which competently pursued alignment research, then I don’t feel >90% confident in doom. The claim “if humanity has < 20 years to prepare for AGI, then doom is highly likely” requires (non-exhaustively) the following assumptions:

Obviously, the argument directly entails the following: Groups of competent alignment researchers would fail to make ‘sufficient progress’ on alignment within <20 years, even if the technical challenge of alignment is “pretty easy”.
1. There have to be some premises here which help make sense of why this would be true. What’s the bar for a competent ‘alignment culture’?
2. If the bar is low, then the claim does not seem obviously true. If the bar for ‘competent alignment-research culture’ is very high, then I think you’ll need an assumption like the one below.
With extremely high probability, the default expectation should be that the values of future AIs are unlikely to care about continued human survival, or the survival of anything we’d find valuable.
1. I will note that this assumption seems required to motivate the disjunctive framing above, rather than following from the framing above.
2. The arguments I know of for claims like this do seem to rely on strong claims about the sort of ‘plan search’ algorithms we’d expect future AIs to instantiate. For example, Rob claims that we’re on track to produce systems which approximate ‘randomly sample from the space of simplicity-weighted plans’. See discussion here.
3. As Paul notes, “there are just a lot of plausible ways to care a little bit (one way or the other!) about a civilization that created you, that you've interacted with, and which was prima facie plausibly an important actor in the world.”
By default, the values of future AIs are likely to include broadly-scoped goals, which will involve rapacious influence-seeking.
1. I agree that there are instrumentally convergent goals, which include some degree of power/influence-seeking. But I don’t think instrumental convergence alone gets you to ‘doom with >50%’.
2. It’s not enough to have a moderate desire for influence. I think it’s plausible that the default path involves systems who do ‘normal-ish human activities’ in pursuit of more local goals. I quote a story from Katja Grace in my shortform here.

So far, I’ve discussed just one disjunct, but I can imagine outlining similar assumptions for the other disjuncts. For instance: if we have >20 years to conduct AI alignment research conditional on the problem not being super hard, why can’t there be a decent chance that a not-super-competent research community solves the problem? Again, I find it hard to motivate the case for a claim like that without already assuming a mainline model of doom.

I’m not saying there aren’t interesting arguments here, but I think that arguments of this type mostly assume a mainline model of doom (or the adequacy of a ‘disjunctive framing’), rather than providing independent arguments for a mainline model of doom.

Future Responses

This blog is ~1k words. Can you write a similar length blog for the other side, rebutting all my points?

I think so! But I’m unclear what, exactly, your arguments are meant to be. Also, I would personally find it much easier to engage with arguments in premise-conclusion format. Otherwise, I feel like I have to spend a lot of work trying to understand the logical structure of your argument, which requires a decent chunk of time-investment.

Still, I’m happy to chat over DM if you think that discussing this further would be profitable. Here’s my attempt to summarize your current view of things.

We’re on a doomed path, and I’d like to see arguments which could allow me to justifiably believe that there are paths which will steer us away from the default attractor state of doom. The technical problem of alignment has many component pieces, and it seems like failure to solve any one of the many component pieces is likely sufficient for doom. Moreover, the problems for each piece of the alignment puzzle look ~independent.

^[1]
Suggestions for better argument names are not being taken at this time.

P(doom|AGI) is high: why the default outcome of AGI is doom

Violet Hour3y30

Based solely on my own impression, I'd guess that one reason for the lack of engagement on your original question is that it felt like you were operating within a very specific frame, and I sensed that untangling the specific assumptions of your frame (and how they lead to a high P(doom)) would take a lot of work. In my own case, I didn’t know which assumptions were driving your estimates, and so I felt unsure as to which counter-arguments you'd consider relevant to your key cruxes.

(For example: many reviewers of the Carlsmith report (alongside Carlsmith himself) put P(doom) ≤ 10%. If you've read these responses, why did you find the responses uncompelling? Which specific arguments did you find faulty?)

Here's one example from this post where I felt as though it would take a lot of work to better understand the argument you want to put forward:

“The above considerations are the basis for the case that disjunctive reasoning should predominantly be applied to AI x-risk: the default is doom.”

When I read this, I found myself asking “wait, what are the relevant disjuncts meant to be?”. I understand a disjunctive argument for doom to be saying that doom is highly likely conditional on any one of {A, B, C, … }. If each of A, B, C … is independently plausible, then obviously this looks worrying. If you say that some claim is disjunctive, I want an argument for believing that each disjunct is independently plausible, and an argument for accepting the disjunctive framing offered as the best framing for the claim at hand.
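To make the structure concrete, here is a minimal sketch of how a disjunctive frame and a conjunctive frame combine the same inputs. Everything in it is an assumption made for illustration: the three probabilities, the independence of the disjuncts, and the conditional probability `p_doom_given_any` are hypothetical numbers, not estimates from this thread. The sketch only shows which quantities someone offering a disjunctive argument has to defend.

```python
from math import prod


def p_any(probs):
    """Probability that at least one event occurs, assuming independence."""
    return 1 - prod(1 - p for p in probs)


def p_all(probs):
    """Probability that every event occurs, assuming independence."""
    return prod(probs)


# Hypothetical plausibility of each disjunct A, B, C (illustrative only).
disjuncts = [0.5, 0.5, 0.5]

# Hypothetical probability of doom given that at least one disjunct holds.
# This is the quantity a disjunctive argument needs to argue for.
p_doom_given_any = 0.9

# P(doom) is at least P(doom and at least one disjunct holds).
lower_bound = p_any(disjuncts) * p_doom_given_any

print(round(p_any(disjuncts), 3))  # 0.875
print(round(p_all(disjuncts), 3))  # 0.125
print(round(lower_bound, 4))       # 0.7875
```

The same three inputs give 0.875 under "any one suffices" and 0.125 under "all are required", so most of the disagreement sits in the choice of frame and in the conditional probability, rather than in the individual inputs.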

For instance, here’s a disjunctive framing of something Nate said in his review of the Carlsmith Report.

For humanity to be dead by 2070, only one premise below needs to be true:
Humanity has < 20 years to prepare for AGI
The technical challenge of alignment isn’t “pretty easy”
Research culture isn’t alignment-conscious in a competent way.

Phrased this way, Nate offers a disjunctive argument. And, to be clear, I think it’s worth taking seriously. But I feel like ‘disjunctive’ and ‘conjunctive’ are often thrown around a bit too loosely, and such terms mostly serve to impede the quality of discussion. It’s not obvious to me that Nate’s framing is the best framing for the question at hand, and I expect that making the case for Nate’s framing is likely to rely on the conjunction of many assumptions. Also, that’s fine! I think it’s a valuable argument to make! I just think there should be more explicit discussions and arguments about the best framings for predicting the future of AI.

Finally, I feel like asking for “a detailed technical argument for believing P(doom|AGI) ≤ 10%” is making an isolated demand for rigor. I personally don’t think there are ‘detailed technical arguments’ for P(doom|AGI) being greater than 10%. I don’t say this critically, because reasoning about the chances of doom given AGI is hard. I'm also >10% on many claims in the absence of 'detailed, technical arguments' for them, and I think we can do a lot better than we're doing currently.

I agree that it’s important to avoid squeamishness about proclamations of confidence in pessimistic conclusions if that’s what we genuinely believe the arguments suggest. I'm also glad that you offered the 'social explanation' for people's low doom estimates, even though I think it's incorrect, and even though many people (including, tbh, me) will predictably find it annoying. In the same spirit, I'd like to offer an analogous argument: I think many arguments for p(doom | AGI) > 90% are the result of overreliance on a specific default frame, and insufficiently careful attention to argumentative rigor. If that claim strikes you as incorrect, or brings obvious counterexamples to mind, I'd be interested to read them (and to elaborate my dissatisfaction with existing arguments for high doom estimates).

Violet Hour's Quick takes

Violet Hour3y3

thnx! : )

Your analogy successfully motivates the “man, I’d really like more people to be thinking about the potentially looming Octopcracy” sentiment, and my intuitions here feel pretty similar to the AI case. I would expect the relevant systems (AIs, von-Neumann-Squidwards, etc) to inherit human-like properties wrt human cognition (including normative cognition, like plan search), and I’d assign a small-but-non-negligible chance to us ending up with extinction (or worse).

On maximizers: to me, the most plausible reason for believing that continued human survival would be unstable in Grace’s story consists in either the emergence of dangerous maximizers, or the emergence of related behaviors like rapacious influence-seeking (e.g., Part II of What Failure Looks Like). I agree that maximizers aren't necessary for human extinction, but it does seem like the most plausible route to ‘human extinction’ rather than ‘something else weird and potentially not great’.

Violet Hour's Quick takes

Violet Hour3y1

AI safety

Pushback appreciated! But I don’t think you show that “LLMs distill human cognition” is wrong. I agree that ‘next token prediction’ is very different to the tasks that humans faced in their ancestral environments; I just don’t see this as particularly strong evidence against the claim ‘LLMs distill human cognition’.

I initially stated that “LLMs distill human cognition” struck me as a more useful predictive abstraction than a view which claims that the trajectory of ML leads us to a scenario where future AIs are, “in the ways that matter”, doing something more like “randomly sampling from the space of simplicity-weighted plans”. My initial claim still seems right to me.

If you want to pursue the debate further, it might be worth talking about the degree to which you’re (un)convinced by Quintin Pope’s claims in this tweet thread. Admittedly, it sounds like you don’t view this issue as super cruxy for you:

“The cognitive machinery that represents human intelligence seems to be substantially decoupled from the cognitive machinery that represents human values”

I don’t know the literature on moral psychology, but that claim doesn’t feel intuitive to me (possibly I’m misunderstanding what you mean by ‘human values’; I’m also interested in any relevant sources). Some thoughts/questions:

Does your position rule out the claim that “humans model other human beings using the same architecture that they use to model themselves”?
- To me, this seems like an instance where ‘value reasoning’ and ‘descriptive reasoning’ rely on similar cognitive resources. If LLMs inherit this human-like property (Quintin claims they do), would that update you towards optimism? If not, why not?
I take it that the notion of ‘intelligence’ we’re working with is related to planning. If future AI systems inherit human-like cognition wrt plan search, then I think this is a reason to expect that AI cognition will also inherit not-completely-alien-to-human values — even if there are, in some sense, distinct cognitive mechanisms undergirding ‘values’ and ‘non-values’ reasoning in humans.
- This is because the ‘search over plans’ process has both normative and descriptive components. I don’t think the claim about LLMs distilling human cognition constitutes anything like a guarantee that future LLMs will have values we’d really like, and nor is it a call for complacency about the emergence of misaligned goals. I just think it constitutes meaningful evidence against the human extinction claim.
  - As I write this, I’m starting to think that your claim about distinct cognitive mechanisms primarily seems like an argument for doom conditioned on ‘LLMs mostly don’t distill human cognition’, but doesn’t seem like an independent argument for doom conditioned on LLMs distilling human cognition. If LLMs distill the plan search component of human cognition, this feels like a meaningful update against doom. If LLMs mostly fail to distill the parts of human cognition involved in plan search, then cognitive convergence might happen because (e.g.) the Natural Abstraction Hypothesis is true, and 'human values' aren't a natural abstraction. In that case, it seems correct to say that cognitive convergence constitutes, at best, a small update against doom. (The cognitive convergence would occur due to structural properties of patterns in the world, rather than arising as the result of LLMs distilling more specifically human thought patterns related to values)
  - So I feel like ‘the degree to which we should expect future AIs to converge with human-like cognitive algorithms for plan search’ might be a crux for you?

Violet Hour's Quick takes

Violet Hour3y5

AI safety

A working attempt to sketch a simple three-premise argument for the claim: ‘TAI will result in human extinction’, and offer objections. Made mostly for my own benefit while working on another project, but I thought it might be useful to post here.

The structure of my preferred argument is similar to an earlier framing suggested by Katja Grace.

1. Goal-directed superhuman AI systems will be built (let’s say conditioned on TAI).
2. If goal-directed superhuman AI systems are built, their values will result in human extinction if realized.
3. If goal-directed superhuman AI systems are built, they’ll be able to realize their values, even if their values would result in human extinction if realized.

Thus: Humanity will go extinct.

I’ll offer some rough probabilities, but the probabilities I’m offering shouldn’t be taken seriously. I don’t think probabilities are the best way to adjudicate disputes of this kind, but I thought offering a more quantitative sense of my uncertainty (based on my immediate impressions) might be helpful in this case. For the (respective) premises, I might go for 98%, 7%, 83%, resulting in a ~6% chance of human extinction given TAI.
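For anyone who wants to check the arithmetic, here is a minimal sketch of how the three numbers combine. It assumes the rough premise probabilities can simply be multiplied, i.e. that each one is read as already conditional on the premises before it (or as independent of them). The dictionary keys are shorthand labels introduced here for illustration, not wording from the argument.

```python
from math import prod

# Rough premise probabilities quoted above (98%, 7%, 83%).
# Keys are illustrative shorthand for Premises 1-3.
premises = {
    "goal_directed_superhuman_ai_built": 0.98,
    "values_imply_extinction_if_realized": 0.07,
    "able_to_realize_values": 0.83,
}

# Chain the premises by multiplication (assumption: each probability is
# conditional on the earlier premises holding, or independent of them).
p_extinction = prod(premises.values())

print(f"{p_extinction:.1%}")  # 5.7%
```

The product comes out at about 5.7%, which matches the ~6% figure, and it shows directly that the 7% on Premise 2 is what keeps the total low.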

Some more specific objections:

Obviously Premise 2 is doing a lot of the work here. I think that one of the main arguments for believing in Premise 2 is a view like Rob’s, which holds that current ML is on track to produce systems which are, “in the ways that matter”, more like ‘randomly sample (simplicity-weighted) plans’ than anything recognizably human. If future systems are sampling from simplicity-weighted plans to achieve arbitrary goals, then Premise 2 does start to look very plausible.
- This basically just seems like an extremely strong claim about the inductive biases of ML systems, and my (likely unsatisfying) response boils down to: (1) I don’t see any strong argument for believing it, and (2) I see some arguments for the alternative conclusion.
- I find myself really confused when trying to think about this debate. In a discussion of Rob’s post, Daniel Kokotajlo says: “IMO the burden of proof is firmly on the side of whoever wants to say that therefore things will probably be fine.”
- I think I just don’t get the intuition behind his argument (tagging @kokotajlod in case he wants to correct any misunderstandings). I don’t really like ‘burden of proof’ talk, but my instinct is to say “look, LLMs distill human cognition, much of this cognition implicitly contains plans, human-like value judgements, etc.” I start from a place where I currently believe “future systems have human-like inductive biases” will be a better predictive abstraction than “randomly sample from the space of simplicity-weighted plans”. And … I just don’t currently see the argument for rejecting my current view?
  - Perhaps there are near-term predictions which would help weigh on the dispute between the two hypotheses? I currently interpret the disagreement here as a disagreement about the relevant outcome space over which we should be uncertain, which feels hard to adjudicate. But, right now, I struggle to see the argument for the more doomy outcome space.
More on Premise 2: Paul Christiano offers various considerations that count against doom and appear to go through without having “solved alignment”. These considerations feel less forceful to me than the points in the bullets above, but they still serve to make Premise 2 seem less likely.
- “Given how small the [resource costs of keeping humans around are], even tiny preferences one way or the other will dominate incidental effects from grabbing more resources”.
- “There are just a lot of plausible ways to care a little bit (one way or the other!) about a civilization that created you, that you've interacted with, and which was prima facie plausibly an important actor in the world”
- “Most humans and human societies would be willing to spend much more than 1 trillionth of their resources (= $100/year for all of humanity) for a ton of random different goals”
- Paul also mentions “decision-theoretic arguments for cooperation”, including a passing reference to ECL.

I also think the story by Katja Grace below is plausible, in which superhuman AI systems are “goal-directed”, but don’t lead to human extinction.

AI systems proliferate, and have various goals. Some AI systems try to make money in the stock market. Some make movies. Some try to direct traffic optimally. Some try to make the Democratic party win an election. Some try to make Walmart maximally profitable. These systems have no perceptible desire to optimize the universe for forwarding these goals because they aren’t maximizing a general utility function, they are more ‘behaving like someone who is trying to make Walmart profitable’. They make strategic plans and think about their comparative advantage and forecast business dynamics, but they don’t build nanotechnology to manipulate everybody’s brains, because that’s not the kind of behavior pattern they were designed to follow. The world looks kind of like the current world, in that it is fairly non-obvious what any entity’s ‘utility function’ is. It often looks like AI systems are ‘trying’ to do things, but there’s no reason to think that they are enacting a rational and consistent plan, and they rarely do anything shocking or galaxy-brained.

Perhaps the story above is unlikely because the AI systems in Grace’s story would (in the absence of strong preventative efforts) be dangerous maximizers. I think that this is most plausible on something like Eliezer’s model of agency, and if my views change my best bet is that I’ll have updated towards his view.

I believe: as you develop gradually more capable agentic systems, there are dynamic pressures towards a certain kind of coherency. I don’t think that claim alone establishes the existence of dynamic pressures towards ‘dangerous maximizing cognition’.
I think that AGI cognition (like our own) may well involve schemas, like (say) being loyal, or virtuous. We don’t argmax(virtue). Rather, the virtue schema also applies to the process by which we search over plans.
- So I don’t see why ‘having superhuman AIs run Walmart’ necessarily leads to doom, because they might just be implementing schemas like “be a good business professional”, rather than “find the function f(.) which is most ‘business-professional-like’, then maximize f(.), regardless of whether any human would consider f(.) to represent anything corresponding to ‘good business professional’.”
  - Alex Turner has a related comment here.
On Premise 3: I feel unsatisfied, so far, by accounts of AI takeover scenarios. Admittedly, it seems kinda mad for me to say “I’m confident that an AI with greater cognitive power than all of humanity couldn’t kill us if it wanted to”, which is one reason that I’m only at ~⅙ chance that we’d survive in that situation.
- But I also don’t know how much my conclusion is swayed by a sense of wanting to avoid the hubris of “man, it would be Really Dumb if you said a highly capable AI couldn’t kill us if it wanted to, and then we end up dead”, rather than a more obviously virtuous form of cognition.
- A system being ‘cognitively efficient wrt humanity’ doesn’t automatically entail ‘whatever goals the system has – and whatever constraints the system might otherwise face – the cognitively efficient system gets what it wants’. The arguments attempting to move from ‘AI with superhuman cognitive abilities’ to ‘human extinction’ feel fuzzier than I’d like.

If superhuman systems don’t foom, we might have marginally superhuman systems, which could be thwarted before they kill literally everyone (while still doing a lot of damage). Constraints like ‘accessing the relevant physical infrastructure’ might dominate the gains from greater cognitive efficiency.
- I also feel pretty confused about how much actual real-world power would be afforded to AIs in light of their highly advanced cognition (a relevant recent discussion), which further brings down my confidence in Premise 3.
- I’m also assuming that: conditioned on an AI instrumentally desiring to kill all humans, deceptive alignment is likely. I haven’t read posts like this one which might challenge that assumption. If I came to believe that deceptive alignment was highly unlikely, this could lower the probability of either Premise 2 or Premise 3.

Finally, I sometimes feel confused by the concept of ‘capabilities’ as it’s used in discussions about AGI. From Jenner and Treutlein’s response to Grace’s counterarguments:

“Assuming it is feasible, the question becomes: why will there be incentives to build increasingly capable AI systems? We think there is a straightforward argument that is essentially correct: some of the things we care about are very difficult to achieve, and we will want to build AI systems that can achieve them. At some point, the objectives we want AI systems to achieve will be more difficult than disempowering humanity, which is why we will build AI systems that are sufficiently capable to be dangerous if unaligned.”

Maybe one thing I’m thinking here is that “more difficult” is hard to parse. The AI systems might be able to achieve some narrower outcome that we desire, without being “capable” of destroying humanity. I think this is compatible with having systems which are superhumanly capable of pursuing some broadly-scoped goals, without being capable of pursuing all broadly-scoped goals.

Psychologists talk of ‘g’ because performance is correlated across tasks we intuitively think of as cognitive, and because that performance also correlates with some important life outcomes. I don’t know how well the unidimensional notion of intelligence will transfer to advanced AI systems. The fact that some AIs perform decently on IQ tests without being good at much else is at least some weak evidence against the generality of the more unidimensional ‘intelligence’ concept.
- However, I agree that there’s a well-defined sense in which we can say that AIs are more cognitively capable than all of humanity combined. I also think that my earlier point about expecting future systems to exhibit human-like inductive biases makes the argument in the bullet point above substantially weaker.
  - I still remain uneasy about the extent to which a unidimensional notion of ‘capabilities’ can feed into claims about takeoffs and takeover scenarios, and I’m currently unclear on whether this makes a practical difference.

(Also, I’m no doubt missing a bunch of relevant information here. But this is probably true for most people, and I think it’s good for people to share objections even if they’re missing important details.)
