Will Aldred


Inspired by the last section of this post (and by a later comment from Mjreard), I thought it’d be fun—and maybe helpful—to taxonomize the ways in which mission or value drift can arise out of the instrumental goal of pursuing influence/reach/status/allies:

Epistemic status: caricaturing things somewhat

Never turning back the wheel

In this failure mode, you never lose sight of how x-risk reduction is your terminal goal. However, in your two-step plan of ‘gain influence, then deploy that influence to reduce x-risk,’ you wait too long to move on to step two, and never get around to actually reducing x-risk. There is always more influence to acquire, and you can never be sure that ASI is only a couple of years away, so you never get around to saying, ‘Okay, time to shelve this influence-seeking and refocus on reducing x-risk.’ What in retrospect becomes known as crunch time comes and goes, and you lose your window of opportunity to put your influence to good use.

Classic murder-Gandhi

Scott Alexander (2012) tells the tale of murder-Gandhi:

Previously on Less Wrong’s The Adventures of Murder-Gandhi: Gandhi is offered a pill that will turn him into an unstoppable murderer. He refuses to take it, because in his current incarnation as a pacifist, he doesn’t want others to die, and he knows that would be a consequence of taking the pill. Even if we offered him $1 million to take the pill, his abhorrence of violence would lead him to refuse.

But suppose we offered Gandhi $1 million to take a different pill: one which would decrease his reluctance to murder by 1%. This sounds like a pretty good deal. Even a person with 1% less reluctance to murder than Gandhi is still pretty pacifist and not likely to go killing anybody. And he could donate the money to his favorite charity and perhaps save some lives. Gandhi accepts the offer.

Now we iterate the process: every time Gandhi takes the 1%-more-likely-to-murder-pill, we offer him another $1 million to take the same pill again.

Maybe original Gandhi, upon sober contemplation, would decide to accept $5 million to become 5% less reluctant to murder. Maybe 95% of his original pacifism is the only level at which he can be absolutely sure that he will still pursue his pacifist ideals.

Unfortunately, original Gandhi isn’t the one making the choice of whether or not to take the 6th pill. 95%-Gandhi is. And 95% Gandhi doesn’t care quite as much about pacifism as original Gandhi did. He still doesn’t want to become a murderer, but it wouldn’t be a disaster if he were just 90% as reluctant as original Gandhi, that stuck-up goody-goody.

What if there were a general principle that each Gandhi was comfortable with Gandhis 5% more murderous than himself, but no more? Original Gandhi would start taking the pills, hoping to get down to 95%, but 95%-Gandhi would start taking five more, hoping to get down to 90%, and so on until he’s rampaging through the streets of Delhi, killing everything in sight.

The parallel here is that you can ‘take the pill’ to gain some influence, at the cost of focusing a bit less on x-risk. Unfortunately, like Gandhi, once you start taking pills, you can’t stop—your values change and you care less and less about x-risk until you’ve slid all the way down the slope.

It could be your personal values that change: as you spend more time gaining influence amongst policy folks (say), you start to genuinely believe that unemployment is as important as x-risk, and that beating China is the ultimate goal.

Or, it could be your organisation’s values that change: You hire some folks for their expertise and connections outside of EA. These new hires affect your org’s culture. The effect is only slight, at first, but a couple of positive feedback cycles go by (wherein, e.g., your most x-risk-focused staff notice the shift, don’t like it, and leave). Before you know it, your org has gained the reach to impact x-risk, but lost the inclination to do so, and you don’t have enough control to change things back.

Social status misgeneralization

You and I, as humans, are hardwired to care about status. We often behave in ways that are about gaining status, whether we admit this to ourselves consciously or not. Fortunately, when surrounded by EAs, pursuing status is a great proxy for reducing x-risk: it is high status in EA to be a frugal, principled, scout mindset-ish x-risk reducer.

Unfortunately, now that we’re expanding our reach, our social circles don’t offer the same proxy. Now, pursuing status means making big, prestigious-looking moves in the world (and making big moves in AI means building better products or addressing hot-button issues, like discrimination). It is not high status in the wider world to be an x-risk reducer, and so we stop being x-risk reducers.


I have no real idea which of these failure modes is most common, although I speculate that it’s the last one. (I’d be keen to hear others’ takes.) Also, to be clear, I don’t believe the correct solution is to ‘stay small’ and avoid interfacing with the wider world. However, I do believe that these failure modes are easier to fall into than one might naively expect, and I hope that a better awareness of them might help us circumvent them.

For what it’s worth, I find some of what’s said in this thread quite surprising.

Reading your post, I saw you describing two dynamics:

  1. Principles-first EA initiatives are being replaced by AI safety initiatives
  2. AI safety initiatives founded by EAs, which one would naively expect to remain x-risk focused, are becoming safety-washed (e.g., your BlueDot example)

I understood @Ozzie’s first comment on funding to be about 1. But then your subsequent discussion with Ozzie seems to also point to funding as explaining 2.[1]

While Open Phil has opinions within AI safety that have alienated some EAs—e.g., heavy emphasis on pure ML work[2]—my impression was that they are very much motivated by ‘real,’ x-risk-focused AI safety concerns, rather than things like discrimination and copyright infringement. But it sounds like you might actually think that OP-funded AI safety orgs are feeling pressure from OP to be less about x-risk? If so, this is a major update for me, and one that fills me with pessimism.

  1. ^

    For example, you say, “[OP-funded orgs] bow to incentives to be the very-most-shining star by OP’s standard, so they can scale up and get more funding. I would just make the trade off the other way: be smaller and more focused on things that matter.”

  2. ^

    At the expense of, e.g., more philosophical approaches

Hmm, I’m not a fan of this Claude summary (though I appreciate your trying). Below, I’ve made a (play)list of Habryka’s greatest hits,[1] ordered by theme,[2][3] which might be another way for readers to get up to speed on his main points:

Leadership

Reputation[5]

Funding

Impact

  1. ^

    ‘Greatest hits’ may be a bit misleading. I’m only including comments from the post-FTX era, and then, only comments that touch on the core themes. (This is one example of a great comment I haven’t included.)

  2. ^

    although the themes overlap a fair amount

  3. ^

    My ordering is quite different to the karma ordering given on the GreaterWrong page Habryka links to. I think mine does a better job of concisely covering Habryka’s beliefs on the key topics. But I’d be happy to take my list down if @Habryka disagrees (just DM me).

  4. ^

    For context, Zachary is CEA’s CEO.

  5. ^

Epistemic status: strong opinions, lightly held

I remember a time when an org was criticized, and a board member commented defending the org. But the board member was factually wrong about at least one claim, and the org then needed to walk back wrong information. It would have been clearer and less embarrassing for everyone if they’d all waited a day or two to get on the same page and write a response with the correct facts.

I guess it depends on the specifics of the situation, but, to me, the case described, of a board member making one or two incorrect claims (in a comment that presumably also had a bunch of accurate and helpful content) that they needed to walk back sounds… not that bad? Like, it seems only marginally worse than their comment being fully accurate the first time round, and far better than them never writing a comment at all. (I guess the exception to this is if the incorrect claims had legal ramifications that couldn’t be undone. But I don’t think that’s true of the case you refer to?)

A downside is that if an organization isn’t prioritizing back-and-forth with the community, of course there will be more mystery and more speculations that are inaccurate but go uncorrected. That’s frustrating, but it’s a standard way that many organizations operate, both in EA and in other spaces.

I don’t think the fact that this is a standard way for orgs to act in the wider world says much about whether this should be the way EA orgs act. In the wider world, an org’s purpose is to make money for its shareholders: the org has no ‘teammates’ outside of itself; no-one really expects the org to try hard to communicate what it is doing (except where communicating well is tied to profit); no-one really expects the org to care about negative externalities. Moreover, withholding information can often give an org a competitive advantage over rivals.

Within the EA community, however, there is a shared sense that we are all on the same team (I hope): there is a reasonable expectation for cooperation; there is a reasonable expectation that orgs will take into account externalities on the community when deciding how to act. For example, if communicating some aspect of EA org X’s strategy would take half a day of staff time, I would hope that the relevant decision-maker at org X takes into account not only the cost/benefit to org X of whether or not to communicate, but also the cost/benefit to the wider community. If half a day of staff time helps others in the community better understand org X’s thinking,[1] such that, in expectation, more than half a day of (quality-adjusted) productive time is saved (through, e.g., community members making better decisions about what to work on), then I would hope that org X chooses to communicate.

When I see public comments about the inner workings of an organization by people who don’t work there, I often also hear other people who know more about the org privately say “That’s not true.” But they have other things to do with their workday than write a correction to a comment on the Forum or LessWrong, get it checked by their org’s communications staff, and then follow whatever discussion comes from it.

I would personally feel a lot better about a community where employees aren’t policed by their org on what they can and cannot say. (This point has been debated before—see saulius and Habryka vs. the Rethink Priorities leadership.) I think such policing leads to chilling effects that make everyone in the community less sane and less able to form accurate models of the world. Going back to your example, if there was no requirement on someone to get their EAF/LW comment checked by their org’s communications staff, then that would significantly lower the time and effort barrier to publishing such comments, and then the whole argument around such comments being too time-consuming to publish becomes much weaker.


All this to say: I think you’re directionally correct with your closing bullet points. I think it’s good to remind people of alternative hypotheses. However, I push back on the notion that we must just accept the current situation (in which at least one major EA org has very little back-and-forth with the community)[2]. I believe that with better norms, we wouldn’t have to put as much weight on bullets 2 and 3, and we’d all be stronger for it.

  1. ^

    Or, rather, what staff at org X are thinking. (I don’t think an org itself can meaningfully have beliefs: people have beliefs.)

  2. ^

    Note: Although I mentioned Rethink Priorities earlier, I’m not thinking about Rethink Priorities here.

the actions he [SBF] was convicted of are nearly universally condemned by the EA community

I don’t think that observing lots of condemnation and little support is all that much evidence for the premise you take as given—that SBF’s actions were near-universally condemned by the EA community—compared to meaningfully different hypotheses like “50% of EAs condemned SBF’s actions.”

There was, and still is, a strong incentive to hide any opinion other than condemnation (e.g., support, genuine uncertainty) over SBF’s fraud-for-good ideology, out of legitimate fear of becoming a witch-hunt victim. By the law of prevalence, I therefore expect the number of EAs who don’t fully condemn SBF’s actions to be far greater than the number who publicly express opinions other than full condemnation.

(Note: I’m focusing on the morality of SBF’s actions, and not on executional incompetence.)

Anecdotally, of the EAs I’ve spoken to about the FTX collapse with whom I’m close—and who therefore have less incentive to hide what they truly believe from me—I’d say that between a third and a half fall into the genuinely uncertain camp (on the moral question of fraud for good causes), while the number in the support camp is small but not zero.[1]

  1. ^

    And of those in my sample in the condemn camp, by far the most commonly cited reason is timeless decision theory / pre-committing to cooperative actions, which I don’t think is the kind of reason one jumps to when one hears that EAs condemn fraud-for-good-type thinking.

Importance of the digital minds stuff compared to regular AI safety; how many early-career EAs should be going into this niche? What needs to happen between now and the arrival of digital minds? In other words, what kind of a plan does Carl have in mind for making the arrival go well? Also, since Carl clearly has well-developed takes on moral status: what criteria does he think could determine whether an AI system deserves moral status, and to what extent?

Additionally—and this one's fueled more by personal curiosity than by impact—Carl's beliefs on consciousness. Like Wei Dai, I find the case for anti-realism as the answer to the problem of consciousness weak, yet this is Carl's position (according to this old Brian Tomasik post, at least), and so I'd be very interested to hear Carl explain his view.

Thank you for engaging. I don’t disagree with what you’ve written; I think you have interpreted me as implying something stronger than what I intended, and so I’ll now attempt to add some colour.

That Emily and other relevant people at OP have not fully adopted Rethink’s moral weights does not puzzle me. As you say, to expect that is to apply an unreasonably high funding bar. I am, however, puzzled that Emily and co. appear to have not updated at all towards Rethink’s numbers. At least, that’s the way I read:

  • We don’t use Rethink’s moral weights.
    • Our current moral weights, based in part on Luke Muehlhauser’s past work, are lower. We may update them in the future; if we do, we’ll consider work from many sources, including the arguments made in this post.

If OP has not updated at all towards Rethink’s numbers, then I see three possible explanations, all of which I find unlikely, hence my puzzlement. First possibility: the relevant people at OP have not yet given the Rethink report a thorough read, and have therefore not updated. Second: the relevant OP people have read the Rethink report, and have updated their internal models, but have not yet gotten around to updating OP’s actual grantmaking allocation. Third: OP believes the Rethink work is low quality or otherwise critically corrupted by one or more errors. I’d be very surprised if one or two are true, given how moral weight is arguably the most important consideration in neartermist grantmaking allocation. I’d also be surprised if three is true, given how well Rethink’s moral weight sequence has been received on this forum (see, e.g., comments here and here).[1] OP people may disagree with Rethink’s approach at the independent impression level, but surely, given Rethink’s moral weights work is the most extensive work done on this topic by anyone(?), the Rethink results should be given substantial weight—or at least non-trivial weight—in their all-things-considered views?

(If OP people believe there are errors in the Rethink work that render the results ~useless, then, considering the topic’s importance, I think some sort of OP write-up would be well worth the time. Both at the object level, so that future moral weight researchers can avoid making similar mistakes, and to allow the community to hold OP’s reasoning to a high standard, and also at the meta level, so that potential donors can update appropriately re. Rethink’s general quality of work.)

Additionally, and this is less important, I’m puzzled at the meta level at the way we’ve arrived here. As noted in the top-level post, Open Phil has been less than wholly open about its grantmaking, and it’s taken a pretty not-on-the-default-path sequence of events to surface the fact that OP does not give Rethink’s moral weights any weight: Ariel, someone who’s not affiliated with OP and who doesn’t work on animal welfare for their day job, writing this big post; Emily from OP replying to the post and to a couple of the comments; and me, a Forum-goer who doesn’t work on animal welfare, spotting an inconsistency in Emily’s replies.

  1. ^

    Edited to add: Carl has left a detailed reply below, and it seems that three is, in fact, what has happened.

Here, you say, “Several of the grants we’ve made to Rethink Priorities funded research related to moral weights.” Yet in your initial response, you said, “We don’t use Rethink’s moral weights.” I respect your tapping out of this discussion, but at the same time I’d like to express my puzzlement as to why Open Phil would fund work on moral weights to inform grantmaking allocation, and then not take that work into account.

The "EA movement", however you define it, doesn't get to control the money and there are good reasons for this.

I disagree, for the same reasons as those given in the critique of the post you cite. Tl;dr: Trades have happened in EA, where many people have cast aside careers with high earning potential in order to pursue direct work. I think these people should get a say over where EA money goes.

Directionally, I agree with your points. On the last one, I’ll note that counting person-years (or animal-years) falls naturally out of empty individualism as well as open individualism, and so the point goes through under the (substantively) weaker claim of “either open or empty individualism is true”.[1]

(You may be interested in David Pearce’s take on closed, empty, and open individualism.)

  1. ^

    For the casual reader: The three candidate theories of personal identity are empty, open, and closed individualism. Closed is the common sense view, but most people who have thought seriously about personal identity—e.g., Parfit—have concluded that it must be false (tl;dr: because nothing, not memory in particular, can “carry” identity in the way that's needed for closed individualism to make sense). Of the remaining two candidates, open appears to be the fringe view—supporters include Kolak, Johnson, Vinding, and Gomez-Emilsson (although Kolak's response to Cornwall makes it unclear to what extent he is indeed a supporter). Proponents of (what we now call) empty individualism include Parfit, Nozick, Shoemaker, and Hume.
