Rohin Shah

4070 karmaJoined May 2015


Hi, I'm Rohin Shah! I work as a Research Scientist on the technical AGI safety team at DeepMind. I completed my PhD at the Center for Human-Compatible AI at UC Berkeley, where I worked on building AI systems that can learn to assist a human user, even if they don't initially know what the user wants.

I'm particularly interested in big picture questions about artificial intelligence. What techniques will we use to build human-level AI systems? How will their deployment affect the world? What can we do to make this deployment go better? I write up summaries and thoughts about recent work tackling these questions in the Alignment Newsletter.

In the past, I ran the EA UC Berkeley and EA at the University of Washington groups.


Research Scientist and Research Engineer roles in AI Safety and Alignment at Google DeepMind.

Location: Hybrid (3 days/week in the office) in San Francisco / Mountain View / London.

Application deadline: We don't have a final deadline yet, but will keep the roles open for at least another two weeks (i.e. until March 1, 2024), and likely longer.

For further details, see the roles linked above. You may also find my FAQ useful.

(Fyi, I probably won't engage more here, due to not wanting to spend too much time on this)

Jonas's comment is a high level assessment that is only useful insofar as you trust his judgment.

This is true, but I trust basically any random commenter a non-zero amount (unless their comment itself gives me reasons not to trust them). I agree you can get more trust if you know the person better. But even the amount of trust for "literally a random person I've never heard of" would be enough for the evidence to matter to me.

I'm only saying that I think large updates based on Jonas's statement are a mistake for people who already know Owen was an EA leader in good standing for many years and had many highly placed friends.

SBF was an EA leader in good standing for many years and had many highly placed friends. It's pretty notable to me that there weren't many comments like Jonas's for SBF, while there are for Owen.

In contrast, lyra's comment contains a lot of details I can use to inform my own reasoning.

It seems so noisy to compare karma counts on two different counts. There are all sorts of things we could be failing to miss about why people voted the way they did. Maybe people are voting Jonas's comment up more because they liked that it went more out of its way to acknowledge that the past behavior was bad and that a temporary ban is good.

It seems like a mistake to treat karma as "the community's estimate of the evidence that the comment would provide to a new reader who knows that Owen was a leader in good standing but otherwise doesn't know anything about what's going on". I agree you'll find all sorts of ways that karma counts don't reflect that.

The evidence Jonas provides is equally consistent with “Owen has a flaw he has healed” and “Owen is a skilled manipulator who charms men, and harasses women”.

Surely there are a lot of other hypotheses as well, and Jonas's evidence is relevant to updating on those?

More broadly, I don't think there's any obvious systemic error going on here. Someone who knows the person reasonably well, giving a model for what the causes of the behavior were, that makes predictions about future instances, clearly seems like evidence one should take into account.

(I do agree the comment would be more compelling with more object-level details, but I don't think that makes it a systemic error to be happy with the comment that exists.)

Yeah, I don't think it's accurate to say that I see assistance games as mostly irrelevant to modern deep learning, and I especially don't think that it makes sense to cite my review of Human Compatible to support that claim.

The one quote that Daniel mentions about shifting the entire way we do AI is a paraphrase of something Stuart says, and is responding to the paradigm of writing down fixed, programmatic reward functions. And in fact, we have now changed that dramatically through the use of RLHF, for which a lot of early work was done at CHAI, so I think this reflects positively on Stuart.

I'll also note that in addition to the "Learning to Interactively Learn and Assist" paper that does CIRL with deep RL which Daniel cited above, I also wrote a paper with several CHAI colleagues that applied deep RL to solve assistance games.

My position is that you can roughly decompose the overall problem into two subproblems: (1) in theory, what should an AI system do? (2) Given a desire for what the AI system should do, how do we make it do that?

The formalization of assistance games is more about (1), saying that AI systems should behave more like assistants than like autonomous agents (basically the point of my paper linked above). These are mostly independent. Since deep learning is an answer to (2) while assistance games are an answer to (1), you can use deep learning to solve assistance games.

I'd also say that the current form factor of ChatGPT, Claude, Bard etc is very assistance-flavored, which seems like a clear success of prediction at least. On the other hand, it seems unlikely that CHAI's work on CIRL had much causal impact on this, so in hindsight it looks less useful to have done this research.

All this being said, I view (2) as the more pressing problem for alignment, and so I spend most of my time on that, which implies not working on assistance games as much any more. So I think it's overall reasonable to take me as mildly against work on assistance games (but not to take me as saying that it is irrelevant to modern deep learning).

Fyi, the list you linked doesn't contain most of what I would consider the "small" orgs in AI, e.g. off the top of my head I'd name ARC, Redwood Research, Conjecture, Ought, FAR AI, Aligned AI, Apart, Apollo, Epoch, Center for AI Safety, Bluedot, Ashgro, AI Safety Support and Orthogonal. (Some of these aren't even that small.) Those are the ones I'd be thinking about if I were to talk about merging orgs.

Maybe the non-AI parts of that list are more comprehensive, but my guess is that it's just missing most of the tiny orgs that OP is talking about (e.g. OP's own org, QURI, isn't on the list).

(EDIT: Tbc I'm really keen on actually doing the exercise of naming concrete examples -- great suggestion!)

:) I'm glad we got to agreement!

(Or at least significantly closer, I'm sure there are still some minor differences.)

On hits-based research: I certainly agree there are other factors to consider in making a funding decision. I'm just saying that you should talk about those directly instead of criticizing the OP for looking at whether their research was good or not.

(In your response to OP you talk about a positive case for the work on simulators, SVD, and sparse coding -- that's the sort of thing that I would want to see, so I'm glad to see that discussion starting.)

On VCs: Your position seems reasonable to me (though so does the OP's position).

On recommendations: Fwiw I also make unconditional recommendations in private. I don't think this is unusual, e.g. I think many people make unconditional recommendations not to go into academia (though I don't).

I don't really buy that the burden of proof should be much higher in public. Reversing the position, do you think the burden of proof should be very high for anyone to publicly recommend working at lab X? If not, what's the difference between a recommendation to work at org X vs an anti-recommendation (i.e. recommendation not to work at org X)? I think the three main considerations I'd point to are:

  1. (Pro-recommendations) It's rare for people to do things (relative to not doing things), so we differentially want recommendations vs anti-recommendations, so that it is easier for orgs to start up and do things.
  2. (Anti-recommendations) There are strong incentives to recommend working at org X (obviously org X itself will do this), but no incentives to make the opposite recommendation (and in fact usually anti-incentives). Similarly I expect that inaccuracies in the case for the not-working recommendation will be pointed out (by org X), whereas inaccuracies in the case for working will not be pointed out. So we differentially want to encourage the opposite recommendations in order to get both sides of the story by lowering our "burden of proof".
  3. (Pro-recommendations) Recommendations have a nice effect of getting people excited and positive about the work done by the community, which can make people more motivated, whereas the same is not true of anti-recommendations.

Overall I think point 2 feels most important, and so I end up thinking that the burden of proof on critiques / anti-recommendations should be lower than the burden of proof on recommendations -- and the burden of proof on recommendations is approximately zero. (E.g. if someone wrote a public post recommending Conjecture without any concrete details of why -- just something along the lines of "it's a great place doing great work" -- I don't think anyone would say that they were using their power irresponsibly.)

I would actually prefer a higher burden of proof on recommendations, but given the status quo if I'm only allowed to affect the burden of proof on anti-recommendations I'd probably want it to go down to ~zero. Certainly I'd want it to be well below the level that this post meets.

I'm not very compelled by this response.

It seems to me you have two points on the content of this critique. The first point:

I think it's bad to criticize labs that do hits-based research approaches for their early output (I also think this applies to your critique of Redwood) because the entire point is that you don't find a lot until you hit.

I'm pretty confused here. How exactly do you propose that funding decisions get made? If some random person says they are pursuing a hits-based approach to research, should EA funders be obligated to fund them?

Presumably you would want to say "the team will be good at hits-based research such that we can expect a future hit, for X, Y and Z reasons". I think you should actually say those X, Y and Z reasons so that the authors of the critique can engage with them; I assume that the authors are implicitly endorsing a claim like "there aren't any particularly strong reasons to expect Conjecture to do more impactful work in the future".

The second point:

Your statements about the VCs seem unjustified to me. How do you know they are not aligned? [...] I haven't talked to the VCs either, but I've at least asked people who work(ed) at Conjecture.

Hmm, it seems extremely reasonable to me to take as a baseline prior that the VCs are profit-motivated, and the authors explicitly say

We have heard credible complaints of this from their interactions with funders. One experienced technical AI safety researcher recalled Connor saying that he will tell investors that they are very interested in making products, whereas the predominant focus of the company is on AI safety.

The fact that people who work(ed) at Conjecture say otherwise means that (probably) someone is wrong, but I don't see a strong reason to believe that it's the OP who is wrong.

At the meta level you say:

I do not understand where the confidence with which you write the post (or at least how I read it) comes from.

And in your next comment:

I think we should really make sure that we say true things when we criticize people, quantify our uncertainty, differentiate between facts and feelings and do not throw our epistemics out of the window in the process

But afaict, the only point where you actually disagree with a claim made in the OP (excluding recommendations) is in your assessment of VCs? (And in that case I feel very uncompelled by your argument.)

In what way has the OP failed to say true things? Where should they have had more uncertainty? What things did they present as facts which were actually feelings? What claim have they been confident about that they shouldn't have been confident about?

(Perhaps you mean to say that the recommendations are overconfident. There I think I just disagree with you about the bar for evidence for making recommendations, including ones as strong as "alignment researchers shouldn't work at organization X". I've given recommendations like this to individual people who asked me for a recommendation in the past, on less evidence than collected in this post.)

Wait, you think the reason we can't do brain improvement is because we can't change the weights of individual neurons?

That seems wrong to me. I think it's because we don't know how the neurons work.

Did you read the link to Cold Takes above? If so, where do you disagree with it?

(I agree that we'd be able to do even better if we knew how the neurons work.)

Similarly I'd be surprised if you thought that beings as intelligent as humans could recursively improve NNs. Cos currently we can't do that, right?

Humans can improve NNs? That's what AI capabilities research is?

(It's not "recursive" improvement but I assume you don't care about the "recursive" part here.)

I think it's within the power of beings equally as intelligent as us (similarly as mentioned above I think recursive improvement in humans would accelerate if we had similar abilities).

Load more