David Krueger



"With the use of fine-tuning, and a bunch of careful engineering work, capabilities evaluations can be done reliably and robustly."  

I strongly disagree with this (and the title of the piece).  I've been having these arguments a lot recently, and I think these sorts of claims are emblematic of a dangerously narrow view of the problem of AI x-safety, which, I am disappointed to see, seems quite popular.
A few reasons why this statement is misleading: 
* New capabilities elicitation techniques arrive frequently and unpredictably (chain of thought, for example).
* The capabilities of a system could be much greater than any particular LLM involved in that system (think tool use and coding).  On the current trajectory, LLMs will increasingly be heavily integrated into complex socio-technical systems.  The outcomes are unpredictable, but it's likely such systems will exhibit capabilities significantly beyond what can be predicted from evaluations.

You can try to account for the fact that you're competing against the entire world's ingenuity by using your privileged access (e.g. for fine-tuning or white-box capabilities elicitation methods), but this is unlikely to provide sufficient coverage.

EtA: Understanding whether and to what extent the original claim is true is something that would likely require years of research at a minimum. 

I recently learned that in law, there is a breakdown as:

  • Intent (~=misuse)
  • Oblique Intent (i.e. a known side effect)
  • Recklessness (known chance of side effect)
  • Negligence (should've known chance of side effect)
  • Accident (couldn't have been expected to know)

This seems like a good categorization.

A cutting-edge algorithmic or architectural discovery coming out of China would be particularly interesting in this respect.

Kaiming He was at MSR in China when he invented ResNets in 2015.  Residual connections are part of transformers, and probably the 2nd most important architectural breakthrough in modern Deep Learning.

This very short book makes similar points and suggestions.  I found it to be a good read, and would recommend it:

"IS THAT CLEAR?: Effective communication in a multilingual world"

Thanks for writing this.  I continue to be deeply frustrated by the "accident vs. misuse" framing. 

In fact, one reason I am writing this comment is that I think this post itself endorses that framing to too great an extent.  For instance, I do not think it is appropriate to describe this simply as an accident:

 engineers disabled an emergency brake that they worried would cause the car to behave overly cautiously and look worse than competitor vehicles.

I have a hard time imagining that they didn't realize this would likely make the cars less safe; I would say they made a decision to prioritize 'looking good' over safety, perhaps rationalizing it by saying it wouldn't make much difference and/or that they didn't have a choice because their livelihoods were at risk (which perhaps they were).

Now that I've got the whinging out of the way, I say thank you again for writing it, and that I found the distinction between "AI risks with structural causes" and "‘Non-AI’ risks partly caused by AI" quite valuable, and I hope it will be widely adopted.

I think this idea is worth an orders-of-magnitude deeper investigation than what you've described.  Such investigations seem worth funding.

It's also worth noting that OP's quotation is somewhat selective; here I include the sub-bullets:

Within 5 years: EA funding decisions are made collectively 

  •  First set up experiments for a safe cause area with small funding pots that are distributed according to different collective decision-making mechanisms 
    • Subject matter experts are always used and weighed appropriately in this decision mechanism
  • Experiment in parallel with: randomly selected samples of EAs are to evaluate the decisions of one existing funding committee - existing decision-mechanisms are thus ‘passed through’ an accountability layer
  • All decision mechanisms have a deliberation phase (arguments are collected and weighed publicly) and a voting phase (majority voting, quadratic voting..) 
  • Depending on the cause area and the type of choice, either fewer (experts + randomised sample of EAs) or more people (any EA or beyond) will take part in the funding decision.

I strongly disagree with this response, and find it bizarre.  

I think assessing this post according to a limited number of possible theories of change is incorrect, as influence is often diffuse and hard to predict or measure.  

I agree with freedomandutility's description of this as an "isolated demand for [something like] rigor".

I'm curious to dig into this a bit more, and hear why you think these seem like fairy tales to you (I'm not saying that I disagree...).
I wonder if this comes down to different ideas of what "solve alignment" means (I see you put it in quotes...) 

1) Are you perhaps thinking that realistic "solutions to alignment" will carry a significant alignment tax?  Else why wouldn't ~everyone adopt alignment techniques (that align AI systems with their preferences/values)?

2) Another source of ambiguity: there are a lot of different things people mean by "alignment", including:
* AI is aligned with objectively correct values
* AI is aligned with a stakeholder and consistently pursues their interests
* AI does a particular task as intended/expected
Is it one of these in particular (or something else) that you have in mind here?