It seems that if we can't make the basic versions of these tools well aligned with us, we won't have much luck with future, more advanced versions.
Therefore, all AI safety people should work on alignment and safety challenges with AI tools that currently have users (image generators, GPT, etc.).
Agree? Disagree?
Some researchers work on making real-world models more aligned, either at the cutting edge (as you suggest here) or on something smaller (if their research is easier to start on a smaller model).
Some researchers work on problems like Agent Foundations (roughly: what is the correct mathematical way to model agents, utility functions, and so on), and I assume they don't experiment with actual models (yet).
Some researchers are trying to make tools that will help other researchers.
And there are other directions.
You can see many of the agendas here:
(My understanding of) What Everyone in Technical Alignment is Doing and Why
I think this is one reasonable avenue to explore alignment, but I don't want everybody doing it.
My impression is that AI researchers sit on a spectrum from doing only empirical work (of the kind you describe) to doing only theoretical work (like Agent Foundations). Most fall somewhere in the middle, doing some theory to figure out what kind of experiment to run and using empirical data to improve their theories (a lot of science looks like this!).
I think it would be unwise for all (or even a majority of) AI safety researchers to move to doing empirical work on current AI systems, for two reasons:
The first one is the big one. I can imagine this approach working (perhaps inefficiently) in a world where (1) were false and (2) were true, but I can't imagine it working in any world where (1) holds.
Agree that some could. Since you brought it up, how would you align image generators? They're still dumb tools, so do you mean align the users? Add safety features? Stable Diffusion had a few safeguards put in place, but users can easily disable them. Now it's generating typical porn, as well as more dangerous or harmful things, I suspect, but only because people are using it that way, not because it does that on its own. So, do you want the Stable Diffusion source code removed from the web? I second the motion, lol.